pray-calc-ml/notebooks/01_exploratory_analysis.ipynb
Aric Camarata 6e0f4a679c Rebuild as Python data science project
Replaces the original JS calibration library with a pure Python pipeline
for collecting and back-calculating solar depression angles from human-verified
Fajr and Isha prayer sightings.

What this does:
- src/pipeline.py: master pipeline; fetches iCal + manual records, back-calculates
  angles via PyEphem, applies quality filters, exports two clean CSVs
- src/collect/openfajr.py: parses the OpenFajr Birmingham iCal feed (~4,018 records)
- src/collect/verified_sightings.py: manually compiled records from peer-reviewed
  studies (Egypt, Saudi Arabia, Malaysia, Indonesia, UK, USA, Canada, and more)
- src/angle_calc.py: PyEphem back-calculation with atmospheric refraction
- src/elevation.py: Open-Elevation API batch lookup

Datasets generated:
- data/processed/fajr_angles.csv: 4,105 confirmed Fajr records, 35 locations,
  latitude range -37.8 to 53.7 degrees, date range 1985-2026
- data/processed/isha_angles.csv: 43 confirmed Isha records, 20+ locations

Also includes:
- notebooks/01_exploratory_analysis.ipynb: latitude, TOY, elevation pattern analysis
- research/: academic paper summaries (not training data)
- data/raw/sources.md: full citation table for all data sources
2026-02-25 19:32:47 -05:00
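
The back-calculation described in the commit message recovers a solar depression angle from an observed time and place. The sketch below is a geometry-only illustration of that quantity and is not the project's PyEphem code: it uses the standard altitude formula from spherical astronomy and ignores atmospheric refraction and observer elevation. The function name `solar_depression_deg` and its arguments are hypothetical.

```python
import math


def solar_depression_deg(lat_deg: float, decl_deg: float, hour_angle_deg: float) -> float:
    """Depression of the sun's centre below the geometric horizon, in degrees.

    Uses sin(alt) = sin(lat)*sin(decl) + cos(lat)*cos(decl)*cos(H).
    Refraction and observer elevation are ignored (assumption of this sketch).
    """
    lat = math.radians(lat_deg)
    decl = math.radians(decl_deg)
    hour_angle = math.radians(hour_angle_deg)
    sin_alt = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # Clamp against floating-point drift before taking asin
    sin_alt = max(-1.0, min(1.0, sin_alt))
    return -math.degrees(math.asin(sin_alt))


# Equator at an equinox (declination 0°), 108° of hour angle away from local noon
print(round(solar_depression_deg(0.0, 0.0, 108.0), 2))
```

A positive result means the sun is below the horizon; a real pipeline would still need the sun's declination and hour angle for the observation timestamp, which is what an ephemeris library provides.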


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fajr and Isha Angle: Exploratory Analysis\n",
"\n",
"This notebook explores the compiled datasets of verified human sightings to find patterns\n",
"in how the solar depression angle at Fajr and Isha varies with:\n",
"\n",
"- **Latitude**: distance from the equator\n",
"- **Time of Year (TOY)**: day of year, capturing seasonality\n",
"- **Elevation**: metres above sea level\n",
"\n",
"Run the pipeline first:\n",
"```bash\n",
"python -m src.pipeline --no-elevation-lookup\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as ticker\n",
"from pathlib import Path\n",
"\n",
"ROOT = Path.cwd().parent\n",
"\n",
"fajr = pd.read_csv(ROOT / 'data/processed/fajr_angles.csv', parse_dates=['utc_dt'])\n",
"isha = pd.read_csv(ROOT / 'data/processed/isha_angles.csv', parse_dates=['utc_dt'])\n",
"\n",
"print(f'Fajr records: {len(fajr)}')\n",
"print(f'Isha records: {len(isha)}')\n",
"print(f'Fajr latitude range: {fajr[\"lat\"].min():.1f}° to {fajr[\"lat\"].max():.1f}°')\n",
"print(f'Fajr date range: {fajr[\"date\"].min()} to {fajr[\"date\"].max()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Angle Distribution Overview"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
"\n",
"axes[0].hist(fajr['fajr_angle'], bins=60, color='steelblue', alpha=0.8, edgecolor='white')\n",
"axes[0].axvline(fajr['fajr_angle'].mean(), color='red', linestyle='--', label=f'Mean {fajr[\"fajr_angle\"].mean():.2f}°')\n",
"axes[0].axvline(fajr['fajr_angle'].median(), color='orange', linestyle='--', label=f'Median {fajr[\"fajr_angle\"].median():.2f}°')\n",
"axes[0].set_xlabel('Solar Depression Angle (°)')\n",
"axes[0].set_ylabel('Count')\n",
"axes[0].set_title(f'Fajr Angle Distribution (n={len(fajr):,})')\n",
"axes[0].legend()\n",
"\n",
"if len(isha) > 0:\n",
"    axes[1].hist(isha['isha_angle'], bins=20, color='darkorange', alpha=0.8, edgecolor='white')\n",
"    axes[1].axvline(isha['isha_angle'].mean(), color='red', linestyle='--', label=f'Mean {isha[\"isha_angle\"].mean():.2f}°')\n",
"    axes[1].set_xlabel('Solar Depression Angle (°)')\n",
"    axes[1].set_ylabel('Count')\n",
"    axes[1].set_title(f'Isha Angle Distribution (n={len(isha):,})')\n",
"    axes[1].legend()\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/angle_distribution.png', dpi=150, bbox_inches='tight')\n",
"plt.show()\n",
"\n",
"print('\\nFajr angle percentiles:')\n",
"print(fajr['fajr_angle'].describe().to_string())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Latitude vs Fajr Angle"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(14, 6))\n",
"\n",
"# Plot Birmingham (dense OpenFajr feed) separately from the other sites (fewer records, more geographic variety)\n",
"bham = fajr[fajr['lat'].between(52.4, 52.5)]\n",
"other = fajr[~fajr['lat'].between(52.4, 52.5)]\n",
"\n",
"ax.scatter(bham['lat'], bham['fajr_angle'], alpha=0.1, s=8, color='steelblue', label=f'Birmingham OpenFajr (n={len(bham):,})')\n",
"ax.scatter(other['lat'], other['fajr_angle'], alpha=0.8, s=40, color='red', zorder=5, label=f'Other locations (n={len(other):,})')\n",
"\n",
"# Mean by latitude band\n",
"fajr['lat_band'] = (fajr['lat'] / 5).round() * 5 # round to nearest 5°\n",
"band_means = fajr.groupby('lat_band')['fajr_angle'].mean()\n",
"ax.plot(band_means.index, band_means.values, 'k--', linewidth=2, label='Band mean (5° bins)')\n",
"\n",
"ax.set_xlabel('Latitude (°)')\n",
"ax.set_ylabel('Fajr Depression Angle (°)')\n",
"ax.set_title('Fajr Angle vs Latitude')\n",
"ax.axhline(fajr['fajr_angle'].mean(), color='gray', linestyle=':', alpha=0.5, label='Overall mean')\n",
"ax.grid(True, alpha=0.3)\n",
"ax.legend()  # called last so the overall-mean line appears in the legend\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/fajr_vs_latitude.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Seasonality (Day of Year) vs Fajr Angle — Birmingham"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Birmingham has 4,000+ records, ideal for TOY analysis\n",
"bham = fajr[fajr['lat'].between(52.4, 52.5)].copy()\n",
"\n",
"fig, axes = plt.subplots(2, 1, figsize=(14, 10))\n",
"\n",
"# Raw scatter\n",
"axes[0].scatter(bham['day_of_year'], bham['fajr_angle'], alpha=0.3, s=5, color='steelblue')\n",
"axes[0].set_xlabel('Day of Year')\n",
"axes[0].set_ylabel('Fajr Depression Angle (°)')\n",
"axes[0].set_title('Birmingham Fajr Angle vs Day of Year (raw)')\n",
"axes[0].set_xticks([1, 60, 121, 182, 244, 305, 365])\n",
"axes[0].set_xticklabels(['Jan', 'Mar', 'May', 'Jul', 'Sep', 'Nov', 'Dec'])\n",
"axes[0].grid(True, alpha=0.3)\n",
"\n",
"# Rolling mean (30-day window): average per day of year first, so the window spans days rather than rows\n",
"bham_sorted = bham.sort_values('day_of_year')\n",
"daily_mean = bham_sorted.groupby('day_of_year')['fajr_angle'].mean()\n",
"rolling_mean = daily_mean.rolling(window=30, center=True).mean()\n",
"axes[1].scatter(bham_sorted['day_of_year'], bham_sorted['fajr_angle'], alpha=0.15, s=5, color='steelblue')\n",
"axes[1].plot(rolling_mean.index, rolling_mean.values, 'r-', linewidth=2, label='30-day rolling mean')\n",
"axes[1].axhline(bham['fajr_angle'].mean(), color='gray', linestyle='--', label=f'Overall mean {bham[\"fajr_angle\"].mean():.2f}°')\n",
"axes[1].set_xlabel('Day of Year')\n",
"axes[1].set_ylabel('Fajr Depression Angle (°)')\n",
"axes[1].set_title('Birmingham Fajr Angle vs Day of Year (smoothed)')\n",
"axes[1].set_xticks([1, 60, 121, 182, 244, 305, 365])\n",
"axes[1].set_xticklabels(['Jan', 'Mar', 'May', 'Jul', 'Sep', 'Nov', 'Dec'])\n",
"axes[1].legend()\n",
"axes[1].grid(True, alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/birmingham_seasonality.png', dpi=150, bbox_inches='tight')\n",
"plt.show()\n",
"\n",
"# Stats by season (days 1-80 and 356-366 are both counted as winter)\n",
"bham['season'] = pd.cut(bham['day_of_year'],\n",
"                        bins=[0, 80, 172, 266, 355, 366],\n",
"                        labels=['Winter', 'Spring', 'Summer', 'Autumn', 'Winter'],\n",
"                        ordered=False)  # duplicate labels require ordered=False (pandas >= 1.1)\n",
"print('Birmingham Fajr angle by season:')\n",
"print(bham.groupby('season')['fajr_angle'].describe().to_string())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Latitude × Season Interaction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For non-Birmingham locations with per-season data\n",
"other = fajr[~fajr['lat'].between(52.4, 52.5)].copy()\n",
"\n",
"fig, ax = plt.subplots(figsize=(14, 6))\n",
"\n",
"scatter = ax.scatter(other['day_of_year'], other['fajr_angle'],\n",
" c=other['lat'], cmap='RdYlBu', s=80, alpha=0.8,\n",
" vmin=-40, vmax=55)\n",
"\n",
"cbar = plt.colorbar(scatter, ax=ax)\n",
"cbar.set_label('Latitude (°)')\n",
"ax.set_xlabel('Day of Year')\n",
"ax.set_ylabel('Fajr Depression Angle (°)')\n",
"ax.set_title('Fajr Angle vs Season, colored by Latitude')\n",
"ax.set_xticks([1, 60, 121, 182, 244, 305, 365])\n",
"ax.set_xticklabels(['Jan', 'Mar', 'May', 'Jul', 'Sep', 'Nov', 'Dec'])\n",
"ax.grid(True, alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/lat_season_interaction.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Elevation vs Fajr Angle"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Elevation effect — compare sites with different elevations at similar latitudes\n",
"other = fajr[~fajr['lat'].between(52.4, 52.5)].copy()\n",
"\n",
"fig, ax = plt.subplots(figsize=(10, 6))\n",
"\n",
"scatter = ax.scatter(other['elevation_m'], other['fajr_angle'],\n",
" c=other['lat'].abs(), cmap='viridis', s=80, alpha=0.8)\n",
"\n",
"cbar = plt.colorbar(scatter, ax=ax)\n",
"cbar.set_label('|Latitude| (°)')\n",
"ax.set_xlabel('Elevation (m)')\n",
"ax.set_ylabel('Fajr Depression Angle (°)')\n",
"ax.set_title('Fajr Angle vs Elevation')\n",
"ax.grid(True, alpha=0.3)\n",
"\n",
"# Correlation\n",
"corr = other[['elevation_m', 'fajr_angle']].corr().iloc[0, 1]\n",
"ax.text(0.05, 0.95, f'Pearson r = {corr:.3f}', transform=ax.transAxes,\n",
" fontsize=12, verticalalignment='top')\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/elevation_effect.png', dpi=150, bbox_inches='tight')\n",
"plt.show()\n",
"\n",
"print(f'Elevation vs Fajr angle correlation: {corr:.3f}')\n",
"\n",
"# Key elevation comparisons\n",
"print('\\nHigh-elevation sites (>500m):')\n",
"high_elev = other[other['elevation_m'] > 500].groupby(['lat', 'elevation_m'])['fajr_angle'].mean()\n",
"print(high_elev.to_string())\n",
"\n",
"print('\\nLow-elevation sites (<50m):')\n",
"low_elev = other[other['elevation_m'] < 50].groupby(['lat', 'elevation_m'])['fajr_angle'].mean()\n",
"print(low_elev.to_string())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Geographic Coverage Map"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Site coverage summary\n",
"all_data = pd.concat([\n",
" fajr[['lat', 'lng', 'elevation_m', 'source']].assign(prayer='fajr'),\n",
" isha[['lat', 'lng', 'elevation_m', 'source']].assign(prayer='isha'),\n",
"])\n",
"\n",
"sites = all_data.groupby(['lat', 'lng', 'elevation_m']).agg(\n",
" n_fajr=('prayer', lambda x: (x == 'fajr').sum()),\n",
" n_isha=('prayer', lambda x: (x == 'isha').sum()),\n",
").reset_index()\n",
"\n",
"print(f'Unique observation sites: {len(sites)}')\n",
"print(f'Latitude range: {sites[\"lat\"].min():.2f}° to {sites[\"lat\"].max():.2f}°')\n",
"print()\n",
"print('Sites with most records:')\n",
"print(sites.sort_values('n_fajr', ascending=False).head(10).to_string())\n",
"\n",
"fig, ax = plt.subplots(figsize=(16, 8))\n",
"sc = ax.scatter(sites['lng'], sites['lat'],\n",
" s=np.sqrt(sites['n_fajr'] + sites['n_isha']) * 8 + 20,\n",
" c=sites['lat'], cmap='RdYlBu', alpha=0.8, edgecolors='black', linewidth=0.5)\n",
"cbar = plt.colorbar(sc, ax=ax)\n",
"cbar.set_label('Latitude (°)')\n",
"ax.set_xlabel('Longitude (°)')\n",
"ax.set_ylabel('Latitude (°)')\n",
"ax.set_title('Observation Sites (bubble size = record count)')\n",
"ax.axhline(0, color='gray', linestyle='--', alpha=0.5, linewidth=0.8)\n",
"ax.grid(True, alpha=0.3)\n",
"ax.set_xlim(-100, 185)\n",
"ax.set_ylim(-45, 65)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/site_map.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Linear Regression: Fajr Angle ~ f(lat, day_of_year, elevation)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import r2_score, mean_absolute_error\n",
"\n",
"# Use all data\n",
"features = ['lat', 'day_of_year', 'elevation_m']\n",
"X = fajr[features].copy()\n",
"\n",
"# Add |lat|, lat squared, and annual + semi-annual harmonics of day of year for non-linearity\n",
"X['lat_abs'] = fajr['lat'].abs()\n",
"X['lat_sq'] = fajr['lat'] ** 2\n",
"X['doy_sin'] = np.sin(2 * np.pi * fajr['day_of_year'] / 365.25)\n",
"X['doy_cos'] = np.cos(2 * np.pi * fajr['day_of_year'] / 365.25)\n",
"X['doy_sin2'] = np.sin(4 * np.pi * fajr['day_of_year'] / 365.25)\n",
"X['doy_cos2'] = np.cos(4 * np.pi * fajr['day_of_year'] / 365.25)\n",
"\n",
"y = fajr['fajr_angle']\n",
"\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_scaled, y)\n",
"y_pred = model.predict(X_scaled)\n",
"\n",
"# In-sample metrics: the model is scored on the same rows it was fitted on\n",
"print(f'R² = {r2_score(y, y_pred):.4f}')\n",
"print(f'MAE = {mean_absolute_error(y, y_pred):.4f}°')\n",
"print()\n",
"print('Feature coefficients:')\n",
"for feat, coef in zip(X.columns, model.coef_):\n",
"    print(f' {feat:15s}: {coef:.4f}')\n",
"\n",
"# Residuals\n",
"residuals = y - y_pred\n",
"print('\\nResidual stats:')\n",
"print(residuals.describe().to_string())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Residual plot\n",
"fig, axes = plt.subplots(1, 3, figsize=(16, 5))\n",
"\n",
"axes[0].scatter(fajr['lat'], residuals, alpha=0.1, s=5)\n",
"axes[0].axhline(0, color='red', linestyle='--')\n",
"axes[0].set_xlabel('Latitude')\n",
"axes[0].set_ylabel('Residual (°)')\n",
"axes[0].set_title('Residuals vs Latitude')\n",
"\n",
"axes[1].scatter(fajr['day_of_year'], residuals, alpha=0.1, s=5)\n",
"axes[1].axhline(0, color='red', linestyle='--')\n",
"axes[1].set_xlabel('Day of Year')\n",
"axes[1].set_ylabel('Residual (°)')\n",
"axes[1].set_title('Residuals vs Day of Year')\n",
"\n",
"axes[2].scatter(fajr['elevation_m'], residuals, alpha=0.3, s=20)\n",
"axes[2].axhline(0, color='red', linestyle='--')\n",
"axes[2].set_xlabel('Elevation (m)')\n",
"axes[2].set_ylabel('Residual (°)')\n",
"axes[2].set_title('Residuals vs Elevation')\n",
"\n",
"plt.suptitle('Linear Regression Residuals')\n",
"plt.tight_layout()\n",
"plt.savefig(ROOT / 'data/processed/regression_residuals.png', dpi=150, bbox_inches='tight')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Isha Angle Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if len(isha) > 0:\n",
"    fig, axes = plt.subplots(1, 3, figsize=(16, 5))\n",
"\n",
"    axes[0].scatter(isha['lat'], isha['isha_angle'], color='darkorange', alpha=0.8, s=60)\n",
"    axes[0].set_xlabel('Latitude (°)')\n",
"    axes[0].set_ylabel('Isha Depression Angle (°)')\n",
"    axes[0].set_title('Isha Angle vs Latitude')\n",
"    axes[0].grid(True, alpha=0.3)\n",
"\n",
"    axes[1].scatter(isha['day_of_year'], isha['isha_angle'], color='darkorange', alpha=0.8, s=60)\n",
"    axes[1].set_xlabel('Day of Year')\n",
"    axes[1].set_ylabel('Isha Depression Angle (°)')\n",
"    axes[1].set_title('Isha Angle vs Season')\n",
"    axes[1].set_xticks([1, 60, 121, 182, 244, 305, 365])\n",
"    axes[1].set_xticklabels(['Jan', 'Mar', 'May', 'Jul', 'Sep', 'Nov', 'Dec'])\n",
"    axes[1].grid(True, alpha=0.3)\n",
"\n",
"    axes[2].scatter(isha['elevation_m'], isha['isha_angle'], color='darkorange', alpha=0.8, s=60)\n",
"    axes[2].set_xlabel('Elevation (m)')\n",
"    axes[2].set_ylabel('Isha Depression Angle (°)')\n",
"    axes[2].set_title('Isha Angle vs Elevation')\n",
"    axes[2].grid(True, alpha=0.3)\n",
"\n",
"    plt.suptitle(f'Isha Analysis (n={len(isha)} records)')\n",
"    plt.tight_layout()\n",
"    plt.savefig(ROOT / 'data/processed/isha_analysis.png', dpi=150, bbox_inches='tight')\n",
"    plt.show()\n",
"\n",
"    print('Isha angle stats by latitude band:')\n",
"    # Labels match the bin edges: the first bin covers 40°S to 10°S\n",
"    isha['lat_band'] = pd.cut(isha['lat'], bins=[-40, -10, 10, 30, 45, 60],\n",
"                              labels=['10-40°S', '10°S-10°N', '10-30°N', '30-45°N', '45-60°N'])\n",
"    print(isha.groupby('lat_band')['isha_angle'].describe().to_string())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Summary and Hypotheses for ML\n",
"\n",
"### Observed patterns:\n",
"\n",
"1. **Latitude effect**: Near-equatorial sites (Malaysia, Indonesia, 2°-7°) show higher Fajr angles (~16°-17°) than mid-latitude sites (UK ~13°, Egypt ~14°). This is counter-intuitive, but one plausible physical explanation is that the sun crosses the horizon zone more steeply at low latitudes, so each degree of depression corresponds to a shorter time interval.\n",
"\n",
"2. **Seasonality (TOY)**: At fixed latitude, Fajr angle is lower in summer than in winter. This is clear in the Birmingham dataset (10+ years of data). At this latitude the sun's path through the horizon zone is shallower in summer, so twilight lasts longer.\n",
"\n",
"3. **Elevation**: Higher-elevation sites tend toward slightly higher angles. Desert observatory sites (Kottamia 477 m, Hail 1020 m, Tehran 1191 m) show angles at the higher end. This is consistent with the expected physical effect: elevated observers look through less atmosphere, so the first light of dawn becomes visible at a slightly larger depression angle.\n",
"\n",
"4. **Latitude × Season interaction**: The seasonal swing is larger at high latitudes (Birmingham has a ~3° range from summer to winter) and smaller at equatorial sites (Malaysian sites show < 1° seasonal variation).\n",
"\n",
"### Next steps for ML:\n",
"\n",
"- Train gradient boosted models (XGBoost, LightGBM) on all available data\n",
"- Key features: `lat`, `lat_abs`, `lat_sq`, `day_of_year`, `doy_sin`, `doy_cos`, `elevation_m`, `lat × doy_sin`, `lat × doy_cos`\n",
"- Expand Isha dataset (currently only 43 records) before training Isha model\n",
"- Outlier analysis: identify records that deviate significantly from the fitted model and investigate whether they represent data entry errors, unusual atmospheric conditions, or genuine outliers\n",
"- Cross-validation strategy: leave-one-location-out (not random split) to test generalization to unseen locations"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
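
The leave-one-location-out strategy listed in the notebook's next steps can be sketched without any ML library. The example below is a minimal illustration, assuming a site is identified by its `(lat, lng)` pair as in the notebook's coverage summary; the helper name `leave_one_location_out` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def leave_one_location_out(records):
    """Yield (held_out_site, train_rows, test_rows) once per unique (lat, lng) site."""
    by_site = defaultdict(list)
    for row in records:
        by_site[(row["lat"], row["lng"])].append(row)
    for site, test_rows in by_site.items():
        # Training data is every row from every other site
        train_rows = [r for s, rows in by_site.items() if s != site for r in rows]
        yield site, train_rows, test_rows


# Made-up rows for illustration only (assumption, not real sightings)
toy_records = [
    {"lat": 52.48, "lng": -1.90, "fajr_angle": 13.1},
    {"lat": 52.48, "lng": -1.90, "fajr_angle": 12.8},
    {"lat": 3.14, "lng": 101.69, "fajr_angle": 16.5},
    {"lat": 30.03, "lng": 31.83, "fajr_angle": 14.2},
]

folds = list(leave_one_location_out(toy_records))
for site, train_rows, test_rows in folds:
    print(site, len(train_rows), len(test_rows))
```

Because one site holds most of the Fajr records, each fold's score is probably best reported per site rather than pooled. With scikit-learn, `sklearn.model_selection.LeaveOneGroupOut` offers the same grouping when given a per-row site label.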