refactor: split verified_sightings into data/models/analysis modules + tests (E8); apply standardization (E2/E11)

This commit is contained in:
Aric Camarata 2026-05-30 18:39:49 -04:00
parent 31cbadc2c4
commit 8fc9c8ded3
23 changed files with 9252 additions and 6022 deletions

View file

@ -53,8 +53,8 @@ This does three things in sequence:
Output:
```
data/processed/fajr_angles.csv — 5,871 Fajr records
data/processed/isha_angles.csv — 46 Isha records
data/processed/fajr_angles.csv — 48,668 Fajr records
data/processed/isha_angles.csv — 34,529 Isha records
```
### Without elevation lookup
@ -83,8 +83,8 @@ Computing solar depression angles...
Dropping N record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha):
...
Fajr dataset: 5871 records → data/processed/fajr_angles.csv
Isha dataset: 46 records → data/processed/isha_angles.csv
Fajr dataset: 48668 records → data/processed/fajr_angles.csv
Isha dataset: 34529 records → data/processed/isha_angles.csv
```
Records dropped with "implausible angles" are data entry or DST-transition artifacts. The

View file

@ -144,7 +144,7 @@ where the Fajr depression angle was determined by linear fitting of SQM time-ser
## Source Quality Summary
Dataset counts as of pipeline run 2026-02-28: **5,871 Fajr + 46 Isha**
Dataset counts as of pipeline run 2026-05-30: **48,668 Fajr + 34,529 Isha**
| Tier | Description | Sources | Approx count |
| --- | --- | --- | --- |

View file

@ -10,8 +10,8 @@ For per-source citations, see [Data Sources](Data-Sources).
| Dataset | Records | Unique Locations | Latitude Range | Date Range | Last Updated |
| --- | --- | --- | --- | --- | --- |
| `fajr_angles.csv` | 5,871 | 110 | -9.6° to 53.8° | 1970-2026 | 2026-02-28 |
| `isha_angles.csv` | 46 | 5 | -33.9° to 53.8° | 1985-2022 | 2026-02-28 |
| `fajr_angles.csv` | 48,668 | 4,200+ | -62.6° to 69.7° | 1970-2026 | 2026-05-30 |
| `isha_angles.csv` | 34,529 | 2,800+ | -65.9° to 69.3° | 1985-2026 | 2026-05-30 |
Target: 100,000 Fajr + 100,000 Isha records.

View file

@ -36,10 +36,10 @@ Output: `data/processed/fajr_angles.csv` and `data/processed/isha_angles.csv`
| Dataset | Records | Locations | Latitude range | Date range |
| --- | --- | --- | --- | --- |
| Fajr | 5,871 | 110 | -9.6° to 53.8° | 1970-2026 |
| Isha | 46 | 5 | -33.9° to 53.8° | 1985-2022 |
| Fajr | 48,668 | 4,200+ | -62.6° to 69.7° | 1970-2026 |
| Isha | 34,529 | 2,800+ | -65.9° to 69.3° | 1985-2026 |
Total: 5,917 unique sighting records across 114 unique locations in 15+ countries.
Total: 83,197 unique sighting records across 4,200+ unique locations in 50+ countries.
Target: 100,000 per dataset. Only human-verified observational records are included. Computed prayer times (Aladhan API, JAKIM e-Solat, fixed-angle schedules) are excluded from the pipeline and stored in `data/raw/excluded/` for reference.

View file

@ -149,7 +149,7 @@ density within 50km, or average nighttime luminance from VIIRS/DMSP.
### 2. Isha Data Scarcity
Only 46 verified Isha (Shafaq al-Abyad) records exist in the dataset, compared to 5,871 Fajr.
Only 34,529 verified Isha (Shafaq al-Abyad) records exist in the dataset, compared to 48,668 Fajr.
Most published Isha studies either use Shafaq al-Ahmar (different criterion, lower angle) or
report aggregate D0 only. The Isha model cannot be meaningfully trained without more data.
@ -234,7 +234,7 @@ The published literature broadly agrees on these points:
## Data Gaps by Region
As of 2026-02-28 (5,871 Fajr / 46 Isha across 114 locations):
As of 2026-05-30 (48,668 Fajr / 34,529 Isha across 4,200+ locations):
| Region | Observed Coverage | Priority |
| --- | --- | --- |

View file

@ -40,8 +40,8 @@ Two clean CSV files are generated by the pipeline:
### Current dataset size
- **Fajr:** 5,871 records, 110 unique locations, latitude range -9.6° to 53.8°
- **Isha:** 46 records, 5 unique locations, latitude range -33.9° to 53.8°
- **Fajr:** 48,668 records, 4,200+ unique locations, latitude range -62.6° to 69.7°
- **Isha:** 34,529 records, 2,800+ unique locations, latitude range -65.9° to 69.3°
- **Date range:** 1970 to 2026
The dominant Fajr source is the [OpenFajr Project](https://openfajr.org), with 4,000+
@ -92,7 +92,12 @@ pray-calc-ml/
│ └── pdf_extractor.py PDF text extraction via PyMuPDF + pdfminer
├── data/
│ ├── raw/raw_sightings/ Per-source raw CSV files
│ └── processed/ Generated CSVs (fajr_angles.csv, isha_angles.csv)
│ ├── processed/ Generated CSVs (fajr_angles.csv, isha_angles.csv)
│ └── SCHEMA.md Column-by-column documentation for both processed CSVs
├── docs/
│ └── ml-training-plan.md Feature engineering, model architecture, CV strategy, metrics
├── src/
│ └── evaluate.py Train baseline models and print precision/recall/MAE
├── research/ Academic paper summaries, aggregate D0 database
├── .github/wiki/ GitHub Wiki pages (synced via Actions)
└── requirements.txt

137
data/SCHEMA.md Normal file
View file

@ -0,0 +1,137 @@
# Dataset Schema: Verified Twilight Sightings
Documents all columns in the processed verified-sightings CSV files used for ML training.
**Files covered:**
- `data/processed/fajr_angles.csv` — Fajr (pre-dawn) twilight observations
- `data/processed/isha_angles.csv` — Isha (post-dusk) twilight observations
**Row count (as of latest build):**
- fajr_angles.csv: 48,668 rows
- isha_angles.csv: 34,529 rows
---
## Column Definitions
### `date`
- **Type:** string (ISO 8601 date)
- **Format:** `YYYY-MM-DD`
- **Example:** `2024-02-02`
- **Description:** Calendar date of the observation in UTC. Used for time-aware train/test splitting. Not used directly as a model feature.
- **Null policy:** Never null. Every row has a valid date.
- **Units:** N/A (date string)
---
### `utc_dt`
- **Type:** string (ISO 8601 datetime with UTC timezone)
- **Format:** `YYYY-MM-DD HH:MM:SS+00:00`
- **Example:** `2024-02-02 00:12:00+00:00`
- **Description:** UTC timestamp of the twilight event. The time component captures the moment of observation, used when computing solar position at the exact instant. Not used directly as a model feature to avoid timestamp leakage.
- **Null policy:** Never null.
- **Units:** N/A (timestamp)
---
### `lat`
- **Type:** float
- **Range:** -90.0 to 90.0
- **Example:** `-62.59334`
- **Description:** Geographic latitude of the observation site in decimal degrees. Negative values are south of the equator.
- **Null policy:** Never null.
- **Units:** Decimal degrees
---
### `lng`
- **Type:** float
- **Range:** -180.0 to 180.0
- **Example:** `15.46875`
- **Description:** Geographic longitude of the observation site in decimal degrees. Negative values are west of the prime meridian.
- **Null policy:** Never null.
- **Units:** Decimal degrees
---
### `elevation_m`
- **Type:** float
- **Range:** Unconstrained (negative values occur over ocean; positive values over land)
- **Example:** `-5007.54` (ocean), `432.0` (land station)
- **Description:** Elevation of the observation site in meters above sea level. Negative values are artifacts from ocean-based sky brightness sensors where the bathymetric elevation was used. Elevation affects the optical path through the atmosphere, which in turn affects the apparent twilight depression angle.
- **Null policy:** Never null.
- **Units:** Meters
---
### `day_of_year`
- **Type:** integer
- **Range:** 1 to 365 (366 in leap years)
- **Example:** `33`
- **Description:** Julian day of the year for the observation date. Day 1 is January 1. Used as a cyclic feature in the ML pipeline (encoded as sin/cos pair to avoid the year-boundary discontinuity). Correlated with solar declination.
- **Null policy:** Never null.
- **Units:** Day number (dimensionless)
---
### `fajr_angle` (fajr_angles.csv only)
- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `9.508`
- **Description:** Computed solar depression angle below the horizon at the moment of Fajr (pre-dawn astronomical twilight onset). This is the target variable (y) for Fajr ML training. Positive values indicate degrees below the horizon. Computed by back-calculating the sun's altitude at the recorded twilight timestamp using NREL SPA.
- **Null policy:** Never null. Rows with uncomputable angles are excluded during cleaning.
- **Units:** Degrees (below horizon, positive)
---
### `isha_angle` (isha_angles.csv only)
- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `16.074`
- **Description:** Computed solar depression angle below the horizon at the moment of Isha (post-dusk astronomical twilight end). This is the target variable (y) for Isha ML training. Same convention as fajr_angle: positive values are degrees below the horizon.
- **Null policy:** Never null.
- **Units:** Degrees (below horizon, positive)
---
### `source`
- **Type:** string (categorical)
- **Values:** `globe_at_night`, `bsrn`, `surfrad`, `galicia_sqm`, `madrid_sqm`, `india_twilight`, `majadahonda_sqm`, `gan_mn`, `tess`, `washetdonker`
- **Example:** `globe_at_night`
- **Description:** Data source identifier. Identifies which collection pipeline produced this row. Not used as a model feature (categorical with potential label-leakage). Used for stratified analysis and debugging.
- **Null policy:** Never null.
- **Units:** N/A (categorical label)
---
### `notes`
- **Type:** string (free text)
- **Example:** `""` (empty string is most common)
- **Description:** Optional annotation attached during data cleaning or manual curation. May contain flags like `high_elevation`, `coastal`, or quality notes from the source dataset. Not used in ML training.
- **Null policy:** May be empty string. Treat as optional metadata.
- **Units:** N/A (free text)
---
## Derived Features Used in Training
The following columns are computed during training and are not stored in the CSV:
| Derived feature | Formula | Purpose |
|---|---|---|
| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic seasonal encoding |
| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic seasonal encoding pair |
| `abs_lat` | abs(lat) | Polar proximity (symmetric across equator) |
These are constructed by `src/evaluate.py` at training time. See `docs/ml-training-plan.md`
for the full feature engineering rationale.

119
docs/ml-training-plan.md Normal file
View file

@ -0,0 +1,119 @@
# ML Training Plan: Solar Depression Angle Prediction
## 1. Problem Statement
The pray-calc Dynamic Prayer Calculation (DPC) algorithm uses a solar depression angle to
determine Fajr and Isha prayer times. The canonical fixed angles (Fajr: 18°, Isha: 17°) do
not account for geographic variation, seasonal shifts, or atmospheric conditions. This pipeline
back-calculates empirical depression angles from verified twilight observations and trains a
regression model to predict the appropriate angle given location and date parameters.
The model output is a continuous regression value: the solar depression angle in degrees below
the horizon at which astronomical twilight (Fajr onset or Isha end) is observed. This feeds the
DPC coefficient table in pray-calc.
## 2. Dataset Description
Two processed CSVs in `data/processed/`:
- `fajr_angles.csv` — 48,668 verified Fajr observations
- `isha_angles.csv` — 34,529 verified Isha observations
Sources include GLOBE at Night sky-brightness readings, BSRN solar radiation stations, SURFRAD
network measurements, and SQM (Sky Quality Meter) deployments in Galicia, Madrid, and India.
Each row represents one twilight event with computed solar depression angle.
Full column documentation is in `data/SCHEMA.md`.
## 3. Feature Engineering
Input features used as model inputs:
| Feature | Derivation | Rationale |
|---|---|---|
| `lat` | Direct from CSV | Latitude affects solar arc and atmospheric path length |
| `lng` | Direct from CSV | Longitude affects local horizon geometry |
| `elevation_m` | Direct from CSV | Higher elevation reduces atmospheric scattering |
| `day_of_year` | Direct from CSV (1-365) | Seasonal variation in solar declination |
| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic encoding to avoid Dec 31/Jan 1 discontinuity |
| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic encoding pair |
| `abs_lat` | abs(lat) | Polar proximity is symmetric; captures high-latitude behavior |
Features deliberately excluded: `utc_dt` as a raw timestamp (leaks future information),
`source` (categorical noise), `notes` (free text).
No imputation is needed: the CSVs have no null values in the feature columns.
## 4. Model Architecture Options
**Option A: Ridge Regression (baseline)**
Interpretable, fast to train, no hyperparameter sensitivity. Establishes a performance floor.
Expected MAE: ~1.5-2.5°. Use as the primary baseline.
**Option B: Gradient Boosting Regressor (recommended)**
Captures nonlinear geographic and seasonal interactions without manual feature crosses.
scikit-learn `GradientBoostingRegressor` with 200 estimators, learning rate 0.05, max depth 4.
Expected MAE: ~0.8-1.5°.
**Option C: Neural Network**
Multi-layer perceptron with 2 hidden layers (64, 32 units), ReLU activations, dropout 0.1.
Higher capacity but adds training complexity with limited benefit on tabular data of this size.
Consider in P2 if gradient boosting does not reach the 0.5° MAE target.
The evaluate.py script implements Options A and B and reports metrics for both.
## 5. Cross-Validation Strategy
Time-aware split to avoid look-ahead bias. The dataset spans 2019-2024. Split:
- Training set: observations with `date < 2023-01-01` (~80%)
- Test set: observations with `date >= 2023-01-01` (~20%)
Standard random shuffle is not used because a random split would allow future observations
to inform past predictions — unrealistic for deployment. The time cutoff simulates the
model being deployed on data it has never seen.
No k-fold cross-validation is applied to the test set. A single time-aware holdout is
sufficient for this dataset size and avoids the complexity of time-series cross-validation
folds.
## 6. Evaluation Metrics
| Metric | Formula | Target |
|---|---|---|
| MAE | mean(abs(predicted - actual)) | < 1.0° for gradient boosting |
| RMSE | sqrt(mean((predicted - actual)²)) | < 1.5° |
| Precision at 0.5° | fraction where abs error < 0.5° | > 40% |
| R² | 1 - SS_res / SS_tot | > 0.5 |
MAE is the primary metric because pray-calc tolerates angle errors up to 1° before prayer
time offset exceeds one minute at mid-latitudes.
## 7. Training Pipeline Steps
1. Load `data/processed/fajr_angles.csv` and `isha_angles.csv`.
2. Construct cyclic features (`sin_doy`, `cos_doy`) and `abs_lat`.
3. Split on date cutoff (2023-01-01).
4. Fit a `StandardScaler` on the training set; transform both sets.
5. Train Ridge Regression baseline. Compute metrics on test set.
6. Train Gradient Boosting Regressor. Compute metrics on test set.
7. Print structured output: one block per prayer type (Fajr/Isha), one row per model.
## 8. Expected Outputs
Running `python src/evaluate.py` produces structured text to stdout:
```
=== Fajr angle prediction ===
Model MAE RMSE R² Prec@0.5°
Ridge 2.14 2.89 0.21 28.3%
GradientBoosting 0.91 1.32 0.63 42.1%
=== Isha angle prediction ===
Model MAE RMSE R² Prec@0.5°
Ridge 1.98 2.71 0.19 29.7%
GradientBoosting 0.85 1.24 0.67 44.6%
```
These numbers are illustrative. Actual values depend on dataset distribution. The script
exits with code 0 on success and 1 on any error.

View file

@ -0,0 +1,24 @@
"""
analysis -- descriptive statistics and visualizations for sightings DataFrames.
Modules:
sightings_stats -- summary stats, geographic coverage, and matplotlib/seaborn plots
"""
from .sightings_stats import (
angle_summary,
geographic_coverage,
plot_angle_distribution,
plot_angle_vs_latitude,
plot_angle_vs_day_of_year,
print_dataset_report,
)
__all__ = [
"angle_summary",
"geographic_coverage",
"plot_angle_distribution",
"plot_angle_vs_latitude",
"plot_angle_vs_day_of_year",
"print_dataset_report",
]

View file

@ -0,0 +1,190 @@
"""
sightings_stats -- descriptive statistics and plots for the sightings dataset.
Purpose:
Provides summary statistics, geographic coverage reports, and matplotlib/
seaborn visualizations for the processed Fajr and Isha sightings DataFrames.
These functions are used in exploratory analysis and to validate the dataset
after pipeline runs.
Input DataFrame schema (Fajr):
date - str or datetime.date
utc_dt - datetime (timezone-aware UTC)
lat - float, decimal degrees
lng - float, decimal degrees
elevation_m - float, metres above sea level
day_of_year - int (optional, added by build_feature_matrix)
fajr_angle - float, solar depression angle in degrees
source - str
notes - str
Input DataFrame schema (Isha):
Same as above but with ``isha_angle`` instead of ``fajr_angle``.
Key functions:
angle_summary(df, angle_col) -> pd.Series (describe() stats)
geographic_coverage(df) -> dict (lat range, unique locs, date range)
plot_angle_distribution(df, angle_col) -> matplotlib Figure
plot_angle_vs_latitude(df, angle_col) -> matplotlib Figure
plot_angle_vs_day_of_year(df, angle_col) -> matplotlib Figure
print_dataset_report(fajr_df, isha_df) -> None (prints to stdout)
SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
"""
from __future__ import annotations
from typing import Any
import pandas as pd
def angle_summary(df: pd.DataFrame, angle_col: str) -> pd.Series:
"""
Return descriptive statistics for the named angle column.
Parameters:
df: DataFrame containing ``angle_col``.
angle_col: Column name, e.g. "fajr_angle" or "isha_angle".
Returns:
pd.Series from DataFrame.describe() count, mean, std, min, quartiles, max.
"""
return df[angle_col].describe()
def geographic_coverage(df: pd.DataFrame) -> dict[str, Any]:
"""
Return a summary of the geographic and temporal coverage of the dataset.
Parameters:
df: DataFrame with ``lat``, ``lng``, and ``date`` columns.
Returns:
dict with keys:
lat_min - float, southernmost latitude
lat_max - float, northernmost latitude
unique_locs - int, number of distinct (lat, lng) pairs
date_min - str, earliest date ("YYYY-MM-DD")
date_max - str, latest date ("YYYY-MM-DD")
total_records - int
"""
dates = df["date"].astype(str)
return {
"lat_min": float(df["lat"].min()),
"lat_max": float(df["lat"].max()),
"unique_locs": int(len(df.groupby(["lat", "lng"]))),
"date_min": str(dates.min()),
"date_max": str(dates.max()),
"total_records": len(df),
}
def plot_angle_distribution(df: pd.DataFrame, angle_col: str):
"""
Plot a histogram and KDE of the named angle column.
Requires matplotlib and seaborn. If seaborn is unavailable, falls back
to a plain matplotlib histogram.
Parameters:
df: DataFrame containing ``angle_col``.
angle_col: Column name, e.g. "fajr_angle" or "isha_angle".
Returns:
matplotlib.figure.Figure
"""
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
try:
import seaborn as sns
sns.histplot(df[angle_col].dropna(), kde=True, ax=ax, bins=30)
except ImportError:
ax.hist(df[angle_col].dropna(), bins=30, edgecolor="black")
ax.set_xlabel("Solar depression angle (°)")
ax.set_ylabel("Count")
ax.set_title(f"{angle_col} distribution (n={len(df[angle_col].dropna())})")
fig.tight_layout()
return fig
def plot_angle_vs_latitude(df: pd.DataFrame, angle_col: str):
"""
Scatter plot: solar depression angle vs geographic latitude.
Shows how the depression angle varies with observer latitude. High-latitude
sites tend to have lower effective depression angles due to shallow twilight
geometry.
Parameters:
df: DataFrame with ``lat`` and ``angle_col`` columns.
angle_col: Column name.
Returns:
matplotlib.figure.Figure
"""
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 5))
valid = df[[angle_col, "lat"]].dropna()
ax.scatter(valid["lat"], valid[angle_col], alpha=0.4, s=15)
ax.set_xlabel("Latitude (°)")
ax.set_ylabel("Solar depression angle (°)")
ax.set_title(f"{angle_col} vs latitude")
fig.tight_layout()
return fig
def plot_angle_vs_day_of_year(df: pd.DataFrame, angle_col: str):
"""
Scatter plot: solar depression angle vs day of year.
Requires a ``day_of_year`` column (add with build_feature_matrix). Shows
seasonal variation in the observed depression angle.
Parameters:
df: DataFrame with ``day_of_year`` and ``angle_col`` columns.
angle_col: Column name.
Returns:
matplotlib.figure.Figure
"""
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
valid = df[[angle_col, "day_of_year"]].dropna()
ax.scatter(valid["day_of_year"], valid[angle_col], alpha=0.4, s=15)
ax.set_xlabel("Day of year")
ax.set_ylabel("Solar depression angle (°)")
ax.set_title(f"{angle_col} vs day of year")
ax.set_xlim(1, 366)
fig.tight_layout()
return fig
def print_dataset_report(fajr_df: pd.DataFrame, isha_df: pd.DataFrame) -> None:
"""
Print a concise text report of both datasets to stdout.
Mirrors the summary output produced by src/pipeline.py main(). Useful for
quick sanity checks after pipeline runs.
Parameters:
fajr_df: Fajr angles DataFrame with ``fajr_angle`` column.
isha_df: Isha angles DataFrame with ``isha_angle`` column.
"""
print(f"\nFajr dataset: {len(fajr_df)} records")
if len(fajr_df) > 0:
print("\nFajr angle stats:")
print(angle_summary(fajr_df, "fajr_angle").to_string())
cov = geographic_coverage(fajr_df)
print("\nFajr geographic coverage:")
print(f" Latitude range: {cov['lat_min']:.1f}° to {cov['lat_max']:.1f}°")
print(f" Unique locations: {cov['unique_locs']}")
print(f" Date range: {cov['date_min']} to {cov['date_max']}")
print(f"\nIsha dataset: {len(isha_df)} records")
if len(isha_df) > 0:
print("\nIsha angle stats:")
print(angle_summary(isha_df, "isha_angle").to_string())

View file

@ -0,0 +1,29 @@
"""
data sightings data loading and cleaning.
Modules:
sightings_loader raw VERIFIED_SIGHTINGS list + SightingRecord schema + loader
sightings_clean quality filtering and deduplication utilities
"""
from .sightings_loader import (
SightingRecord,
VERIFIED_SIGHTINGS,
load_verified_sightings,
)
from .sightings_clean import (
filter_non_genuine,
deduplicate_sightings,
apply_quality_filters,
BAD_NOTE_MARKERS,
)
__all__ = [
"SightingRecord",
"VERIFIED_SIGHTINGS",
"load_verified_sightings",
"filter_non_genuine",
"deduplicate_sightings",
"apply_quality_filters",
"BAD_NOTE_MARKERS",
]

View file

@ -0,0 +1,261 @@
"""
sightings_clean -- quality filtering and deduplication for sightings DataFrames.
Purpose:
Provides functions that remove non-genuine records (computed/aggregate/
timetable-sourced), deduplicate cross-source overlaps, and apply geographic
and temporal quality filters. These routines were extracted from
src/pipeline.py to isolate data-cleaning concerns.
Input DataFrame schema (expected from load_verified_sightings or equivalent):
prayer - str, "fajr" or "isha"
date - str or datetime.date, local calendar date
utc_dt - datetime (timezone-aware UTC)
lat - float, decimal degrees
lng - float, decimal degrees
elevation_m - float, metres above sea level
source - str, citation string
notes - str, observer notes
angle - float (optional, needed for apply_quality_filters)
Output DataFrame schema:
Same columns as input, with invalid/duplicate rows removed.
Key functions:
filter_non_genuine(df) -- drop records computed from angles, not observed
deduplicate_sightings(df) -- drop same prayer+date+location duplicates
apply_quality_filters(df) -- drop future dates, polar lats, implausible angles
SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
"""
from __future__ import annotations
import logging
from datetime import date as date_type
import pandas as pd
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Markers that identify non-genuine records.
# A record whose notes OR source field contains any of these strings (case-
# insensitive) is dropped before angle computation because its time was not
# directly observed but inferred from a published mean angle or taken from
# a government timetable.
# ---------------------------------------------------------------------------
BAD_NOTE_MARKERS = (
# Time-computation markers
"time inferred",
"aggregate representative",
"seasonal representative",
"representative date",
"time computed",
"back-calculated from",
# Government timetable sources
"JAKIM official",
"Diyanet official",
"Diyanet Turkey",
"Diyanet research",
"Ministry of Awqaf",
"Ministry of Habous",
"Moroccan Ministry",
"Iranian Supreme Court",
"Bangladesh Islamic Foundation",
"Nigeria Islamic astronomy consensus",
"MJC South Africa standard",
"Umm al-Qura standard",
"Pakistan astronomical estimates",
"Dubai Awqaf",
"Oman Ministry",
"Kerala Islamic Body",
"Jordanian Ministry",
# Community aggregate sources
"AFIC community observations",
"Community observations",
# Khalid Shaukat seasonal aggregates
"Moonsighting.com / Khalid Shaukat",
# Per-source exclusions of known synthetic datasets
"Hamidi 2007-2008 Isha",
"OIF UMSU 2017-2020",
"Kassim Bahali 2018, Sains Malaysia 47(11)",
"Rashed et al. 2025, NRIAG J., Alexandria",
"Bandung/Jombang study 2012",
"urban LP mean",
"Saksono & Fulazzaky 2020",
"Saksono 2020, NRIAG J.",
"Hassan et al. 2016, NRIAG J. 5:9-15",
"Rashed et al. 2022, IJMET",
"Pinem et al. 2024",
# Generic note-pattern filters
"equinox aggregate",
"solstice aggregate",
"equinox inferred",
"solstice inferred",
"rep date",
# Wrong Isha criterion (shafaq ahmar is not the dataset criterion)
"Shafaq Ahmar",
"shafaq ahmar",
"red dusk twilight",
)
# Angle plausibility bounds (degrees).
# No confirmed genuine sighting falls outside these ranges.
FAJR_MIN_DEG = 7.0
ISHA_MIN_DEG = 10.0
MAX_DEG = 22.0
# Geographic filter: drop polar latitudes (no meaningful Fajr/Isha).
POLAR_LAT_THRESHOLD = 70.0
def filter_non_genuine(df: pd.DataFrame) -> pd.DataFrame:
"""
Remove records whose time was computed from a known angle rather than
directly observed by a human or calibrated instrument.
Checks both the ``notes`` and ``source`` columns against BAD_NOTE_MARKERS
(case-insensitive). Any match in either column causes the row to be dropped.
Parameters:
df: DataFrame with at least ``notes`` and ``source`` columns.
Returns:
Filtered DataFrame with non-genuine rows removed. Index is reset.
"""
bad_notes = df["notes"].apply(
lambda n: any(m.lower() in str(n).lower() for m in BAD_NOTE_MARKERS)
)
bad_source = df["source"].apply(
lambda s: any(m.lower() in str(s).lower() for m in BAD_NOTE_MARKERS)
)
mask = bad_notes | bad_source
n_dropped = mask.sum()
if n_dropped > 0:
dropped = df[mask]
logger.info(
"filter_non_genuine: dropping %d non-genuine record(s) "
"(inferred/aggregate/timetable-sourced):",
n_dropped,
)
for src, cnt in dropped["source"].value_counts().items():
logger.info(" %3d %s", cnt, src)
return df[~mask].copy().reset_index(drop=True)
def deduplicate_sightings(df: pd.DataFrame) -> pd.DataFrame:
"""
Remove duplicate records that refer to the same sighting published in
multiple sources.
Two rows are considered duplicates when they share the same (prayer, date,
lat rounded to 3 dp, lng rounded to 3 dp). The first occurrence is kept;
later occurrences are dropped. Rounding to 3 decimal places corresponds to
~111 m spatial resolution, which is sufficient to identify same-location
cross-source overlaps.
Parameters:
df: DataFrame with ``prayer``, ``date``, ``lat``, ``lng`` columns.
Returns:
Deduplicated DataFrame. Index is reset.
"""
df = df.copy()
df["_lat_r"] = df["lat"].round(3)
df["_lng_r"] = df["lng"].round(3)
dup_mask = df.duplicated(subset=["prayer", "date", "_lat_r", "_lng_r"], keep="first")
n_dups = dup_mask.sum()
if n_dups > 0:
logger.info(
"deduplicate_sightings: removing %d cross-source duplicate(s) "
"(same prayer+date+location):",
n_dups,
)
for _, row in df[dup_mask].iterrows():
logger.info(
" %s %s lat=%.3f lng=%.3f -- %s",
row["prayer"].upper(),
row["date"],
row["lat"],
row["lng"],
row["source"],
)
df = df[~dup_mask].copy()
df = df.drop(columns=["_lat_r", "_lng_r"])
return df.reset_index(drop=True)
def apply_quality_filters(df: pd.DataFrame) -> pd.DataFrame:
"""
Apply geographic, temporal, and angle plausibility filters.
Removes:
- Future-dated records (OpenFajr publishes predictions for the full year;
these are not yet-observed, so they must be excluded from the dataset).
- Polar latitudes (|lat| > 70°): Fajr and Isha have no well-defined
twilight boundary at extreme latitudes.
- Null Island coordinates (lat~0, lng~0): GPS default / missing coordinates.
- Records with implausible solar depression angles:
Fajr: outside [FAJR_MIN_DEG, MAX_DEG]
Isha: outside [ISHA_MIN_DEG, MAX_DEG]
- Records where ``angle`` is NaN (angle computation failed).
Parameters:
df: DataFrame with ``date``, ``lat``, ``lng``, ``prayer``, and ``angle``
columns. ``date`` may be a string ("YYYY-MM-DD") or datetime.date.
Returns:
Filtered DataFrame. Index is reset.
"""
df = df.copy()
today_str = str(date_type.today())
# Drop future-dated records
future = df["date"].astype(str) > today_str
if future.any():
logger.info(
"apply_quality_filters: dropping %d future-dated record(s)", future.sum()
)
df = df[~future].copy()
# Drop polar stations
polar = df["lat"].abs() > POLAR_LAT_THRESHOLD
if polar.any():
logger.info(
"apply_quality_filters: dropping %d polar record(s) (|lat| > %g°)",
polar.sum(),
POLAR_LAT_THRESHOLD,
)
df = df[~polar].copy()
# Drop Null Island
null_island = (df["lat"].abs() < 0.01) & (df["lng"].abs() < 0.01)
if null_island.any():
logger.info(
"apply_quality_filters: dropping %d Null Island record(s)", null_island.sum()
)
df = df[~null_island].copy()
# Drop implausible angles (requires "angle" column present)
if "angle" in df.columns:
fajr_bad = (df["prayer"] == "fajr") & (
(df["angle"] < FAJR_MIN_DEG) | (df["angle"] > MAX_DEG)
)
isha_bad = (df["prayer"] == "isha") & (
(df["angle"] < ISHA_MIN_DEG) | (df["angle"] > MAX_DEG)
)
bad = fajr_bad | isha_bad | df["angle"].isna()
if bad.any():
logger.info(
"apply_quality_filters: dropping %d record(s) with implausible "
"angles (Fajr: <%.0f° or >%.0f° / Isha: <%.0f° or >%.0f°)",
bad.sum(),
FAJR_MIN_DEG,
MAX_DEG,
ISHA_MIN_DEG,
MAX_DEG,
)
df = df[~bad].copy()
return df.reset_index(drop=True)

View file

@ -0,0 +1,129 @@
"""
sightings_loader -- raw verified sightings data and loader function.
Purpose:
Provides the canonical VERIFIED_SIGHTINGS list (manually compiled
human-observed Fajr and Isha sightings), the SightingRecord TypedDict
schema, and load_verified_sightings() which converts the list to a
timezone-aware DataFrame.
Input:
verified_sightings_data.json in the same directory -- 661 records
compiled from primary sources listed below. To add records, append
to the JSON file; the structure matches SightingRecord exactly.
Output DataFrame schema (from load_verified_sightings):
date - datetime.date, local calendar date
utc_dt - datetime (timezone-aware UTC), computed from date_local +
time_local + utc_offset
lat - float, decimal degrees (north positive)
lng - float, decimal degrees (east positive)
elevation_m - float, metres above sea level (0 = unknown, to be looked up)
prayer - str, "fajr" or "isha"
source - str, citation string
notes - str, observer notes
Key functions:
load_verified_sightings() -> pd.DataFrame
Converts VERIFIED_SIGHTINGS to a timezone-aware DataFrame.
Primary sources (see verified_sightings_data.json for per-record citations):
[OIF] = OIF UMSU (Observatory, University of Muhammadiyah North Sumatra)
Published in NRIAG J. Astronomy & Geophysics; multiple papers
[NRIAG] = National Research Institute of Astronomy and Geophysics, Egypt
Papers: Hassan et al. 2014, Rashed et al. 2022, 2025
[HU] = Hizbul Ulama UK, Blackburn observations 1987-1989
Source: http://www.hizbululama.org.uk/files/salat_timing.html
[AY] = Asim Yusuf "Shedding Light on the Dawn" (2017, UK observations)
[MS] = Moonsighting.com, Khalid Shaukat observation records
[KHL] = Khalifa 2018 - Hail, Saudi Arabia naked-eye observations
NRIAG J. Astronomy & Geophysics 7:22-28, 2018
[BND] = Bandung/Jombang Indonesia study 2012 (AIP Conf. Proc. 1454)
[KAS] = Kassim Bahali et al. 2018, Malaysia+Indonesia DSLR study
Sains Malaysia 47(11)
SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
"""
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import TypedDict
import pandas as pd
class SightingRecord(TypedDict):
prayer: str # "fajr" or "isha"
date_local: str # ISO date string, YYYY-MM-DD
time_local: str # HH:MM (24h) in local time
utc_offset: float # hours offset from UTC
lat: float # decimal degrees (north positive)
lng: float # decimal degrees (east positive)
elevation_m: float # metres above sea level (0 = unknown, will be looked up)
source: str # citation
notes: str
# ---------------------------------------------------------------------------
# Load VERIFIED_SIGHTINGS from the companion JSON file.
#
# The data lives in verified_sightings_data.json (same directory) to keep
# this module under the 1600-line budget while allowing the dataset to grow
# beyond 600+ records without bloating Python source files.
#
# To add new records: append a SightingRecord-shaped dict to the JSON array.
# Run `python -m src.pipeline --no-elevation-lookup` after adding records.
# ---------------------------------------------------------------------------
_DATA_FILE = Path(__file__).parent / "verified_sightings_data.json"
def _load_data() -> list[SightingRecord]:
"""Load VERIFIED_SIGHTINGS from the companion JSON file at import time."""
with _DATA_FILE.open(encoding="utf-8") as fh:
raw: list[dict] = json.load(fh)
# Cast to SightingRecord -- JSON gives us float/str/int natively.
# Ensure elevation_m is float (JSON may store integers).
for rec in raw:
rec["utc_offset"] = float(rec["utc_offset"])
rec["lat"] = float(rec["lat"])
rec["lng"] = float(rec["lng"])
rec["elevation_m"] = float(rec["elevation_m"])
return raw # type: ignore[return-value]
# Module-level constant: load once on import.
VERIFIED_SIGHTINGS: list[SightingRecord] = _load_data()
def load_verified_sightings() -> pd.DataFrame:
"""
Return all manually compiled verified sightings as a DataFrame with
utc_dt (timezone-aware UTC) computed from date_local + time_local + utc_offset.
Output columns: date, utc_dt, lat, lng, elevation_m, prayer, source, notes
Returns:
pd.DataFrame with one row per sighting. utc_dt is timezone-aware UTC.
"""
rows = []
for s in VERIFIED_SIGHTINGS:
offset = timedelta(hours=s["utc_offset"])
local_dt = datetime.strptime(
f"{s['date_local']} {s['time_local']}", "%Y-%m-%d %H:%M"
)
utc_dt = (local_dt - offset).replace(tzinfo=timezone.utc)
rows.append(
{
"date": local_dt.date(),
"utc_dt": utc_dt,
"lat": s["lat"],
"lng": s["lng"],
"elevation_m": s["elevation_m"],
"prayer": s["prayer"],
"source": s["source"],
"notes": s["notes"],
}
)
return pd.DataFrame(rows)

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,20 @@
"""
models -- feature engineering for DPC angle prediction.
Modules:
sightings_features -- derive model input features from sightings DataFrames
"""
from .sightings_features import (
add_day_of_year,
add_seasonal_features,
build_feature_matrix,
FEATURE_COLUMNS,
)
__all__ = [
"add_day_of_year",
"add_seasonal_features",
"build_feature_matrix",
"FEATURE_COLUMNS",
]

View file

@ -0,0 +1,132 @@
"""
sightings_features -- feature engineering for DPC solar depression angle prediction.
Purpose:
Derives model-ready input features from a sightings DataFrame. The target
variable is the solar depression angle (fajr_angle or isha_angle). These
features inform the pray-calc Dynamic Prayer Calculation (DPC) formula and
any downstream ML model that replaces or calibrates it.
Input DataFrame schema:
utc_dt - datetime (timezone-aware UTC)
lat - float, decimal degrees (north positive)
lng - float, decimal degrees (east positive)
elevation_m - float, metres above sea level
Output DataFrame schema (from build_feature_matrix):
All input columns PLUS the columns listed in FEATURE_COLUMNS:
day_of_year - int, 1-366, derived from utc_dt (seasonality proxy)
lat_rad - float, latitude in radians
sin_doy - float, sin(2*pi*day_of_year/365.25), circular seasonal encoding
cos_doy - float, cos(2*pi*day_of_year/365.25), circular seasonal encoding
lat_sin_doy - float, interaction: lat * sin_doy
lat_cos_doy - float, interaction: lat * cos_doy
Key functions:
add_day_of_year(df) -> pd.DataFrame
Adds day_of_year column from utc_dt.
add_seasonal_features(df) -> pd.DataFrame
Adds circular sin/cos encodings and lat interactions.
build_feature_matrix(df) -> pd.DataFrame
Applies both transforms; returns df with all FEATURE_COLUMNS present.
SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
"""
from __future__ import annotations
import math
import pandas as pd
# The complete set of feature columns added by build_feature_matrix.
# ML models consuming these features should select from FEATURE_COLUMNS plus
# lat, lng, elevation_m as their X matrix.
FEATURE_COLUMNS = [
"day_of_year",
"lat_rad",
"sin_doy",
"cos_doy",
"lat_sin_doy",
"lat_cos_doy",
]
_TWO_PI = 2.0 * math.pi
_DAYS_PER_YEAR = 365.25
def add_day_of_year(df: pd.DataFrame) -> pd.DataFrame:
"""
Add ``day_of_year`` (1-366) derived from the ``utc_dt`` column.
``day_of_year`` is the primary seasonality feature for the DPC angle
prediction: the solar depression angle at Fajr/Isha varies systematically
across the year at any given latitude.
Parameters:
df: DataFrame with a timezone-aware ``utc_dt`` column.
Returns:
Copy of df with ``day_of_year`` (int) added.
"""
df = df.copy()
df["day_of_year"] = df["utc_dt"].apply(lambda dt: dt.timetuple().tm_yday)
return df
def add_seasonal_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Add circular seasonal encodings and latitude interaction terms.
Circular encoding avoids the discontinuity at day 1 / day 365 that a raw
``day_of_year`` integer would introduce for models that treat features as
continuous. The interaction terms capture the fact that seasonal variation
in the depression angle is stronger at higher latitudes.
Requires ``day_of_year`` and ``lat`` columns (call add_day_of_year first
if ``day_of_year`` is not already present).
New columns added:
lat_rad - latitude converted to radians
sin_doy - sin(2*pi*day_of_year/365.25)
cos_doy - cos(2*pi*day_of_year/365.25)
lat_sin_doy - lat * sin_doy
lat_cos_doy - lat * cos_doy
Parameters:
df: DataFrame with ``day_of_year`` (int) and ``lat`` (float) columns.
Returns:
Copy of df with the five new feature columns added.
"""
df = df.copy()
doy = df["day_of_year"]
lat = df["lat"]
df["lat_rad"] = lat * (math.pi / 180.0)
df["sin_doy"] = (doy * _TWO_PI / _DAYS_PER_YEAR).apply(math.sin)
df["cos_doy"] = (doy * _TWO_PI / _DAYS_PER_YEAR).apply(math.cos)
df["lat_sin_doy"] = lat * df["sin_doy"]
df["lat_cos_doy"] = lat * df["cos_doy"]
return df
def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
"""
Apply all feature engineering transforms and return the complete feature
matrix.
Convenience wrapper that calls add_day_of_year then add_seasonal_features.
After this call, df contains all columns in FEATURE_COLUMNS.
Parameters:
df: DataFrame with ``utc_dt`` (timezone-aware datetime) and ``lat``
(float) columns.
Returns:
Copy of df with all FEATURE_COLUMNS added. Original columns are
preserved unchanged.
"""
df = add_day_of_year(df)
df = add_seasonal_features(df)
return df

File diff suppressed because it is too large Load diff

202
src/evaluate.py Normal file
View file

@ -0,0 +1,202 @@
"""
Train and evaluate regression models for solar depression angle prediction.
Inputs: data/processed/fajr_angles.csv, data/processed/isha_angles.csv
Outputs: precision/recall/MAE structured report to stdout
Uses a time-aware 80/20 train/test split (cutoff 2023-01-01) to avoid
look-ahead bias. Evaluates two models: Ridge Regression (baseline) and
Gradient Boosting Regressor (recommended). See docs/ml-training-plan.md
for feature engineering rationale and architecture decisions.
"""
from __future__ import annotations
import math
import sys
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# ── Path resolution ────────────────────────────────────────────────────────────
def _repo_root() -> Path:
"""Walk up from this file until we find the data/processed/ directory."""
candidate = Path(__file__).resolve().parent
for _ in range(5):
if (candidate / "data" / "processed").is_dir():
return candidate
candidate = candidate.parent
raise FileNotFoundError(
"Cannot locate data/processed/ directory. "
"Run this script from within the pray-calc-ml repository."
)
# ── Feature construction ───────────────────────────────────────────────────────
FEATURES = ["lat", "lng", "elevation_m", "day_of_year", "sin_doy", "cos_doy", "abs_lat"]
TRAIN_CUTOFF = "2023-01-01" # time-aware split boundary
def build_features(df: pd.DataFrame) -> pd.DataFrame:
"""
Add cyclic day-of-year encoding and absolute latitude.
Cyclic sin/cos encoding prevents the year-boundary discontinuity
(day 365 day 1) that a raw integer would introduce.
abs_lat captures the symmetric effect of polar proximity without
forcing the model to learn it from signed latitude.
"""
df = df.copy()
radians = 2 * math.pi * df["day_of_year"] / 365
df["sin_doy"] = np.sin(radians)
df["cos_doy"] = np.cos(radians)
df["abs_lat"] = df["lat"].abs()
return df
def split_data(
df: pd.DataFrame,
target_col: str,
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
"""
Time-aware train/test split on the TRAIN_CUTOFF boundary.
Returns (X_train, y_train, X_test, y_test).
A random shuffle would allow future observations to inform past predictions,
which is unrealistic for deployment. Temporal split avoids this.
"""
mask_train = df["date"] < TRAIN_CUTOFF
train = df[mask_train]
test = df[~mask_train]
if train.empty or test.empty:
raise ValueError(
f"Train/test split produced an empty set "
f"(train={len(train)}, test={len(test)}). "
f"Check TRAIN_CUTOFF={TRAIN_CUTOFF!r} against the dataset date range."
)
X_train = train[FEATURES]
y_train = train[target_col]
X_test = test[FEATURES]
y_test = test[target_col]
return X_train, y_train, X_test, y_test
# ── Evaluation helpers ─────────────────────────────────────────────────────────
def precision_at_threshold(y_true: np.ndarray, y_pred: np.ndarray, thr: float = 0.5) -> float:
"""Fraction of predictions within `thr` degrees of the true value."""
return float(np.mean(np.abs(y_pred - y_true) < thr))
def evaluate_model(
model,
X_train: pd.DataFrame,
y_train: pd.Series,
X_test: pd.DataFrame,
y_test: pd.Series,
scaler: StandardScaler,
) -> dict:
"""
Fit `model` on scaled training data, score on scaled test data.
Returns a dict with MAE, RMSE, , and precision at 0.5°.
The scaler is fit on training data only; test data is transformed, not fit.
"""
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
rmse = math.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
prec = precision_at_threshold(y_test.to_numpy(), y_pred, thr=0.5)
return {"mae": mae, "rmse": rmse, "r2": r2, "prec_0_5": prec}
# ── Reporting ──────────────────────────────────────────────────────────────────
def print_results(prayer: str, results: dict[str, dict]) -> None:
"""Print a structured results table to stdout."""
header = f"=== {prayer} angle prediction ==="
print(header)
print(f"{'Model':<22} {'MAE':>6} {'RMSE':>6} {'':>6} {'Prec@0.5°':>10}")
print("-" * 55)
for model_name, m in results.items():
print(
f"{model_name:<22} "
f"{m['mae']:>6.2f} "
f"{m['rmse']:>6.2f} "
f"{m['r2']:>6.2f} "
f"{m['prec_0_5']:>9.1%}"
)
print()
# ── Main pipeline ──────────────────────────────────────────────────────────────
def run_pipeline(csv_path: Path, target_col: str, prayer_label: str) -> None:
"""
Full train/evaluate loop for one prayer dataset.
Loads the CSV, builds features, splits on the time boundary, scales,
then trains and scores both Ridge and GradientBoosting models.
"""
df = pd.read_csv(csv_path, dtype={"date": str})
df = build_features(df)
X_train, y_train, X_test, y_test = split_data(df, target_col)
# Fit scaler on training data only — no data leakage from test set
scaler = StandardScaler()
scaler.fit(X_train[FEATURES])
models = {
"Ridge": Ridge(alpha=1.0),
"GradientBoosting": GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=4,
subsample=0.8,
random_state=42,
),
}
results = {}
for name, model in models.items():
results[name] = evaluate_model(model, X_train, y_train, X_test, y_test, scaler)
print_results(prayer_label, results)
def main() -> None:
root = _repo_root()
processed = root / "data" / "processed"
fajr_csv = processed / "fajr_angles.csv"
isha_csv = processed / "isha_angles.csv"
for path in (fajr_csv, isha_csv):
if not path.exists():
print(f"ERROR: dataset not found: {path}", file=sys.stderr)
sys.exit(1)
run_pipeline(fajr_csv, target_col="fajr_angle", prayer_label="Fajr")
run_pipeline(isha_csv, target_col="isha_angle", prayer_label="Isha")
if __name__ == "__main__":
main()

1
tests/__init__.py Normal file
View file

@ -0,0 +1 @@
# tests package

View file

@ -0,0 +1,223 @@
"""
Tests for src/collect/data/sightings_clean.py
Verifies filter_non_genuine, deduplicate_sightings, and apply_quality_filters
against known bad/good records and edge cases.
"""
import sys
sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
import pandas as pd
import pytest
from datetime import datetime, timezone, timedelta
from src.collect.data.sightings_clean import (
filter_non_genuine,
deduplicate_sightings,
apply_quality_filters,
BAD_NOTE_MARKERS,
FAJR_MIN_DEG,
ISHA_MIN_DEG,
MAX_DEG,
)
def _make_row(prayer="fajr", date="2023-03-20", lat=40.0, lng=29.0,
elevation_m=100.0, source="Test source", notes="naked eye"):
"""Build a minimal valid sighting row dict."""
utc_dt = datetime(2023, 3, 20, 4, 0, 0, tzinfo=timezone.utc)
return {
"prayer": prayer,
"date": date,
"utc_dt": utc_dt,
"lat": lat,
"lng": lng,
"elevation_m": elevation_m,
"source": source,
"notes": notes,
}
class TestFilterNonGenuine:
def test_genuine_record_is_kept(self):
"""A record with no bad markers should pass through unchanged."""
df = pd.DataFrame([_make_row(notes="naked eye; per-night obs")])
result = filter_non_genuine(df)
assert len(result) == 1
def test_bad_notes_marker_drops_row(self):
"""A record with 'time inferred' in notes must be dropped."""
df = pd.DataFrame([
_make_row(notes="time inferred from D0=14.5°"),
_make_row(notes="naked eye observation"),
])
result = filter_non_genuine(df)
assert len(result) == 1
assert "naked eye observation" in result.iloc[0]["notes"]
def test_bad_source_marker_drops_row(self):
"""A record matching a bad marker in source must be dropped."""
df = pd.DataFrame([
_make_row(source="Umm al-Qura standard 18° Fajr, Makkah"),
_make_row(source="Genuine researcher 2023"),
])
result = filter_non_genuine(df)
assert len(result) == 1
def test_shafaq_ahmar_dropped(self):
"""Records with Shafaq Ahmar criterion must be dropped."""
df = pd.DataFrame([
_make_row(notes="Shafaq Ahmar (red dusk) observed"),
_make_row(notes="Shafaq al-Abyad (white dusk)"),
])
result = filter_non_genuine(df)
assert len(result) == 1
assert "al-Abyad" in result.iloc[0]["notes"]
def test_case_insensitive_match(self):
"""Bad marker matching is case-insensitive."""
df = pd.DataFrame([
_make_row(notes="AGGREGATE REPRESENTATIVE date; 4 records"),
])
result = filter_non_genuine(df)
assert len(result) == 0
def test_empty_dataframe(self):
"""Empty DataFrame returns empty DataFrame without error."""
df = pd.DataFrame(columns=["prayer", "date", "lat", "lng", "notes", "source"])
result = filter_non_genuine(df)
assert len(result) == 0
def test_bad_note_markers_is_nonempty(self):
"""BAD_NOTE_MARKERS must have at least 10 entries."""
assert len(BAD_NOTE_MARKERS) >= 10
class TestDeduplicateSightings:
def test_no_duplicates_unchanged(self):
"""DataFrame with no duplicates is returned unchanged."""
rows = [
_make_row(lat=40.0, lng=29.0, date="2023-01-01"),
_make_row(lat=51.5, lng=-0.1, date="2023-01-01"),
]
df = pd.DataFrame(rows)
result = deduplicate_sightings(df)
assert len(result) == 2
def test_exact_duplicate_removed(self):
"""Two identical (prayer, date, lat, lng) rows: keep first, drop second."""
rows = [
_make_row(lat=40.0, lng=29.0, date="2023-06-15", source="Source A"),
_make_row(lat=40.0, lng=29.0, date="2023-06-15", source="Source B"),
]
df = pd.DataFrame(rows)
result = deduplicate_sightings(df)
assert len(result) == 1
assert result.iloc[0]["source"] == "Source A"
def test_near_duplicate_within_111m_removed(self):
"""Two records rounded to same 3dp lat/lng are duplicates."""
rows = [
_make_row(lat=40.001, lng=29.001, date="2023-06-15", source="Source A"),
_make_row(lat=40.0009, lng=29.0009, date="2023-06-15", source="Source B"),
]
df = pd.DataFrame(rows)
result = deduplicate_sightings(df)
assert len(result) == 1
def test_same_location_different_prayer_both_kept(self):
"""Same lat/lng but different prayer type: both kept."""
rows = [
_make_row(prayer="fajr", lat=40.0, lng=29.0, date="2023-06-15"),
_make_row(prayer="isha", lat=40.0, lng=29.0, date="2023-06-15"),
]
df = pd.DataFrame(rows)
result = deduplicate_sightings(df)
assert len(result) == 2
def test_same_location_different_date_both_kept(self):
"""Same lat/lng and prayer but different date: both kept."""
rows = [
_make_row(lat=40.0, lng=29.0, date="2023-06-15"),
_make_row(lat=40.0, lng=29.0, date="2023-06-16"),
]
df = pd.DataFrame(rows)
result = deduplicate_sightings(df)
assert len(result) == 2
def test_no_lat_lng_temp_columns_in_output(self):
"""Deduplication must not leave _lat_r or _lng_r temp columns in output."""
df = pd.DataFrame([_make_row()])
result = deduplicate_sightings(df)
assert "_lat_r" not in result.columns
assert "_lng_r" not in result.columns
class TestApplyQualityFilters:
def _df_with_angle(self, prayer="fajr", angle=14.0, lat=40.0,
date="2023-01-15", lng=29.0):
utc_dt = datetime(2023, 1, 15, 4, 0, tzinfo=timezone.utc)
return pd.DataFrame([{
"prayer": prayer, "date": date, "utc_dt": utc_dt,
"lat": lat, "lng": lng, "elevation_m": 100.0,
"angle": angle, "source": "test", "notes": "test",
}])
def test_valid_fajr_angle_kept(self):
df = self._df_with_angle(prayer="fajr", angle=14.0)
assert len(apply_quality_filters(df)) == 1
def test_valid_isha_angle_kept(self):
df = self._df_with_angle(prayer="isha", angle=17.5)
assert len(apply_quality_filters(df)) == 1
def test_fajr_below_min_dropped(self):
"""Fajr angle below FAJR_MIN_DEG is dropped."""
df = self._df_with_angle(prayer="fajr", angle=FAJR_MIN_DEG - 0.1)
assert len(apply_quality_filters(df)) == 0
def test_isha_below_min_dropped(self):
"""Isha angle below ISHA_MIN_DEG is dropped."""
df = self._df_with_angle(prayer="isha", angle=ISHA_MIN_DEG - 0.1)
assert len(apply_quality_filters(df)) == 0
def test_angle_above_max_dropped(self):
"""Angle above MAX_DEG is dropped for both prayers."""
df_fajr = self._df_with_angle(prayer="fajr", angle=MAX_DEG + 0.1)
df_isha = self._df_with_angle(prayer="isha", angle=MAX_DEG + 0.1)
assert len(apply_quality_filters(df_fajr)) == 0
assert len(apply_quality_filters(df_isha)) == 0
def test_polar_lat_dropped(self):
"""Records with |lat| > 70 are dropped."""
df = self._df_with_angle(lat=71.0)
assert len(apply_quality_filters(df)) == 0
def test_null_island_dropped(self):
"""Records at lat~0, lng~0 are dropped."""
utc_dt = datetime(2023, 1, 15, 4, 0, tzinfo=timezone.utc)
df = pd.DataFrame([{
"prayer": "fajr", "date": "2023-01-15", "utc_dt": utc_dt,
"lat": 0.0, "lng": 0.0, "elevation_m": 0.0,
"angle": 14.0, "source": "test", "notes": "test",
}])
assert len(apply_quality_filters(df)) == 0
def test_future_date_dropped(self):
"""Records with dates in the future are dropped."""
from datetime import date
future = str(date(date.today().year + 1, 1, 1))
utc_dt = datetime(date.today().year + 1, 1, 1, 4, 0, tzinfo=timezone.utc)
df = pd.DataFrame([{
"prayer": "fajr", "date": future, "utc_dt": utc_dt,
"lat": 40.0, "lng": 29.0, "elevation_m": 100.0,
"angle": 14.0, "source": "test", "notes": "test",
}])
assert len(apply_quality_filters(df)) == 0
def test_nan_angle_dropped(self):
"""Records with NaN angle are dropped."""
import math
df = self._df_with_angle(angle=float("nan"))
assert len(apply_quality_filters(df)) == 0

View file

@ -0,0 +1,141 @@
"""
Tests for src/collect/models/sightings_features.py
Verifies add_day_of_year, add_seasonal_features, and build_feature_matrix
produce correct outputs with the expected shapes and value ranges.
"""
import sys
sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
import math
import pytest
import pandas as pd
from datetime import datetime, timezone
from src.collect.models.sightings_features import (
add_day_of_year,
add_seasonal_features,
build_feature_matrix,
FEATURE_COLUMNS,
)
def _make_df(utc_dts, lats):
"""Build a minimal DataFrame for feature testing."""
rows = []
for dt, lat in zip(utc_dts, lats):
rows.append({
"utc_dt": dt,
"lat": lat,
"lng": 30.0,
"elevation_m": 100.0,
})
return pd.DataFrame(rows)
class TestAddDayOfYear:
def test_march_equinox(self):
"""March 20 should be day 79 (non-leap year)."""
df = _make_df([datetime(2023, 3, 20, 4, 0, tzinfo=timezone.utc)], [40.0])
result = add_day_of_year(df)
assert "day_of_year" in result.columns
assert result.iloc[0]["day_of_year"] == 79
def test_jan1_is_day_1(self):
"""January 1 must be day 1."""
df = _make_df([datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)], [0.0])
result = add_day_of_year(df)
assert result.iloc[0]["day_of_year"] == 1
def test_dec31_is_day_365_nonleap(self):
"""Dec 31 in a non-leap year must be day 365."""
df = _make_df([datetime(2023, 12, 31, 0, 0, tzinfo=timezone.utc)], [0.0])
result = add_day_of_year(df)
assert result.iloc[0]["day_of_year"] == 365
def test_dec31_is_day_366_leap(self):
"""Dec 31 in a leap year must be day 366."""
df = _make_df([datetime(2024, 12, 31, 0, 0, tzinfo=timezone.utc)], [0.0])
result = add_day_of_year(df)
assert result.iloc[0]["day_of_year"] == 366
def test_original_columns_preserved(self):
"""add_day_of_year must not remove any existing columns."""
df = _make_df([datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)], [50.0])
result = add_day_of_year(df)
for col in df.columns:
assert col in result.columns
class TestAddSeasonalFeatures:
def setup_method(self):
dt = datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)
self.df = add_day_of_year(_make_df([dt], [45.0]))
def test_all_feature_columns_added(self):
"""All five new seasonal feature columns must be present."""
result = add_seasonal_features(self.df)
for col in ("lat_rad", "sin_doy", "cos_doy", "lat_sin_doy", "lat_cos_doy"):
assert col in result.columns, f"Missing: {col}"
def test_sin_cos_unit_circle(self):
"""sin_doy^2 + cos_doy^2 must equal 1.0 (unit circle)."""
result = add_seasonal_features(self.df)
for _, row in result.iterrows():
magnitude = math.sqrt(row["sin_doy"] ** 2 + row["cos_doy"] ** 2)
assert abs(magnitude - 1.0) < 1e-9, f"Not on unit circle: {magnitude}"
def test_lat_rad_conversion(self):
"""lat_rad must equal lat * pi/180."""
df = add_day_of_year(_make_df(
[datetime(2023, 3, 20, 0, 0, tzinfo=timezone.utc)], [45.0]
))
result = add_seasonal_features(df)
expected = 45.0 * (math.pi / 180.0)
assert abs(result.iloc[0]["lat_rad"] - expected) < 1e-10
def test_latitude_zero_gives_zero_interactions(self):
"""At lat=0, lat_sin_doy and lat_cos_doy must be 0."""
df = add_day_of_year(_make_df(
[datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)], [0.0]
))
result = add_seasonal_features(df)
assert result.iloc[0]["lat_sin_doy"] == 0.0
assert result.iloc[0]["lat_cos_doy"] == 0.0
class TestBuildFeatureMatrix:
def test_all_feature_columns_present(self):
"""build_feature_matrix must add all FEATURE_COLUMNS."""
dt = datetime(2023, 9, 22, 0, 0, tzinfo=timezone.utc)
df = _make_df([dt], [30.0])
result = build_feature_matrix(df)
for col in FEATURE_COLUMNS:
assert col in result.columns, f"Missing feature: {col}"
def test_original_columns_preserved(self):
"""build_feature_matrix must not remove original columns."""
dt = datetime(2023, 9, 22, 0, 0, tzinfo=timezone.utc)
df = _make_df([dt], [30.0])
result = build_feature_matrix(df)
for col in ("utc_dt", "lat", "lng", "elevation_m"):
assert col in result.columns
def test_multi_row_input(self):
"""build_feature_matrix must handle multiple rows correctly."""
dts = [
datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc),
datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc),
datetime(2023, 12, 31, 0, 0, tzinfo=timezone.utc),
]
lats = [10.0, 40.0, -30.0]
df = _make_df(dts, lats)
result = build_feature_matrix(df)
assert len(result) == 3
assert result.iloc[0]["day_of_year"] == 1
assert result.iloc[2]["day_of_year"] == 365
def test_feature_columns_constant(self):
"""FEATURE_COLUMNS list must have exactly 6 elements."""
assert len(FEATURE_COLUMNS) == 6

View file

@ -0,0 +1,116 @@
"""
Tests for src/collect/data/sightings_loader.py
Verifies that VERIFIED_SIGHTINGS has the expected structure, SightingRecord
TypedDict fields are consistent, and load_verified_sightings() returns a
correctly shaped DataFrame with UTC-aware timestamps.
"""
import pytest
from datetime import timezone
import sys
sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
from src.collect.data.sightings_loader import (
VERIFIED_SIGHTINGS,
SightingRecord,
load_verified_sightings,
)
class TestVerifiedSightingsData:
def test_sightings_list_is_nonempty(self):
"""VERIFIED_SIGHTINGS must contain at least one record."""
assert len(VERIFIED_SIGHTINGS) > 0, "VERIFIED_SIGHTINGS is empty"
def test_all_records_have_required_keys(self):
"""Every record must contain all SightingRecord fields."""
required = {"prayer", "date_local", "time_local", "utc_offset", "lat", "lng",
"elevation_m", "source", "notes"}
missing_any = []
for i, rec in enumerate(VERIFIED_SIGHTINGS):
missing = required - set(rec.keys())
if missing:
missing_any.append((i, missing))
assert not missing_any, f"Records missing keys: {missing_any[:3]}"
def test_prayer_values_are_fajr_or_isha(self):
"""Each record's prayer field must be 'fajr' or 'isha'."""
invalid = [r for r in VERIFIED_SIGHTINGS if r["prayer"] not in ("fajr", "isha")]
assert not invalid, f"{len(invalid)} records with invalid prayer value"
def test_date_local_format(self):
"""date_local must be in YYYY-MM-DD format."""
from datetime import datetime
invalid = []
for i, r in enumerate(VERIFIED_SIGHTINGS):
try:
datetime.strptime(r["date_local"], "%Y-%m-%d")
except ValueError:
invalid.append((i, r["date_local"]))
assert not invalid, f"Invalid date_local formats (first 3): {invalid[:3]}"
def test_time_local_format(self):
"""time_local must be HH:MM (24-hour)."""
from datetime import datetime
invalid = []
for i, r in enumerate(VERIFIED_SIGHTINGS):
try:
datetime.strptime(r["time_local"], "%H:%M")
except ValueError:
invalid.append((i, r["time_local"]))
assert not invalid, f"Invalid time_local formats (first 3): {invalid[:3]}"
def test_lat_lng_are_numeric_and_in_range(self):
"""lat must be in [-90, 90], lng in [-180, 180]."""
bad_lat = [r for r in VERIFIED_SIGHTINGS if not (-90 <= r["lat"] <= 90)]
bad_lng = [r for r in VERIFIED_SIGHTINGS if not (-180 <= r["lng"] <= 180)]
assert not bad_lat, f"{len(bad_lat)} records with lat out of range"
assert not bad_lng, f"{len(bad_lng)} records with lng out of range"
class TestLoadVerifiedSightings:
def setup_method(self):
self.df = load_verified_sightings()
def test_returns_dataframe(self):
"""load_verified_sightings must return a pandas DataFrame."""
import pandas as pd
assert isinstance(self.df, pd.DataFrame)
def test_row_count_matches_list(self):
"""DataFrame must have the same number of rows as VERIFIED_SIGHTINGS."""
assert len(self.df) == len(VERIFIED_SIGHTINGS)
def test_required_columns_present(self):
"""All required output columns must be present."""
required = {"date", "utc_dt", "lat", "lng", "elevation_m", "prayer", "source", "notes"}
missing = required - set(self.df.columns)
assert not missing, f"Missing columns: {missing}"
def test_utc_dt_is_timezone_aware(self):
"""utc_dt column must contain timezone-aware datetimes in UTC."""
for dt in self.df["utc_dt"].head(10):
assert dt.tzinfo is not None, f"utc_dt {dt} is not timezone-aware"
assert dt.tzinfo == timezone.utc, f"utc_dt {dt} is not UTC"
def test_prayer_column_values(self):
"""prayer column must only contain 'fajr' or 'isha'."""
unique = set(self.df["prayer"].unique())
assert unique <= {"fajr", "isha"}, f"Unexpected prayer values: {unique}"
def test_utc_conversion_blackburn_fajr(self):
"""Spot-check UTC conversion for a known Blackburn BST record.
1987-09-21, 05:30 BST (UTC+1) should convert to 04:30 UTC.
"""
blackburn = self.df[
(self.df["date"].astype(str) == "1987-09-21")
& (self.df["prayer"] == "fajr")
& (self.df["lat"].round(2) == 53.75)
]
assert len(blackburn) >= 1, "Could not find Blackburn 1987-09-21 fajr record"
utc_dt = blackburn.iloc[0]["utc_dt"]
assert utc_dt.hour == 4, f"Expected hour 4 (UTC), got {utc_dt.hour}"
assert utc_dt.minute == 30, f"Expected minute 30, got {utc_dt.minute}"

View file

@ -0,0 +1,173 @@
"""
Tests for src/collect/analysis/sightings_stats.py
Verifies angle_summary, geographic_coverage, and print_dataset_report
return correct types and values. Plot functions are tested for return type only
(no display).
"""
import sys
sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
import pytest
import pandas as pd
from datetime import datetime, timezone
from src.collect.analysis.sightings_stats import (
angle_summary,
geographic_coverage,
plot_angle_distribution,
plot_angle_vs_latitude,
plot_angle_vs_day_of_year,
print_dataset_report,
)
def _make_fajr_df(n=10):
"""Build a minimal Fajr DataFrame with day_of_year and fajr_angle."""
rows = []
for i in range(n):
day = (i * 36) + 1 # spread across the year
dt = datetime(2023, 1, 1, 4, 0, tzinfo=timezone.utc)
rows.append({
"date": f"2023-01-{i+1:02d}",
"utc_dt": dt,
"lat": 30.0 + i * 0.5,
"lng": 29.0,
"elevation_m": 100.0,
"prayer": "fajr",
"fajr_angle": 12.0 + i * 0.3,
"day_of_year": day,
"source": f"Source {i}",
"notes": "test",
})
return pd.DataFrame(rows)
def _make_isha_df(n=5):
"""Build a minimal Isha DataFrame."""
rows = []
for i in range(n):
rows.append({
"date": f"2023-03-{i+1:02d}",
"utc_dt": datetime(2023, 3, i+1, 19, 0, tzinfo=timezone.utc),
"lat": 25.0 + i,
"lng": 46.0,
"elevation_m": 620.0,
"prayer": "isha",
"isha_angle": 17.0 + i * 0.2,
"day_of_year": 60 + i,
"source": f"Isha Source {i}",
"notes": "test",
})
return pd.DataFrame(rows)
class TestAngleSummary:
def test_returns_series(self):
"""angle_summary must return a pandas Series."""
df = _make_fajr_df()
result = angle_summary(df, "fajr_angle")
assert isinstance(result, pd.Series)
def test_count_is_correct(self):
"""count in summary must equal number of non-NaN angle rows."""
df = _make_fajr_df(10)
result = angle_summary(df, "fajr_angle")
assert result["count"] == 10.0
def test_mean_is_plausible(self):
"""Mean angle must be within the data range."""
df = _make_fajr_df(10)
result = angle_summary(df, "fajr_angle")
assert df["fajr_angle"].min() <= result["mean"] <= df["fajr_angle"].max()
def test_isha_angle_col(self):
"""Works correctly with isha_angle column."""
df = _make_isha_df()
result = angle_summary(df, "isha_angle")
assert result["count"] == 5.0
class TestGeographicCoverage:
def test_returns_dict(self):
"""geographic_coverage must return a dict."""
df = _make_fajr_df()
result = geographic_coverage(df)
assert isinstance(result, dict)
def test_required_keys_present(self):
"""All expected keys must be in the result dict."""
df = _make_fajr_df()
result = geographic_coverage(df)
for key in ("lat_min", "lat_max", "unique_locs", "date_min", "date_max", "total_records"):
assert key in result, f"Missing key: {key}"
def test_total_records_correct(self):
"""total_records must equal len(df)."""
df = _make_fajr_df(7)
result = geographic_coverage(df)
assert result["total_records"] == 7
def test_lat_range_correct(self):
"""lat_min and lat_max must match DataFrame lat column bounds."""
df = _make_fajr_df(10)
result = geographic_coverage(df)
assert abs(result["lat_min"] - df["lat"].min()) < 1e-6
assert abs(result["lat_max"] - df["lat"].max()) < 1e-6
def test_unique_locs_counts_groups(self):
"""unique_locs must count distinct (lat, lng) pairs."""
df = _make_fajr_df(10)
expected = len(df.groupby(["lat", "lng"]))
result = geographic_coverage(df)
assert result["unique_locs"] == expected
class TestPrintDatasetReport:
def test_runs_without_error(self, capsys):
"""print_dataset_report must run to completion without exceptions."""
fajr = _make_fajr_df(5)
isha = _make_isha_df(3)
print_dataset_report(fajr, isha)
out = capsys.readouterr().out
assert "Fajr dataset: 5" in out
assert "Isha dataset: 3" in out
def test_empty_datasets(self, capsys):
"""print_dataset_report handles empty DataFrames without crashing."""
fajr = pd.DataFrame(columns=["fajr_angle"])
isha = pd.DataFrame(columns=["isha_angle"])
print_dataset_report(fajr, isha)
out = capsys.readouterr().out
assert "Fajr dataset: 0" in out
assert "Isha dataset: 0" in out
class TestPlotFunctions:
def test_plot_angle_distribution_returns_figure(self):
"""plot_angle_distribution must return a matplotlib Figure."""
import matplotlib
matplotlib.use("Agg") # non-interactive backend for tests
df = _make_fajr_df(10)
fig = plot_angle_distribution(df, "fajr_angle")
import matplotlib.figure
assert isinstance(fig, matplotlib.figure.Figure)
def test_plot_angle_vs_latitude_returns_figure(self):
"""plot_angle_vs_latitude must return a matplotlib Figure."""
import matplotlib
matplotlib.use("Agg")
df = _make_fajr_df(10)
fig = plot_angle_vs_latitude(df, "fajr_angle")
import matplotlib.figure
assert isinstance(fig, matplotlib.figure.Figure)
def test_plot_angle_vs_day_of_year_returns_figure(self):
"""plot_angle_vs_day_of_year must return a matplotlib Figure."""
import matplotlib
matplotlib.use("Agg")
df = _make_fajr_df(10)
fig = plot_angle_vs_day_of_year(df, "fajr_angle")
import matplotlib.figure
assert isinstance(fig, matplotlib.figure.Figure)