refactor: split verified_sightings into data/models/analysis modules + tests (E8); apply standardization (E2/E11)

2026-06-30 19:04:26 +00:00 · 2026-05-30 18:39:49 -04:00 · 2026-05-30 18:39:49 -04:00 · 8fc9c8ded3
commit 8fc9c8ded3
parent 31cbadc2c4
23 changed files with 9252 additions and 6022 deletions
--- a/.github/wiki/Data-Collection.md
+++ b/.github/wiki/Data-Collection.md
@ -53,8 +53,8 @@ This does three things in sequence:

 Output:
 ```
-data/processed/fajr_angles.csv   — 5,871 Fajr records
-data/processed/isha_angles.csv   — 46 Isha records
+data/processed/fajr_angles.csv   — 48,668 Fajr records
+data/processed/isha_angles.csv   — 34,529 Isha records
 ```

 ### Without elevation lookup
@ -83,8 +83,8 @@ Computing solar depression angles...
  Dropping N record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha):
    ...

-Fajr dataset: 5871 records → data/processed/fajr_angles.csv
-Isha dataset:   46 records → data/processed/isha_angles.csv
+Fajr dataset: 48668 records → data/processed/fajr_angles.csv
+Isha dataset: 34529 records → data/processed/isha_angles.csv
 ```

 Records dropped with "implausible angles" are data entry or DST-transition artifacts. The
--- a/.github/wiki/Data-Sources.md
+++ b/.github/wiki/Data-Sources.md
@ -144,7 +144,7 @@ where the Fajr depression angle was determined by linear fitting of SQM time-ser

 ## Source Quality Summary

-Dataset counts as of pipeline run 2026-02-28: **5,871 Fajr + 46 Isha**
+Dataset counts as of pipeline run 2026-05-30: **48,668 Fajr + 34,529 Isha**

 | Tier | Description | Sources | Approx count |
 | --- | --- | --- | --- |
--- a/.github/wiki/Data.md
+++ b/.github/wiki/Data.md
@ -10,8 +10,8 @@ For per-source citations, see [Data Sources](Data-Sources).

 | Dataset | Records | Unique Locations | Latitude Range | Date Range | Last Updated |
 | --- | --- | --- | --- | --- | --- |
-| `fajr_angles.csv` | 5,871 | 110 | -9.6° to 53.8° | 1970-2026 | 2026-02-28 |
-| `isha_angles.csv` | 46 | 5 | -33.9° to 53.8° | 1985-2022 | 2026-02-28 |
+| `fajr_angles.csv` | 48,668 | 4,200+ | -62.6° to 69.7° | 1970-2026 | 2026-05-30 |
+| `isha_angles.csv` | 34,529 | 2,800+ | -65.9° to 69.3° | 1985-2026 | 2026-05-30 |

 Target: 100,000 Fajr + 100,000 Isha records.

--- a/.github/wiki/Home.md
+++ b/.github/wiki/Home.md
@ -36,10 +36,10 @@ Output: `data/processed/fajr_angles.csv` and `data/processed/isha_angles.csv`

 | Dataset | Records | Locations | Latitude range | Date range |
 | --- | --- | --- | --- | --- |
-| Fajr | 5,871 | 110 | -9.6° to 53.8° | 1970-2026 |
-| Isha | 46 | 5 | -33.9° to 53.8° | 1985-2022 |
+| Fajr | 48,668 | 4,200+ | -62.6° to 69.7° | 1970-2026 |
+| Isha | 34,529 | 2,800+ | -65.9° to 69.3° | 1985-2026 |

-Total: 5,917 unique sighting records across 114 unique locations in 15+ countries.
+Total: 83,197 unique sighting records across 4,200+ unique locations in 50+ countries.

 Target: 100,000 per dataset. Only human-verified observational records are included. Computed prayer times (Aladhan API, JAKIM e-Solat, fixed-angle schedules) are excluded from the pipeline and stored in `data/raw/excluded/` for reference.

--- a/.github/wiki/Research.md
+++ b/.github/wiki/Research.md
@ -149,7 +149,7 @@ density within 50km, or average nighttime luminance from VIIRS/DMSP.

 ### 2. Isha Data Scarcity

-Only 46 verified Isha (Shafaq al-Abyad) records exist in the dataset, compared to 5,871 Fajr.
+Only 34,529 verified Isha (Shafaq al-Abyad) records exist in the dataset, compared to 48,668 Fajr.
 Most published Isha studies either use Shafaq al-Ahmar (different criterion, lower angle) or
 report aggregate D0 only. The Isha model cannot be meaningfully trained without more data.

@ -234,7 +234,7 @@ The published literature broadly agrees on these points:

 ## Data Gaps by Region

-As of 2026-02-28 (5,871 Fajr / 46 Isha across 114 locations):
+As of 2026-05-30 (48,668 Fajr / 34,529 Isha across 4,200+ locations):

 | Region | Observed Coverage | Priority |
 | --- | --- | --- |
--- a/README.md
+++ b/README.md
@ -40,8 +40,8 @@ Two clean CSV files are generated by the pipeline:

 ### Current dataset size

- **Fajr:** 5,871 records, 110 unique locations, latitude range -9.6° to 53.8°
- **Isha:** 46 records, 5 unique locations, latitude range -33.9° to 53.8°
+- **Fajr:** 48,668 records, 4,200+ unique locations, latitude range -62.6° to 69.7°
+- **Isha:** 34,529 records, 2,800+ unique locations, latitude range -65.9° to 69.3°
 - **Date range:** 1970 to 2026

 The dominant Fajr source is the [OpenFajr Project](https://openfajr.org), with 4,000+
@ -92,7 +92,12 @@ pray-calc-ml/
 │       └── pdf_extractor.py       PDF text extraction via PyMuPDF + pdfminer
 ├── data/
 │   ├── raw/raw_sightings/         Per-source raw CSV files
-│   └── processed/                 Generated CSVs (fajr_angles.csv, isha_angles.csv)
+│   ├── processed/                 Generated CSVs (fajr_angles.csv, isha_angles.csv)
+│   └── SCHEMA.md                  Column-by-column documentation for both processed CSVs
+├── docs/
+│   └── ml-training-plan.md        Feature engineering, model architecture, CV strategy, metrics
+├── src/
+│   └── evaluate.py                Train baseline models and print precision/recall/MAE
 ├── research/                      Academic paper summaries, aggregate D0 database
 ├── .github/wiki/                  GitHub Wiki pages (synced via Actions)
 └── requirements.txt
--- a/data/SCHEMA.md
+++ b/data/SCHEMA.md
@ -0,0 +1,137 @@
+# Dataset Schema: Verified Twilight Sightings
+
+Documents all columns in the processed verified-sightings CSV files used for ML training.
+
+**Files covered:**
+- `data/processed/fajr_angles.csv` — Fajr (pre-dawn) twilight observations
+- `data/processed/isha_angles.csv` — Isha (post-dusk) twilight observations
+
+**Row count (as of latest build):**
+- fajr_angles.csv: 48,668 rows
+- isha_angles.csv: 34,529 rows
+
+---
+
+## Column Definitions
+
+### `date`
+
+- **Type:** string (ISO 8601 date)
+- **Format:** `YYYY-MM-DD`
+- **Example:** `2024-02-02`
+- **Description:** Calendar date of the observation in UTC. Used for time-aware train/test splitting. Not used directly as a model feature.
+- **Null policy:** Never null. Every row has a valid date.
+- **Units:** N/A (date string)
+
+---
+
+### `utc_dt`
+
+- **Type:** string (ISO 8601 datetime with UTC timezone)
+- **Format:** `YYYY-MM-DD HH:MM:SS+00:00`
+- **Example:** `2024-02-02 00:12:00+00:00`
+- **Description:** UTC timestamp of the twilight event. The time component captures the moment of observation, used when computing solar position at the exact instant. Not used directly as a model feature to avoid timestamp leakage.
+- **Null policy:** Never null.
+- **Units:** N/A (timestamp)
+
+---
+
+### `lat`
+
+- **Type:** float
+- **Range:** -90.0 to 90.0
+- **Example:** `-62.59334`
+- **Description:** Geographic latitude of the observation site in decimal degrees. Negative values are south of the equator.
+- **Null policy:** Never null.
+- **Units:** Decimal degrees
+
+---
+
+### `lng`
+
+- **Type:** float
+- **Range:** -180.0 to 180.0
+- **Example:** `15.46875`
+- **Description:** Geographic longitude of the observation site in decimal degrees. Negative values are west of the prime meridian.
+- **Null policy:** Never null.
+- **Units:** Decimal degrees
+
+---
+
+### `elevation_m`
+
+- **Type:** float
+- **Range:** Unconstrained (negative values occur over ocean; positive values over land)
+- **Example:** `-5007.54` (ocean), `432.0` (land station)
+- **Description:** Elevation of the observation site in meters above sea level. Negative values are artifacts from ocean-based sky brightness sensors where the bathymetric elevation was used. Elevation affects the optical path through the atmosphere, which in turn affects the apparent twilight depression angle.
+- **Null policy:** Never null.
+- **Units:** Meters
+
+---
+
+### `day_of_year`
+
+- **Type:** integer
+- **Range:** 1 to 365 (366 in leap years)
+- **Example:** `33`
+- **Description:** Julian day of the year for the observation date. Day 1 is January 1. Used as a cyclic feature in the ML pipeline (encoded as sin/cos pair to avoid the year-boundary discontinuity). Correlated with solar declination.
+- **Null policy:** Never null.
+- **Units:** Day number (dimensionless)
+
+---
+
+### `fajr_angle` (fajr_angles.csv only)
+
+- **Type:** float
+- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
+- **Example:** `9.508`
+- **Description:** Computed solar depression angle below the horizon at the moment of Fajr (pre-dawn astronomical twilight onset). This is the target variable (y) for Fajr ML training. Positive values indicate degrees below the horizon. Computed by back-calculating the sun's altitude at the recorded twilight timestamp using NREL SPA.
+- **Null policy:** Never null. Rows with uncomputable angles are excluded during cleaning.
+- **Units:** Degrees (below horizon, positive)
+
+---
+
+### `isha_angle` (isha_angles.csv only)
+
+- **Type:** float
+- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
+- **Example:** `16.074`
+- **Description:** Computed solar depression angle below the horizon at the moment of Isha (post-dusk astronomical twilight end). This is the target variable (y) for Isha ML training. Same convention as fajr_angle: positive values are degrees below the horizon.
+- **Null policy:** Never null.
+- **Units:** Degrees (below horizon, positive)
+
+---
+
+### `source`
+
+- **Type:** string (categorical)
+- **Values:** `globe_at_night`, `bsrn`, `surfrad`, `galicia_sqm`, `madrid_sqm`, `india_twilight`, `majadahonda_sqm`, `gan_mn`, `tess`, `washetdonker`
+- **Example:** `globe_at_night`
+- **Description:** Data source identifier. Identifies which collection pipeline produced this row. Not used as a model feature (categorical with potential label-leakage). Used for stratified analysis and debugging.
+- **Null policy:** Never null.
+- **Units:** N/A (categorical label)
+
+---
+
+### `notes`
+
+- **Type:** string (free text)
+- **Example:** `""` (empty string is most common)
+- **Description:** Optional annotation attached during data cleaning or manual curation. May contain flags like `high_elevation`, `coastal`, or quality notes from the source dataset. Not used in ML training.
+- **Null policy:** May be empty string. Treat as optional metadata.
+- **Units:** N/A (free text)
+
+---
+
+## Derived Features Used in Training
+
+The following columns are computed during training and are not stored in the CSV:
+
+| Derived feature | Formula | Purpose |
+|---|---|---|
+| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic seasonal encoding |
+| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic seasonal encoding pair |
+| `abs_lat` | abs(lat) | Polar proximity (symmetric across equator) |
+
+These are constructed by `src/evaluate.py` at training time. See `docs/ml-training-plan.md`
+for the full feature engineering rationale.
--- a/docs/ml-training-plan.md
+++ b/docs/ml-training-plan.md
@ -0,0 +1,119 @@
+# ML Training Plan: Solar Depression Angle Prediction
+
+## 1. Problem Statement
+
+The pray-calc Dynamic Prayer Calculation (DPC) algorithm uses a solar depression angle to
+determine Fajr and Isha prayer times. The canonical fixed angles (Fajr: 18°, Isha: 17°) do
+not account for geographic variation, seasonal shifts, or atmospheric conditions. This pipeline
+back-calculates empirical depression angles from verified twilight observations and trains a
+regression model to predict the appropriate angle given location and date parameters.
+
+The model output is a continuous regression value: the solar depression angle in degrees below
+the horizon at which astronomical twilight (Fajr onset or Isha end) is observed. This feeds the
+DPC coefficient table in pray-calc.
+
+## 2. Dataset Description
+
+Two processed CSVs in `data/processed/`:
+
+- `fajr_angles.csv` — 48,668 verified Fajr observations
+- `isha_angles.csv` — 34,529 verified Isha observations
+
+Sources include GLOBE at Night sky-brightness readings, BSRN solar radiation stations, SURFRAD
+network measurements, and SQM (Sky Quality Meter) deployments in Galicia, Madrid, and India.
+Each row represents one twilight event with computed solar depression angle.
+
+Full column documentation is in `data/SCHEMA.md`.
+
+## 3. Feature Engineering
+
+Input features used as model inputs:
+
+| Feature | Derivation | Rationale |
+|---|---|---|
+| `lat` | Direct from CSV | Latitude affects solar arc and atmospheric path length |
+| `lng` | Direct from CSV | Longitude affects local horizon geometry |
+| `elevation_m` | Direct from CSV | Higher elevation reduces atmospheric scattering |
+| `day_of_year` | Direct from CSV (1-365) | Seasonal variation in solar declination |
+| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic encoding to avoid Dec 31/Jan 1 discontinuity |
+| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic encoding pair |
+| `abs_lat` | abs(lat) | Polar proximity is symmetric; captures high-latitude behavior |
+
+Features deliberately excluded: `utc_dt` as a raw timestamp (leaks future information),
+`source` (categorical noise), `notes` (free text).
+
+No imputation is needed: the CSVs have no null values in the feature columns.
+
+## 4. Model Architecture Options
+
+**Option A: Ridge Regression (baseline)**
+Interpretable, fast to train, no hyperparameter sensitivity. Establishes a performance floor.
+Expected MAE: ~1.5-2.5°. Use as the primary baseline.
+
+**Option B: Gradient Boosting Regressor (recommended)**
+Captures nonlinear geographic and seasonal interactions without manual feature crosses.
+scikit-learn `GradientBoostingRegressor` with 200 estimators, learning rate 0.05, max depth 4.
+Expected MAE: ~0.8-1.5°.
+
+**Option C: Neural Network**
+Multi-layer perceptron with 2 hidden layers (64, 32 units), ReLU activations, dropout 0.1.
+Higher capacity but adds training complexity with limited benefit on tabular data of this size.
+Consider in P2 if gradient boosting does not reach the 0.5° MAE target.
+
+The evaluate.py script implements Options A and B and reports metrics for both.
+
+## 5. Cross-Validation Strategy
+
+Time-aware split to avoid look-ahead bias. The dataset spans 2019-2024. Split:
+
+- Training set: observations with `date < 2023-01-01` (~80%)
+- Test set: observations with `date >= 2023-01-01` (~20%)
+
+Standard random shuffle is not used because a random split would allow future observations
+to inform past predictions — unrealistic for deployment. The time cutoff simulates the
+model being deployed on data it has never seen.
+
+No k-fold cross-validation is applied to the test set. A single time-aware holdout is
+sufficient for this dataset size and avoids the complexity of time-series cross-validation
+folds.
+
+## 6. Evaluation Metrics
+
+| Metric | Formula | Target |
+|---|---|---|
+| MAE | mean(abs(predicted - actual)) | < 1.0° for gradient boosting |
+| RMSE | sqrt(mean((predicted - actual)²)) | < 1.5° |
+| Precision at 0.5° | fraction where abs error < 0.5° | > 40% |
+| R² | 1 - SS_res / SS_tot | > 0.5 |
+
+MAE is the primary metric because pray-calc tolerates angle errors up to 1° before prayer
+time offset exceeds one minute at mid-latitudes.
+
+## 7. Training Pipeline Steps
+
+1. Load `data/processed/fajr_angles.csv` and `isha_angles.csv`.
+2. Construct cyclic features (`sin_doy`, `cos_doy`) and `abs_lat`.
+3. Split on date cutoff (2023-01-01).
+4. Fit a `StandardScaler` on the training set; transform both sets.
+5. Train Ridge Regression baseline. Compute metrics on test set.
+6. Train Gradient Boosting Regressor. Compute metrics on test set.
+7. Print structured output: one block per prayer type (Fajr/Isha), one row per model.
+
+## 8. Expected Outputs
+
+Running `python src/evaluate.py` produces structured text to stdout:
+
+```
+=== Fajr angle prediction ===
+Model              MAE    RMSE   R²    Prec@0.5°
+Ridge              2.14   2.89   0.21  28.3%
+GradientBoosting   0.91   1.32   0.63  42.1%
+
+=== Isha angle prediction ===
+Model              MAE    RMSE   R²    Prec@0.5°
+Ridge              1.98   2.71   0.19  29.7%
+GradientBoosting   0.85   1.24   0.67  44.6%
+```
+
+These numbers are illustrative. Actual values depend on dataset distribution. The script
+exits with code 0 on success and 1 on any error.
--- a/src/collect/analysis/init.py
+++ b/src/collect/analysis/init.py
@ -0,0 +1,24 @@
+"""
+analysis -- descriptive statistics and visualizations for sightings DataFrames.
+
+Modules:
+  sightings_stats  -- summary stats, geographic coverage, and matplotlib/seaborn plots
+"""
+
+from .sightings_stats import (
+    angle_summary,
+    geographic_coverage,
+    plot_angle_distribution,
+    plot_angle_vs_latitude,
+    plot_angle_vs_day_of_year,
+    print_dataset_report,
+)
+
+__all__ = [
+    "angle_summary",
+    "geographic_coverage",
+    "plot_angle_distribution",
+    "plot_angle_vs_latitude",
+    "plot_angle_vs_day_of_year",
+    "print_dataset_report",
+]
--- a/src/collect/analysis/sightings_stats.py
+++ b/src/collect/analysis/sightings_stats.py
@ -0,0 +1,190 @@
+"""
+sightings_stats -- descriptive statistics and plots for the sightings dataset.
+
+Purpose:
+    Provides summary statistics, geographic coverage reports, and matplotlib/
+    seaborn visualizations for the processed Fajr and Isha sightings DataFrames.
+    These functions are used in exploratory analysis and to validate the dataset
+    after pipeline runs.
+
+Input DataFrame schema (Fajr):
+    date        - str or datetime.date
+    utc_dt      - datetime (timezone-aware UTC)
+    lat         - float, decimal degrees
+    lng         - float, decimal degrees
+    elevation_m - float, metres above sea level
+    day_of_year - int (optional, added by build_feature_matrix)
+    fajr_angle  - float, solar depression angle in degrees
+    source      - str
+    notes       - str
+
+Input DataFrame schema (Isha):
+    Same as above but with ``isha_angle`` instead of ``fajr_angle``.
+
+Key functions:
+    angle_summary(df, angle_col)          -> pd.Series  (describe() stats)
+    geographic_coverage(df)               -> dict        (lat range, unique locs, date range)
+    plot_angle_distribution(df, angle_col) -> matplotlib Figure
+    plot_angle_vs_latitude(df, angle_col)  -> matplotlib Figure
+    plot_angle_vs_day_of_year(df, angle_col) -> matplotlib Figure
+    print_dataset_report(fajr_df, isha_df)  -> None (prints to stdout)
+
+SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+import pandas as pd
+
+
+def angle_summary(df: pd.DataFrame, angle_col: str) -> pd.Series:
+    """
+    Return descriptive statistics for the named angle column.
+
+    Parameters:
+        df:         DataFrame containing ``angle_col``.
+        angle_col:  Column name, e.g. "fajr_angle" or "isha_angle".
+
+    Returns:
+        pd.Series from DataFrame.describe() — count, mean, std, min, quartiles, max.
+    """
+    return df[angle_col].describe()
+
+
+def geographic_coverage(df: pd.DataFrame) -> dict[str, Any]:
+    """
+    Return a summary of the geographic and temporal coverage of the dataset.
+
+    Parameters:
+        df: DataFrame with ``lat``, ``lng``, and ``date`` columns.
+
+    Returns:
+        dict with keys:
+          lat_min       - float, southernmost latitude
+          lat_max       - float, northernmost latitude
+          unique_locs   - int, number of distinct (lat, lng) pairs
+          date_min      - str, earliest date ("YYYY-MM-DD")
+          date_max      - str, latest date ("YYYY-MM-DD")
+          total_records - int
+    """
+    dates = df["date"].astype(str)
+    return {
+        "lat_min": float(df["lat"].min()),
+        "lat_max": float(df["lat"].max()),
+        "unique_locs": int(len(df.groupby(["lat", "lng"]))),
+        "date_min": str(dates.min()),
+        "date_max": str(dates.max()),
+        "total_records": len(df),
+    }
+
+
+def plot_angle_distribution(df: pd.DataFrame, angle_col: str):
+    """
+    Plot a histogram and KDE of the named angle column.
+
+    Requires matplotlib and seaborn. If seaborn is unavailable, falls back
+    to a plain matplotlib histogram.
+
+    Parameters:
+        df:         DataFrame containing ``angle_col``.
+        angle_col:  Column name, e.g. "fajr_angle" or "isha_angle".
+
+    Returns:
+        matplotlib.figure.Figure
+    """
+    import matplotlib.pyplot as plt
+
+    fig, ax = plt.subplots(figsize=(8, 4))
+    try:
+        import seaborn as sns
+        sns.histplot(df[angle_col].dropna(), kde=True, ax=ax, bins=30)
+    except ImportError:
+        ax.hist(df[angle_col].dropna(), bins=30, edgecolor="black")
+    ax.set_xlabel("Solar depression angle (°)")
+    ax.set_ylabel("Count")
+    ax.set_title(f"{angle_col} distribution (n={len(df[angle_col].dropna())})")
+    fig.tight_layout()
+    return fig
+
+
+def plot_angle_vs_latitude(df: pd.DataFrame, angle_col: str):
+    """
+    Scatter plot: solar depression angle vs geographic latitude.
+
+    Shows how the depression angle varies with observer latitude. High-latitude
+    sites tend to have lower effective depression angles due to shallow twilight
+    geometry.
+
+    Parameters:
+        df:         DataFrame with ``lat`` and ``angle_col`` columns.
+        angle_col:  Column name.
+
+    Returns:
+        matplotlib.figure.Figure
+    """
+    import matplotlib.pyplot as plt
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    valid = df[[angle_col, "lat"]].dropna()
+    ax.scatter(valid["lat"], valid[angle_col], alpha=0.4, s=15)
+    ax.set_xlabel("Latitude (°)")
+    ax.set_ylabel("Solar depression angle (°)")
+    ax.set_title(f"{angle_col} vs latitude")
+    fig.tight_layout()
+    return fig
+
+
+def plot_angle_vs_day_of_year(df: pd.DataFrame, angle_col: str):
+    """
+    Scatter plot: solar depression angle vs day of year.
+
+    Requires a ``day_of_year`` column (add with build_feature_matrix). Shows
+    seasonal variation in the observed depression angle.
+
+    Parameters:
+        df:         DataFrame with ``day_of_year`` and ``angle_col`` columns.
+        angle_col:  Column name.
+
+    Returns:
+        matplotlib.figure.Figure
+    """
+    import matplotlib.pyplot as plt
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    valid = df[[angle_col, "day_of_year"]].dropna()
+    ax.scatter(valid["day_of_year"], valid[angle_col], alpha=0.4, s=15)
+    ax.set_xlabel("Day of year")
+    ax.set_ylabel("Solar depression angle (°)")
+    ax.set_title(f"{angle_col} vs day of year")
+    ax.set_xlim(1, 366)
+    fig.tight_layout()
+    return fig
+
+
+def print_dataset_report(fajr_df: pd.DataFrame, isha_df: pd.DataFrame) -> None:
+    """
+    Print a concise text report of both datasets to stdout.
+
+    Mirrors the summary output produced by src/pipeline.py main(). Useful for
+    quick sanity checks after pipeline runs.
+
+    Parameters:
+        fajr_df:  Fajr angles DataFrame with ``fajr_angle`` column.
+        isha_df:  Isha angles DataFrame with ``isha_angle`` column.
+    """
+    print(f"\nFajr dataset: {len(fajr_df)} records")
+    if len(fajr_df) > 0:
+        print("\nFajr angle stats:")
+        print(angle_summary(fajr_df, "fajr_angle").to_string())
+        cov = geographic_coverage(fajr_df)
+        print("\nFajr geographic coverage:")
+        print(f"  Latitude range: {cov['lat_min']:.1f}° to {cov['lat_max']:.1f}°")
+        print(f"  Unique locations: {cov['unique_locs']}")
+        print(f"  Date range: {cov['date_min']} to {cov['date_max']}")
+
+    print(f"\nIsha dataset: {len(isha_df)} records")
+    if len(isha_df) > 0:
+        print("\nIsha angle stats:")
+        print(angle_summary(isha_df, "isha_angle").to_string())
--- a/src/collect/data/init.py
+++ b/src/collect/data/init.py
@ -0,0 +1,29 @@
+"""
+data — sightings data loading and cleaning.
+
+Modules:
+  sightings_loader  — raw VERIFIED_SIGHTINGS list + SightingRecord schema + loader
+  sightings_clean   — quality filtering and deduplication utilities
+"""
+
+from .sightings_loader import (
+    SightingRecord,
+    VERIFIED_SIGHTINGS,
+    load_verified_sightings,
+)
+from .sightings_clean import (
+    filter_non_genuine,
+    deduplicate_sightings,
+    apply_quality_filters,
+    BAD_NOTE_MARKERS,
+)
+
+__all__ = [
+    "SightingRecord",
+    "VERIFIED_SIGHTINGS",
+    "load_verified_sightings",
+    "filter_non_genuine",
+    "deduplicate_sightings",
+    "apply_quality_filters",
+    "BAD_NOTE_MARKERS",
+]
--- a/src/collect/data/sightings_clean.py
+++ b/src/collect/data/sightings_clean.py
@ -0,0 +1,261 @@
+"""
+sightings_clean -- quality filtering and deduplication for sightings DataFrames.
+
+Purpose:
+    Provides functions that remove non-genuine records (computed/aggregate/
+    timetable-sourced), deduplicate cross-source overlaps, and apply geographic
+    and temporal quality filters. These routines were extracted from
+    src/pipeline.py to isolate data-cleaning concerns.
+
+Input DataFrame schema (expected from load_verified_sightings or equivalent):
+    prayer      - str, "fajr" or "isha"
+    date        - str or datetime.date, local calendar date
+    utc_dt      - datetime (timezone-aware UTC)
+    lat         - float, decimal degrees
+    lng         - float, decimal degrees
+    elevation_m - float, metres above sea level
+    source      - str, citation string
+    notes       - str, observer notes
+    angle       - float (optional, needed for apply_quality_filters)
+
+Output DataFrame schema:
+    Same columns as input, with invalid/duplicate rows removed.
+
+Key functions:
+    filter_non_genuine(df)  -- drop records computed from angles, not observed
+    deduplicate_sightings(df) -- drop same prayer+date+location duplicates
+    apply_quality_filters(df) -- drop future dates, polar lats, implausible angles
+
+SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
+"""
+
+from __future__ import annotations
+
+import logging
+from datetime import date as date_type
+
+import pandas as pd
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+# Markers that identify non-genuine records.
+# A record whose notes OR source field contains any of these strings (case-
+# insensitive) is dropped before angle computation because its time was not
+# directly observed but inferred from a published mean angle or taken from
+# a government timetable.
+# ---------------------------------------------------------------------------
+BAD_NOTE_MARKERS = (
+    # Time-computation markers
+    "time inferred",
+    "aggregate representative",
+    "seasonal representative",
+    "representative date",
+    "time computed",
+    "back-calculated from",
+    # Government timetable sources
+    "JAKIM official",
+    "Diyanet official",
+    "Diyanet Turkey",
+    "Diyanet research",
+    "Ministry of Awqaf",
+    "Ministry of Habous",
+    "Moroccan Ministry",
+    "Iranian Supreme Court",
+    "Bangladesh Islamic Foundation",
+    "Nigeria Islamic astronomy consensus",
+    "MJC South Africa standard",
+    "Umm al-Qura standard",
+    "Pakistan astronomical estimates",
+    "Dubai Awqaf",
+    "Oman Ministry",
+    "Kerala Islamic Body",
+    "Jordanian Ministry",
+    # Community aggregate sources
+    "AFIC community observations",
+    "Community observations",
+    # Khalid Shaukat seasonal aggregates
+    "Moonsighting.com / Khalid Shaukat",
+    # Per-source exclusions of known synthetic datasets
+    "Hamidi 2007-2008 Isha",
+    "OIF UMSU 2017-2020",
+    "Kassim Bahali 2018, Sains Malaysia 47(11)",
+    "Rashed et al. 2025, NRIAG J., Alexandria",
+    "Bandung/Jombang study 2012",
+    "urban LP mean",
+    "Saksono & Fulazzaky 2020",
+    "Saksono 2020, NRIAG J.",
+    "Hassan et al. 2016, NRIAG J. 5:9-15",
+    "Rashed et al. 2022, IJMET",
+    "Pinem et al. 2024",
+    # Generic note-pattern filters
+    "equinox aggregate",
+    "solstice aggregate",
+    "equinox inferred",
+    "solstice inferred",
+    "rep date",
+    # Wrong Isha criterion (shafaq ahmar is not the dataset criterion)
+    "Shafaq Ahmar",
+    "shafaq ahmar",
+    "red dusk twilight",
+)
+
+# Angle plausibility bounds (degrees).
+# No confirmed genuine sighting falls outside these ranges.
+FAJR_MIN_DEG = 7.0
+ISHA_MIN_DEG = 10.0
+MAX_DEG = 22.0
+
+# Geographic filter: drop polar latitudes (no meaningful Fajr/Isha).
+POLAR_LAT_THRESHOLD = 70.0
+
+
+def filter_non_genuine(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Remove records whose time was computed from a known angle rather than
+    directly observed by a human or calibrated instrument.
+
+    Checks both the ``notes`` and ``source`` columns against BAD_NOTE_MARKERS
+    (case-insensitive). Any match in either column causes the row to be dropped.
+
+    Parameters:
+        df: DataFrame with at least ``notes`` and ``source`` columns.
+
+    Returns:
+        Filtered DataFrame with non-genuine rows removed. Index is reset.
+    """
+    bad_notes = df["notes"].apply(
+        lambda n: any(m.lower() in str(n).lower() for m in BAD_NOTE_MARKERS)
+    )
+    bad_source = df["source"].apply(
+        lambda s: any(m.lower() in str(s).lower() for m in BAD_NOTE_MARKERS)
+    )
+    mask = bad_notes | bad_source
+    n_dropped = mask.sum()
+    if n_dropped > 0:
+        dropped = df[mask]
+        logger.info(
+            "filter_non_genuine: dropping %d non-genuine record(s) "
+            "(inferred/aggregate/timetable-sourced):",
+            n_dropped,
+        )
+        for src, cnt in dropped["source"].value_counts().items():
+            logger.info("  %3d  %s", cnt, src)
+    return df[~mask].copy().reset_index(drop=True)
+
+
+def deduplicate_sightings(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Remove duplicate records that refer to the same sighting published in
+    multiple sources.
+
+    Two rows are considered duplicates when they share the same (prayer, date,
+    lat rounded to 3 dp, lng rounded to 3 dp). The first occurrence is kept;
+    later occurrences are dropped. Rounding to 3 decimal places corresponds to
+    ~111 m spatial resolution, which is sufficient to identify same-location
+    cross-source overlaps.
+
+    Parameters:
+        df: DataFrame with ``prayer``, ``date``, ``lat``, ``lng`` columns.
+
+    Returns:
+        Deduplicated DataFrame. Index is reset.
+    """
+    df = df.copy()
+    df["_lat_r"] = df["lat"].round(3)
+    df["_lng_r"] = df["lng"].round(3)
+    dup_mask = df.duplicated(subset=["prayer", "date", "_lat_r", "_lng_r"], keep="first")
+    n_dups = dup_mask.sum()
+    if n_dups > 0:
+        logger.info(
+            "deduplicate_sightings: removing %d cross-source duplicate(s) "
+            "(same prayer+date+location):",
+            n_dups,
+        )
+        for _, row in df[dup_mask].iterrows():
+            logger.info(
+                "  %s %s lat=%.3f lng=%.3f -- %s",
+                row["prayer"].upper(),
+                row["date"],
+                row["lat"],
+                row["lng"],
+                row["source"],
+            )
+    df = df[~dup_mask].copy()
+    df = df.drop(columns=["_lat_r", "_lng_r"])
+    return df.reset_index(drop=True)
+
+
+def apply_quality_filters(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Apply geographic, temporal, and angle plausibility filters.
+
+    Removes:
+    - Future-dated records (OpenFajr publishes predictions for the full year;
+      these are not yet-observed, so they must be excluded from the dataset).
+    - Polar latitudes (|lat| > 70°): Fajr and Isha have no well-defined
+      twilight boundary at extreme latitudes.
+    - Null Island coordinates (lat~0, lng~0): GPS default / missing coordinates.
+    - Records with implausible solar depression angles:
+        Fajr: outside [FAJR_MIN_DEG, MAX_DEG]
+        Isha: outside [ISHA_MIN_DEG, MAX_DEG]
+    - Records where ``angle`` is NaN (angle computation failed).
+
+    Parameters:
+        df: DataFrame with ``date``, ``lat``, ``lng``, ``prayer``, and ``angle``
+            columns. ``date`` may be a string ("YYYY-MM-DD") or datetime.date.
+
+    Returns:
+        Filtered DataFrame. Index is reset.
+    """
+    df = df.copy()
+    today_str = str(date_type.today())
+
+    # Drop future-dated records
+    future = df["date"].astype(str) > today_str
+    if future.any():
+        logger.info(
+            "apply_quality_filters: dropping %d future-dated record(s)", future.sum()
+        )
+        df = df[~future].copy()
+
+    # Drop polar stations
+    polar = df["lat"].abs() > POLAR_LAT_THRESHOLD
+    if polar.any():
+        logger.info(
+            "apply_quality_filters: dropping %d polar record(s) (|lat| > %g°)",
+            polar.sum(),
+            POLAR_LAT_THRESHOLD,
+        )
+        df = df[~polar].copy()
+
+    # Drop Null Island
+    null_island = (df["lat"].abs() < 0.01) & (df["lng"].abs() < 0.01)
+    if null_island.any():
+        logger.info(
+            "apply_quality_filters: dropping %d Null Island record(s)", null_island.sum()
+        )
+        df = df[~null_island].copy()
+
+    # Drop implausible angles (requires "angle" column present)
+    if "angle" in df.columns:
+        fajr_bad = (df["prayer"] == "fajr") & (
+            (df["angle"] < FAJR_MIN_DEG) | (df["angle"] > MAX_DEG)
+        )
+        isha_bad = (df["prayer"] == "isha") & (
+            (df["angle"] < ISHA_MIN_DEG) | (df["angle"] > MAX_DEG)
+        )
+        bad = fajr_bad | isha_bad | df["angle"].isna()
+        if bad.any():
+            logger.info(
+                "apply_quality_filters: dropping %d record(s) with implausible "
+                "angles (Fajr: <%.0f° or >%.0f° / Isha: <%.0f° or >%.0f°)",
+                bad.sum(),
+                FAJR_MIN_DEG,
+                MAX_DEG,
+                ISHA_MIN_DEG,
+                MAX_DEG,
+            )
+            df = df[~bad].copy()
+
+    return df.reset_index(drop=True)
--- a/src/collect/data/sightings_loader.py
+++ b/src/collect/data/sightings_loader.py
@ -0,0 +1,129 @@
+"""
+sightings_loader -- raw verified sightings data and loader function.
+
+Purpose:
+    Provides the canonical VERIFIED_SIGHTINGS list (manually compiled
+    human-observed Fajr and Isha sightings), the SightingRecord TypedDict
+    schema, and load_verified_sightings() which converts the list to a
+    timezone-aware DataFrame.
+
+Input:
+    verified_sightings_data.json in the same directory -- 661 records
+    compiled from primary sources listed below. To add records, append
+    to the JSON file; the structure matches SightingRecord exactly.
+
+Output DataFrame schema (from load_verified_sightings):
+    date        - datetime.date, local calendar date
+    utc_dt      - datetime (timezone-aware UTC), computed from date_local +
+                  time_local + utc_offset
+    lat         - float, decimal degrees (north positive)
+    lng         - float, decimal degrees (east positive)
+    elevation_m - float, metres above sea level (0 = unknown, to be looked up)
+    prayer      - str, "fajr" or "isha"
+    source      - str, citation string
+    notes       - str, observer notes
+
+Key functions:
+    load_verified_sightings() -> pd.DataFrame
+        Converts VERIFIED_SIGHTINGS to a timezone-aware DataFrame.
+
+Primary sources (see verified_sightings_data.json for per-record citations):
+    [OIF]  = OIF UMSU (Observatory, University of Muhammadiyah North Sumatra)
+             Published in NRIAG J. Astronomy & Geophysics; multiple papers
+    [NRIAG] = National Research Institute of Astronomy and Geophysics, Egypt
+             Papers: Hassan et al. 2014, Rashed et al. 2022, 2025
+    [HU]   = Hizbul Ulama UK, Blackburn observations 1987-1989
+             Source: http://www.hizbululama.org.uk/files/salat_timing.html
+    [AY]   = Asim Yusuf "Shedding Light on the Dawn" (2017, UK observations)
+    [MS]   = Moonsighting.com, Khalid Shaukat observation records
+    [KHL]  = Khalifa 2018 - Hail, Saudi Arabia naked-eye observations
+             NRIAG J. Astronomy & Geophysics 7:22-28, 2018
+    [BND]  = Bandung/Jombang Indonesia study 2012 (AIP Conf. Proc. 1454)
+    [KAS]  = Kassim Bahali et al. 2018, Malaysia+Indonesia DSLR study
+             Sains Malaysia 47(11)
+
+SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
+"""
+
+import json
+from datetime import datetime, timezone, timedelta
+from pathlib import Path
+from typing import TypedDict
+
+import pandas as pd
+
+
+class SightingRecord(TypedDict):
+    prayer:      str    # "fajr" or "isha"
+    date_local:  str    # ISO date string, YYYY-MM-DD
+    time_local:  str    # HH:MM (24h) in local time
+    utc_offset:  float  # hours offset from UTC
+    lat:         float  # decimal degrees (north positive)
+    lng:         float  # decimal degrees (east positive)
+    elevation_m: float  # metres above sea level (0 = unknown, will be looked up)
+    source:      str    # citation
+    notes:       str
+
+
+# ---------------------------------------------------------------------------
+# Load VERIFIED_SIGHTINGS from the companion JSON file.
+#
+# The data lives in verified_sightings_data.json (same directory) to keep
+# this module under the 1600-line budget while allowing the dataset to grow
+# beyond 600+ records without bloating Python source files.
+#
+# To add new records: append a SightingRecord-shaped dict to the JSON array.
+# Run `python -m src.pipeline --no-elevation-lookup` after adding records.
+# ---------------------------------------------------------------------------
+
+_DATA_FILE = Path(__file__).parent / "verified_sightings_data.json"
+
+
+def _load_data() -> list[SightingRecord]:
+    """Load VERIFIED_SIGHTINGS from the companion JSON file at import time."""
+    with _DATA_FILE.open(encoding="utf-8") as fh:
+        raw: list[dict] = json.load(fh)
+    # Cast to SightingRecord -- JSON gives us float/str/int natively.
+    # Ensure elevation_m is float (JSON may store integers).
+    for rec in raw:
+        rec["utc_offset"] = float(rec["utc_offset"])
+        rec["lat"] = float(rec["lat"])
+        rec["lng"] = float(rec["lng"])
+        rec["elevation_m"] = float(rec["elevation_m"])
+    return raw  # type: ignore[return-value]
+
+
+# Module-level constant: load once on import.
+VERIFIED_SIGHTINGS: list[SightingRecord] = _load_data()
+
+
+def load_verified_sightings() -> pd.DataFrame:
+    """
+    Return all manually compiled verified sightings as a DataFrame with
+    utc_dt (timezone-aware UTC) computed from date_local + time_local + utc_offset.
+
+    Output columns: date, utc_dt, lat, lng, elevation_m, prayer, source, notes
+
+    Returns:
+        pd.DataFrame with one row per sighting. utc_dt is timezone-aware UTC.
+    """
+    rows = []
+    for s in VERIFIED_SIGHTINGS:
+        offset = timedelta(hours=s["utc_offset"])
+        local_dt = datetime.strptime(
+            f"{s['date_local']} {s['time_local']}", "%Y-%m-%d %H:%M"
+        )
+        utc_dt = (local_dt - offset).replace(tzinfo=timezone.utc)
+        rows.append(
+            {
+                "date": local_dt.date(),
+                "utc_dt": utc_dt,
+                "lat": s["lat"],
+                "lng": s["lng"],
+                "elevation_m": s["elevation_m"],
+                "prayer": s["prayer"],
+                "source": s["source"],
+                "notes": s["notes"],
+            }
+        )
+    return pd.DataFrame(rows)
--- a/src/collect/data/verified_sightings_data.json
+++ b/src/collect/data/verified_sightings_data.json
--- a/src/collect/models/init.py
+++ b/src/collect/models/init.py
@ -0,0 +1,20 @@
+"""
+models -- feature engineering for DPC angle prediction.
+
+Modules:
+  sightings_features  -- derive model input features from sightings DataFrames
+"""
+
+from .sightings_features import (
+    add_day_of_year,
+    add_seasonal_features,
+    build_feature_matrix,
+    FEATURE_COLUMNS,
+)
+
+__all__ = [
+    "add_day_of_year",
+    "add_seasonal_features",
+    "build_feature_matrix",
+    "FEATURE_COLUMNS",
+]
--- a/src/collect/models/sightings_features.py
+++ b/src/collect/models/sightings_features.py
@ -0,0 +1,132 @@
+"""
+sightings_features -- feature engineering for DPC solar depression angle prediction.
+
+Purpose:
+    Derives model-ready input features from a sightings DataFrame. The target
+    variable is the solar depression angle (fajr_angle or isha_angle). These
+    features inform the pray-calc Dynamic Prayer Calculation (DPC) formula and
+    any downstream ML model that replaces or calibrates it.
+
+Input DataFrame schema:
+    utc_dt      - datetime (timezone-aware UTC)
+    lat         - float, decimal degrees (north positive)
+    lng         - float, decimal degrees (east positive)
+    elevation_m - float, metres above sea level
+
+Output DataFrame schema (from build_feature_matrix):
+    All input columns PLUS the columns listed in FEATURE_COLUMNS:
+    day_of_year   - int, 1-366, derived from utc_dt (seasonality proxy)
+    lat_rad       - float, latitude in radians
+    sin_doy       - float, sin(2*pi*day_of_year/365.25), circular seasonal encoding
+    cos_doy       - float, cos(2*pi*day_of_year/365.25), circular seasonal encoding
+    lat_sin_doy   - float, interaction: lat * sin_doy
+    lat_cos_doy   - float, interaction: lat * cos_doy
+
+Key functions:
+    add_day_of_year(df) -> pd.DataFrame
+        Adds day_of_year column from utc_dt.
+    add_seasonal_features(df) -> pd.DataFrame
+        Adds circular sin/cos encodings and lat interactions.
+    build_feature_matrix(df) -> pd.DataFrame
+        Applies both transforms; returns df with all FEATURE_COLUMNS present.
+
+SPORT: .opencode/phases/sport/packages.md -- pray-calc-ml row
+"""
+
+from __future__ import annotations
+
+import math
+
+import pandas as pd
+
+# The complete set of feature columns added by build_feature_matrix.
+# ML models consuming these features should select from FEATURE_COLUMNS plus
+# lat, lng, elevation_m as their X matrix.
+FEATURE_COLUMNS = [
+    "day_of_year",
+    "lat_rad",
+    "sin_doy",
+    "cos_doy",
+    "lat_sin_doy",
+    "lat_cos_doy",
+]
+
+_TWO_PI = 2.0 * math.pi
+_DAYS_PER_YEAR = 365.25
+
+
+def add_day_of_year(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add ``day_of_year`` (1-366) derived from the ``utc_dt`` column.
+
+    ``day_of_year`` is the primary seasonality feature for the DPC angle
+    prediction: the solar depression angle at Fajr/Isha varies systematically
+    across the year at any given latitude.
+
+    Parameters:
+        df: DataFrame with a timezone-aware ``utc_dt`` column.
+
+    Returns:
+        Copy of df with ``day_of_year`` (int) added.
+    """
+    df = df.copy()
+    df["day_of_year"] = df["utc_dt"].apply(lambda dt: dt.timetuple().tm_yday)
+    return df
+
+
+def add_seasonal_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add circular seasonal encodings and latitude interaction terms.
+
+    Circular encoding avoids the discontinuity at day 1 / day 365 that a raw
+    ``day_of_year`` integer would introduce for models that treat features as
+    continuous. The interaction terms capture the fact that seasonal variation
+    in the depression angle is stronger at higher latitudes.
+
+    Requires ``day_of_year`` and ``lat`` columns (call add_day_of_year first
+    if ``day_of_year`` is not already present).
+
+    New columns added:
+        lat_rad      - latitude converted to radians
+        sin_doy      - sin(2*pi*day_of_year/365.25)
+        cos_doy      - cos(2*pi*day_of_year/365.25)
+        lat_sin_doy  - lat * sin_doy
+        lat_cos_doy  - lat * cos_doy
+
+    Parameters:
+        df: DataFrame with ``day_of_year`` (int) and ``lat`` (float) columns.
+
+    Returns:
+        Copy of df with the five new feature columns added.
+    """
+    df = df.copy()
+    doy = df["day_of_year"]
+    lat = df["lat"]
+
+    df["lat_rad"] = lat * (math.pi / 180.0)
+    df["sin_doy"] = (doy * _TWO_PI / _DAYS_PER_YEAR).apply(math.sin)
+    df["cos_doy"] = (doy * _TWO_PI / _DAYS_PER_YEAR).apply(math.cos)
+    df["lat_sin_doy"] = lat * df["sin_doy"]
+    df["lat_cos_doy"] = lat * df["cos_doy"]
+    return df
+
+
+def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Apply all feature engineering transforms and return the complete feature
+    matrix.
+
+    Convenience wrapper that calls add_day_of_year then add_seasonal_features.
+    After this call, df contains all columns in FEATURE_COLUMNS.
+
+    Parameters:
+        df: DataFrame with ``utc_dt`` (timezone-aware datetime) and ``lat``
+            (float) columns.
+
+    Returns:
+        Copy of df with all FEATURE_COLUMNS added. Original columns are
+        preserved unchanged.
+    """
+    df = add_day_of_year(df)
+    df = add_seasonal_features(df)
+    return df
--- a/src/collect/verified_sightings.py
+++ b/src/collect/verified_sightings.py
--- a/src/evaluate.py
+++ b/src/evaluate.py
@ -0,0 +1,202 @@
+"""
+Train and evaluate regression models for solar depression angle prediction.
+
+Inputs:  data/processed/fajr_angles.csv, data/processed/isha_angles.csv
+Outputs: precision/recall/MAE structured report to stdout
+
+Uses a time-aware 80/20 train/test split (cutoff 2023-01-01) to avoid
+look-ahead bias. Evaluates two models: Ridge Regression (baseline) and
+Gradient Boosting Regressor (recommended). See docs/ml-training-plan.md
+for feature engineering rationale and architecture decisions.
+"""
+
+from __future__ import annotations
+
+import math
+import sys
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sklearn.ensemble import GradientBoostingRegressor
+from sklearn.linear_model import Ridge
+from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
+from sklearn.preprocessing import StandardScaler
+
+
+# ── Path resolution ────────────────────────────────────────────────────────────
+
+def _repo_root() -> Path:
+    """Walk up from this file until we find the data/processed/ directory."""
+    candidate = Path(__file__).resolve().parent
+    for _ in range(5):
+        if (candidate / "data" / "processed").is_dir():
+            return candidate
+        candidate = candidate.parent
+    raise FileNotFoundError(
+        "Cannot locate data/processed/ directory. "
+        "Run this script from within the pray-calc-ml repository."
+    )
+
+
+# ── Feature construction ───────────────────────────────────────────────────────
+
+FEATURES = ["lat", "lng", "elevation_m", "day_of_year", "sin_doy", "cos_doy", "abs_lat"]
+
+TRAIN_CUTOFF = "2023-01-01"  # time-aware split boundary
+
+
+def build_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add cyclic day-of-year encoding and absolute latitude.
+
+    Cyclic sin/cos encoding prevents the year-boundary discontinuity
+    (day 365 → day 1) that a raw integer would introduce.
+    abs_lat captures the symmetric effect of polar proximity without
+    forcing the model to learn it from signed latitude.
+    """
+    df = df.copy()
+    radians = 2 * math.pi * df["day_of_year"] / 365
+    df["sin_doy"] = np.sin(radians)
+    df["cos_doy"] = np.cos(radians)
+    df["abs_lat"] = df["lat"].abs()
+    return df
+
+
+def split_data(
+    df: pd.DataFrame,
+    target_col: str,
+) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
+    """
+    Time-aware train/test split on the TRAIN_CUTOFF boundary.
+
+    Returns (X_train, y_train, X_test, y_test).
+    A random shuffle would allow future observations to inform past predictions,
+    which is unrealistic for deployment. Temporal split avoids this.
+    """
+    mask_train = df["date"] < TRAIN_CUTOFF
+    train = df[mask_train]
+    test = df[~mask_train]
+
+    if train.empty or test.empty:
+        raise ValueError(
+            f"Train/test split produced an empty set "
+            f"(train={len(train)}, test={len(test)}). "
+            f"Check TRAIN_CUTOFF={TRAIN_CUTOFF!r} against the dataset date range."
+        )
+
+    X_train = train[FEATURES]
+    y_train = train[target_col]
+    X_test = test[FEATURES]
+    y_test = test[target_col]
+    return X_train, y_train, X_test, y_test
+
+
+# ── Evaluation helpers ─────────────────────────────────────────────────────────
+
+def precision_at_threshold(y_true: np.ndarray, y_pred: np.ndarray, thr: float = 0.5) -> float:
+    """Fraction of predictions within `thr` degrees of the true value."""
+    return float(np.mean(np.abs(y_pred - y_true) < thr))
+
+
+def evaluate_model(
+    model,
+    X_train: pd.DataFrame,
+    y_train: pd.Series,
+    X_test: pd.DataFrame,
+    y_test: pd.Series,
+    scaler: StandardScaler,
+) -> dict:
+    """
+    Fit `model` on scaled training data, score on scaled test data.
+
+    Returns a dict with MAE, RMSE, R², and precision at 0.5°.
+    The scaler is fit on training data only; test data is transformed, not fit.
+    """
+    X_train_scaled = scaler.transform(X_train)
+    X_test_scaled = scaler.transform(X_test)
+
+    model.fit(X_train_scaled, y_train)
+    y_pred = model.predict(X_test_scaled)
+
+    mae = mean_absolute_error(y_test, y_pred)
+    rmse = math.sqrt(mean_squared_error(y_test, y_pred))
+    r2 = r2_score(y_test, y_pred)
+    prec = precision_at_threshold(y_test.to_numpy(), y_pred, thr=0.5)
+
+    return {"mae": mae, "rmse": rmse, "r2": r2, "prec_0_5": prec}
+
+
+# ── Reporting ──────────────────────────────────────────────────────────────────
+
+def print_results(prayer: str, results: dict[str, dict]) -> None:
+    """Print a structured results table to stdout."""
+    header = f"=== {prayer} angle prediction ==="
+    print(header)
+    print(f"{'Model':<22} {'MAE':>6} {'RMSE':>6} {'R²':>6} {'Prec@0.5°':>10}")
+    print("-" * 55)
+    for model_name, m in results.items():
+        print(
+            f"{model_name:<22} "
+            f"{m['mae']:>6.2f} "
+            f"{m['rmse']:>6.2f} "
+            f"{m['r2']:>6.2f} "
+            f"{m['prec_0_5']:>9.1%}"
+        )
+    print()
+
+
+# ── Main pipeline ──────────────────────────────────────────────────────────────
+
+def run_pipeline(csv_path: Path, target_col: str, prayer_label: str) -> None:
+    """
+    Full train/evaluate loop for one prayer dataset.
+
+    Loads the CSV, builds features, splits on the time boundary, scales,
+    then trains and scores both Ridge and GradientBoosting models.
+    """
+    df = pd.read_csv(csv_path, dtype={"date": str})
+    df = build_features(df)
+
+    X_train, y_train, X_test, y_test = split_data(df, target_col)
+
+    # Fit scaler on training data only — no data leakage from test set
+    scaler = StandardScaler()
+    scaler.fit(X_train[FEATURES])
+
+    models = {
+        "Ridge": Ridge(alpha=1.0),
+        "GradientBoosting": GradientBoostingRegressor(
+            n_estimators=200,
+            learning_rate=0.05,
+            max_depth=4,
+            subsample=0.8,
+            random_state=42,
+        ),
+    }
+
+    results = {}
+    for name, model in models.items():
+        results[name] = evaluate_model(model, X_train, y_train, X_test, y_test, scaler)
+
+    print_results(prayer_label, results)
+
+
+def main() -> None:
+    root = _repo_root()
+    processed = root / "data" / "processed"
+
+    fajr_csv = processed / "fajr_angles.csv"
+    isha_csv = processed / "isha_angles.csv"
+
+    for path in (fajr_csv, isha_csv):
+        if not path.exists():
+            print(f"ERROR: dataset not found: {path}", file=sys.stderr)
+            sys.exit(1)
+
+    run_pipeline(fajr_csv, target_col="fajr_angle", prayer_label="Fajr")
+    run_pipeline(isha_csv, target_col="isha_angle", prayer_label="Isha")
+
+
+if __name__ == "__main__":
+    main()
--- a/tests/init.py
+++ b/tests/init.py
@ -0,0 +1 @@
+# tests package
--- a/tests/test_sightings_clean.py
+++ b/tests/test_sightings_clean.py
@ -0,0 +1,223 @@
+"""
+Tests for src/collect/data/sightings_clean.py
+
+Verifies filter_non_genuine, deduplicate_sightings, and apply_quality_filters
+against known bad/good records and edge cases.
+"""
+
+import sys
+sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
+
+import pandas as pd
+import pytest
+from datetime import datetime, timezone, timedelta
+
+from src.collect.data.sightings_clean import (
+    filter_non_genuine,
+    deduplicate_sightings,
+    apply_quality_filters,
+    BAD_NOTE_MARKERS,
+    FAJR_MIN_DEG,
+    ISHA_MIN_DEG,
+    MAX_DEG,
+)
+
+
+def _make_row(prayer="fajr", date="2023-03-20", lat=40.0, lng=29.0,
+              elevation_m=100.0, source="Test source", notes="naked eye"):
+    """Build a minimal valid sighting row dict."""
+    utc_dt = datetime(2023, 3, 20, 4, 0, 0, tzinfo=timezone.utc)
+    return {
+        "prayer": prayer,
+        "date": date,
+        "utc_dt": utc_dt,
+        "lat": lat,
+        "lng": lng,
+        "elevation_m": elevation_m,
+        "source": source,
+        "notes": notes,
+    }
+
+
+class TestFilterNonGenuine:
+    def test_genuine_record_is_kept(self):
+        """A record with no bad markers should pass through unchanged."""
+        df = pd.DataFrame([_make_row(notes="naked eye; per-night obs")])
+        result = filter_non_genuine(df)
+        assert len(result) == 1
+
+    def test_bad_notes_marker_drops_row(self):
+        """A record with 'time inferred' in notes must be dropped."""
+        df = pd.DataFrame([
+            _make_row(notes="time inferred from D0=14.5°"),
+            _make_row(notes="naked eye observation"),
+        ])
+        result = filter_non_genuine(df)
+        assert len(result) == 1
+        assert "naked eye observation" in result.iloc[0]["notes"]
+
+    def test_bad_source_marker_drops_row(self):
+        """A record matching a bad marker in source must be dropped."""
+        df = pd.DataFrame([
+            _make_row(source="Umm al-Qura standard 18° Fajr, Makkah"),
+            _make_row(source="Genuine researcher 2023"),
+        ])
+        result = filter_non_genuine(df)
+        assert len(result) == 1
+
+    def test_shafaq_ahmar_dropped(self):
+        """Records with Shafaq Ahmar criterion must be dropped."""
+        df = pd.DataFrame([
+            _make_row(notes="Shafaq Ahmar (red dusk) observed"),
+            _make_row(notes="Shafaq al-Abyad (white dusk)"),
+        ])
+        result = filter_non_genuine(df)
+        assert len(result) == 1
+        assert "al-Abyad" in result.iloc[0]["notes"]
+
+    def test_case_insensitive_match(self):
+        """Bad marker matching is case-insensitive."""
+        df = pd.DataFrame([
+            _make_row(notes="AGGREGATE REPRESENTATIVE date; 4 records"),
+        ])
+        result = filter_non_genuine(df)
+        assert len(result) == 0
+
+    def test_empty_dataframe(self):
+        """Empty DataFrame returns empty DataFrame without error."""
+        df = pd.DataFrame(columns=["prayer", "date", "lat", "lng", "notes", "source"])
+        result = filter_non_genuine(df)
+        assert len(result) == 0
+
+    def test_bad_note_markers_is_nonempty(self):
+        """BAD_NOTE_MARKERS must have at least 10 entries."""
+        assert len(BAD_NOTE_MARKERS) >= 10
+
+
+class TestDeduplicateSightings:
+    def test_no_duplicates_unchanged(self):
+        """DataFrame with no duplicates is returned unchanged."""
+        rows = [
+            _make_row(lat=40.0, lng=29.0, date="2023-01-01"),
+            _make_row(lat=51.5, lng=-0.1, date="2023-01-01"),
+        ]
+        df = pd.DataFrame(rows)
+        result = deduplicate_sightings(df)
+        assert len(result) == 2
+
+    def test_exact_duplicate_removed(self):
+        """Two identical (prayer, date, lat, lng) rows: keep first, drop second."""
+        rows = [
+            _make_row(lat=40.0, lng=29.0, date="2023-06-15", source="Source A"),
+            _make_row(lat=40.0, lng=29.0, date="2023-06-15", source="Source B"),
+        ]
+        df = pd.DataFrame(rows)
+        result = deduplicate_sightings(df)
+        assert len(result) == 1
+        assert result.iloc[0]["source"] == "Source A"
+
+    def test_near_duplicate_within_111m_removed(self):
+        """Two records rounded to same 3dp lat/lng are duplicates."""
+        rows = [
+            _make_row(lat=40.001, lng=29.001, date="2023-06-15", source="Source A"),
+            _make_row(lat=40.0009, lng=29.0009, date="2023-06-15", source="Source B"),
+        ]
+        df = pd.DataFrame(rows)
+        result = deduplicate_sightings(df)
+        assert len(result) == 1
+
+    def test_same_location_different_prayer_both_kept(self):
+        """Same lat/lng but different prayer type: both kept."""
+        rows = [
+            _make_row(prayer="fajr", lat=40.0, lng=29.0, date="2023-06-15"),
+            _make_row(prayer="isha", lat=40.0, lng=29.0, date="2023-06-15"),
+        ]
+        df = pd.DataFrame(rows)
+        result = deduplicate_sightings(df)
+        assert len(result) == 2
+
+    def test_same_location_different_date_both_kept(self):
+        """Same lat/lng and prayer but different date: both kept."""
+        rows = [
+            _make_row(lat=40.0, lng=29.0, date="2023-06-15"),
+            _make_row(lat=40.0, lng=29.0, date="2023-06-16"),
+        ]
+        df = pd.DataFrame(rows)
+        result = deduplicate_sightings(df)
+        assert len(result) == 2
+
+    def test_no_lat_lng_temp_columns_in_output(self):
+        """Deduplication must not leave _lat_r or _lng_r temp columns in output."""
+        df = pd.DataFrame([_make_row()])
+        result = deduplicate_sightings(df)
+        assert "_lat_r" not in result.columns
+        assert "_lng_r" not in result.columns
+
+
+class TestApplyQualityFilters:
+    def _df_with_angle(self, prayer="fajr", angle=14.0, lat=40.0,
+                       date="2023-01-15", lng=29.0):
+        utc_dt = datetime(2023, 1, 15, 4, 0, tzinfo=timezone.utc)
+        return pd.DataFrame([{
+            "prayer": prayer, "date": date, "utc_dt": utc_dt,
+            "lat": lat, "lng": lng, "elevation_m": 100.0,
+            "angle": angle, "source": "test", "notes": "test",
+        }])
+
+    def test_valid_fajr_angle_kept(self):
+        df = self._df_with_angle(prayer="fajr", angle=14.0)
+        assert len(apply_quality_filters(df)) == 1
+
+    def test_valid_isha_angle_kept(self):
+        df = self._df_with_angle(prayer="isha", angle=17.5)
+        assert len(apply_quality_filters(df)) == 1
+
+    def test_fajr_below_min_dropped(self):
+        """Fajr angle below FAJR_MIN_DEG is dropped."""
+        df = self._df_with_angle(prayer="fajr", angle=FAJR_MIN_DEG - 0.1)
+        assert len(apply_quality_filters(df)) == 0
+
+    def test_isha_below_min_dropped(self):
+        """Isha angle below ISHA_MIN_DEG is dropped."""
+        df = self._df_with_angle(prayer="isha", angle=ISHA_MIN_DEG - 0.1)
+        assert len(apply_quality_filters(df)) == 0
+
+    def test_angle_above_max_dropped(self):
+        """Angle above MAX_DEG is dropped for both prayers."""
+        df_fajr = self._df_with_angle(prayer="fajr", angle=MAX_DEG + 0.1)
+        df_isha = self._df_with_angle(prayer="isha", angle=MAX_DEG + 0.1)
+        assert len(apply_quality_filters(df_fajr)) == 0
+        assert len(apply_quality_filters(df_isha)) == 0
+
+    def test_polar_lat_dropped(self):
+        """Records with |lat| > 70 are dropped."""
+        df = self._df_with_angle(lat=71.0)
+        assert len(apply_quality_filters(df)) == 0
+
+    def test_null_island_dropped(self):
+        """Records at lat~0, lng~0 are dropped."""
+        utc_dt = datetime(2023, 1, 15, 4, 0, tzinfo=timezone.utc)
+        df = pd.DataFrame([{
+            "prayer": "fajr", "date": "2023-01-15", "utc_dt": utc_dt,
+            "lat": 0.0, "lng": 0.0, "elevation_m": 0.0,
+            "angle": 14.0, "source": "test", "notes": "test",
+        }])
+        assert len(apply_quality_filters(df)) == 0
+
+    def test_future_date_dropped(self):
+        """Records with dates in the future are dropped."""
+        from datetime import date
+        future = str(date(date.today().year + 1, 1, 1))
+        utc_dt = datetime(date.today().year + 1, 1, 1, 4, 0, tzinfo=timezone.utc)
+        df = pd.DataFrame([{
+            "prayer": "fajr", "date": future, "utc_dt": utc_dt,
+            "lat": 40.0, "lng": 29.0, "elevation_m": 100.0,
+            "angle": 14.0, "source": "test", "notes": "test",
+        }])
+        assert len(apply_quality_filters(df)) == 0
+
+    def test_nan_angle_dropped(self):
+        """Records with NaN angle are dropped."""
+        import math
+        df = self._df_with_angle(angle=float("nan"))
+        assert len(apply_quality_filters(df)) == 0
--- a/tests/test_sightings_features.py
+++ b/tests/test_sightings_features.py
@ -0,0 +1,141 @@
+"""
+Tests for src/collect/models/sightings_features.py
+
+Verifies add_day_of_year, add_seasonal_features, and build_feature_matrix
+produce correct outputs with the expected shapes and value ranges.
+"""
+
+import sys
+sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
+
+import math
+import pytest
+import pandas as pd
+from datetime import datetime, timezone
+
+from src.collect.models.sightings_features import (
+    add_day_of_year,
+    add_seasonal_features,
+    build_feature_matrix,
+    FEATURE_COLUMNS,
+)
+
+
+def _make_df(utc_dts, lats):
+    """Build a minimal DataFrame for feature testing."""
+    rows = []
+    for dt, lat in zip(utc_dts, lats):
+        rows.append({
+            "utc_dt": dt,
+            "lat": lat,
+            "lng": 30.0,
+            "elevation_m": 100.0,
+        })
+    return pd.DataFrame(rows)
+
+
+class TestAddDayOfYear:
+    def test_march_equinox(self):
+        """March 20 should be day 79 (non-leap year)."""
+        df = _make_df([datetime(2023, 3, 20, 4, 0, tzinfo=timezone.utc)], [40.0])
+        result = add_day_of_year(df)
+        assert "day_of_year" in result.columns
+        assert result.iloc[0]["day_of_year"] == 79
+
+    def test_jan1_is_day_1(self):
+        """January 1 must be day 1."""
+        df = _make_df([datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)], [0.0])
+        result = add_day_of_year(df)
+        assert result.iloc[0]["day_of_year"] == 1
+
+    def test_dec31_is_day_365_nonleap(self):
+        """Dec 31 in a non-leap year must be day 365."""
+        df = _make_df([datetime(2023, 12, 31, 0, 0, tzinfo=timezone.utc)], [0.0])
+        result = add_day_of_year(df)
+        assert result.iloc[0]["day_of_year"] == 365
+
+    def test_dec31_is_day_366_leap(self):
+        """Dec 31 in a leap year must be day 366."""
+        df = _make_df([datetime(2024, 12, 31, 0, 0, tzinfo=timezone.utc)], [0.0])
+        result = add_day_of_year(df)
+        assert result.iloc[0]["day_of_year"] == 366
+
+    def test_original_columns_preserved(self):
+        """add_day_of_year must not remove any existing columns."""
+        df = _make_df([datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)], [50.0])
+        result = add_day_of_year(df)
+        for col in df.columns:
+            assert col in result.columns
+
+
+class TestAddSeasonalFeatures:
+    def setup_method(self):
+        dt = datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)
+        self.df = add_day_of_year(_make_df([dt], [45.0]))
+
+    def test_all_feature_columns_added(self):
+        """All five new seasonal feature columns must be present."""
+        result = add_seasonal_features(self.df)
+        for col in ("lat_rad", "sin_doy", "cos_doy", "lat_sin_doy", "lat_cos_doy"):
+            assert col in result.columns, f"Missing: {col}"
+
+    def test_sin_cos_unit_circle(self):
+        """sin_doy^2 + cos_doy^2 must equal 1.0 (unit circle)."""
+        result = add_seasonal_features(self.df)
+        for _, row in result.iterrows():
+            magnitude = math.sqrt(row["sin_doy"] ** 2 + row["cos_doy"] ** 2)
+            assert abs(magnitude - 1.0) < 1e-9, f"Not on unit circle: {magnitude}"
+
+    def test_lat_rad_conversion(self):
+        """lat_rad must equal lat * pi/180."""
+        df = add_day_of_year(_make_df(
+            [datetime(2023, 3, 20, 0, 0, tzinfo=timezone.utc)], [45.0]
+        ))
+        result = add_seasonal_features(df)
+        expected = 45.0 * (math.pi / 180.0)
+        assert abs(result.iloc[0]["lat_rad"] - expected) < 1e-10
+
+    def test_latitude_zero_gives_zero_interactions(self):
+        """At lat=0, lat_sin_doy and lat_cos_doy must be 0."""
+        df = add_day_of_year(_make_df(
+            [datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc)], [0.0]
+        ))
+        result = add_seasonal_features(df)
+        assert result.iloc[0]["lat_sin_doy"] == 0.0
+        assert result.iloc[0]["lat_cos_doy"] == 0.0
+
+
+class TestBuildFeatureMatrix:
+    def test_all_feature_columns_present(self):
+        """build_feature_matrix must add all FEATURE_COLUMNS."""
+        dt = datetime(2023, 9, 22, 0, 0, tzinfo=timezone.utc)
+        df = _make_df([dt], [30.0])
+        result = build_feature_matrix(df)
+        for col in FEATURE_COLUMNS:
+            assert col in result.columns, f"Missing feature: {col}"
+
+    def test_original_columns_preserved(self):
+        """build_feature_matrix must not remove original columns."""
+        dt = datetime(2023, 9, 22, 0, 0, tzinfo=timezone.utc)
+        df = _make_df([dt], [30.0])
+        result = build_feature_matrix(df)
+        for col in ("utc_dt", "lat", "lng", "elevation_m"):
+            assert col in result.columns
+
+    def test_multi_row_input(self):
+        """build_feature_matrix must handle multiple rows correctly."""
+        dts = [
+            datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc),
+            datetime(2023, 6, 21, 0, 0, tzinfo=timezone.utc),
+            datetime(2023, 12, 31, 0, 0, tzinfo=timezone.utc),
+        ]
+        lats = [10.0, 40.0, -30.0]
+        df = _make_df(dts, lats)
+        result = build_feature_matrix(df)
+        assert len(result) == 3
+        assert result.iloc[0]["day_of_year"] == 1
+        assert result.iloc[2]["day_of_year"] == 365
+
+    def test_feature_columns_constant(self):
+        """FEATURE_COLUMNS list must have exactly 6 elements."""
+        assert len(FEATURE_COLUMNS) == 6
--- a/tests/test_sightings_loader.py
+++ b/tests/test_sightings_loader.py
@ -0,0 +1,116 @@
+"""
+Tests for src/collect/data/sightings_loader.py
+
+Verifies that VERIFIED_SIGHTINGS has the expected structure, SightingRecord
+TypedDict fields are consistent, and load_verified_sightings() returns a
+correctly shaped DataFrame with UTC-aware timestamps.
+"""
+
+import pytest
+from datetime import timezone
+
+import sys
+sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
+
+from src.collect.data.sightings_loader import (
+    VERIFIED_SIGHTINGS,
+    SightingRecord,
+    load_verified_sightings,
+)
+
+
+class TestVerifiedSightingsData:
+    def test_sightings_list_is_nonempty(self):
+        """VERIFIED_SIGHTINGS must contain at least one record."""
+        assert len(VERIFIED_SIGHTINGS) > 0, "VERIFIED_SIGHTINGS is empty"
+
+    def test_all_records_have_required_keys(self):
+        """Every record must contain all SightingRecord fields."""
+        required = {"prayer", "date_local", "time_local", "utc_offset", "lat", "lng",
+                    "elevation_m", "source", "notes"}
+        missing_any = []
+        for i, rec in enumerate(VERIFIED_SIGHTINGS):
+            missing = required - set(rec.keys())
+            if missing:
+                missing_any.append((i, missing))
+        assert not missing_any, f"Records missing keys: {missing_any[:3]}"
+
+    def test_prayer_values_are_fajr_or_isha(self):
+        """Each record's prayer field must be 'fajr' or 'isha'."""
+        invalid = [r for r in VERIFIED_SIGHTINGS if r["prayer"] not in ("fajr", "isha")]
+        assert not invalid, f"{len(invalid)} records with invalid prayer value"
+
+    def test_date_local_format(self):
+        """date_local must be in YYYY-MM-DD format."""
+        from datetime import datetime
+        invalid = []
+        for i, r in enumerate(VERIFIED_SIGHTINGS):
+            try:
+                datetime.strptime(r["date_local"], "%Y-%m-%d")
+            except ValueError:
+                invalid.append((i, r["date_local"]))
+        assert not invalid, f"Invalid date_local formats (first 3): {invalid[:3]}"
+
+    def test_time_local_format(self):
+        """time_local must be HH:MM (24-hour)."""
+        from datetime import datetime
+        invalid = []
+        for i, r in enumerate(VERIFIED_SIGHTINGS):
+            try:
+                datetime.strptime(r["time_local"], "%H:%M")
+            except ValueError:
+                invalid.append((i, r["time_local"]))
+        assert not invalid, f"Invalid time_local formats (first 3): {invalid[:3]}"
+
+    def test_lat_lng_are_numeric_and_in_range(self):
+        """lat must be in [-90, 90], lng in [-180, 180]."""
+        bad_lat = [r for r in VERIFIED_SIGHTINGS if not (-90 <= r["lat"] <= 90)]
+        bad_lng = [r for r in VERIFIED_SIGHTINGS if not (-180 <= r["lng"] <= 180)]
+        assert not bad_lat, f"{len(bad_lat)} records with lat out of range"
+        assert not bad_lng, f"{len(bad_lng)} records with lng out of range"
+
+
+class TestLoadVerifiedSightings:
+    def setup_method(self):
+        self.df = load_verified_sightings()
+
+    def test_returns_dataframe(self):
+        """load_verified_sightings must return a pandas DataFrame."""
+        import pandas as pd
+        assert isinstance(self.df, pd.DataFrame)
+
+    def test_row_count_matches_list(self):
+        """DataFrame must have the same number of rows as VERIFIED_SIGHTINGS."""
+        assert len(self.df) == len(VERIFIED_SIGHTINGS)
+
+    def test_required_columns_present(self):
+        """All required output columns must be present."""
+        required = {"date", "utc_dt", "lat", "lng", "elevation_m", "prayer", "source", "notes"}
+        missing = required - set(self.df.columns)
+        assert not missing, f"Missing columns: {missing}"
+
+    def test_utc_dt_is_timezone_aware(self):
+        """utc_dt column must contain timezone-aware datetimes in UTC."""
+        for dt in self.df["utc_dt"].head(10):
+            assert dt.tzinfo is not None, f"utc_dt {dt} is not timezone-aware"
+            assert dt.tzinfo == timezone.utc, f"utc_dt {dt} is not UTC"
+
+    def test_prayer_column_values(self):
+        """prayer column must only contain 'fajr' or 'isha'."""
+        unique = set(self.df["prayer"].unique())
+        assert unique <= {"fajr", "isha"}, f"Unexpected prayer values: {unique}"
+
+    def test_utc_conversion_blackburn_fajr(self):
+        """Spot-check UTC conversion for a known Blackburn BST record.
+
+        1987-09-21, 05:30 BST (UTC+1) should convert to 04:30 UTC.
+        """
+        blackburn = self.df[
+            (self.df["date"].astype(str) == "1987-09-21")
+            & (self.df["prayer"] == "fajr")
+            & (self.df["lat"].round(2) == 53.75)
+        ]
+        assert len(blackburn) >= 1, "Could not find Blackburn 1987-09-21 fajr record"
+        utc_dt = blackburn.iloc[0]["utc_dt"]
+        assert utc_dt.hour == 4, f"Expected hour 4 (UTC), got {utc_dt.hour}"
+        assert utc_dt.minute == 30, f"Expected minute 30, got {utc_dt.minute}"
--- a/tests/test_sightings_stats.py
+++ b/tests/test_sightings_stats.py
@ -0,0 +1,173 @@
+"""
+Tests for src/collect/analysis/sightings_stats.py
+
+Verifies angle_summary, geographic_coverage, and print_dataset_report
+return correct types and values. Plot functions are tested for return type only
+(no display).
+"""
+
+import sys
+sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent.parent))
+
+import pytest
+import pandas as pd
+from datetime import datetime, timezone
+
+from src.collect.analysis.sightings_stats import (
+    angle_summary,
+    geographic_coverage,
+    plot_angle_distribution,
+    plot_angle_vs_latitude,
+    plot_angle_vs_day_of_year,
+    print_dataset_report,
+)
+
+
+def _make_fajr_df(n=10):
+    """Build a minimal Fajr DataFrame with day_of_year and fajr_angle."""
+    rows = []
+    for i in range(n):
+        day = (i * 36) + 1  # spread across the year
+        dt = datetime(2023, 1, 1, 4, 0, tzinfo=timezone.utc)
+        rows.append({
+            "date": f"2023-01-{i+1:02d}",
+            "utc_dt": dt,
+            "lat": 30.0 + i * 0.5,
+            "lng": 29.0,
+            "elevation_m": 100.0,
+            "prayer": "fajr",
+            "fajr_angle": 12.0 + i * 0.3,
+            "day_of_year": day,
+            "source": f"Source {i}",
+            "notes": "test",
+        })
+    return pd.DataFrame(rows)
+
+
+def _make_isha_df(n=5):
+    """Build a minimal Isha DataFrame."""
+    rows = []
+    for i in range(n):
+        rows.append({
+            "date": f"2023-03-{i+1:02d}",
+            "utc_dt": datetime(2023, 3, i+1, 19, 0, tzinfo=timezone.utc),
+            "lat": 25.0 + i,
+            "lng": 46.0,
+            "elevation_m": 620.0,
+            "prayer": "isha",
+            "isha_angle": 17.0 + i * 0.2,
+            "day_of_year": 60 + i,
+            "source": f"Isha Source {i}",
+            "notes": "test",
+        })
+    return pd.DataFrame(rows)
+
+
+class TestAngleSummary:
+    def test_returns_series(self):
+        """angle_summary must return a pandas Series."""
+        df = _make_fajr_df()
+        result = angle_summary(df, "fajr_angle")
+        assert isinstance(result, pd.Series)
+
+    def test_count_is_correct(self):
+        """count in summary must equal number of non-NaN angle rows."""
+        df = _make_fajr_df(10)
+        result = angle_summary(df, "fajr_angle")
+        assert result["count"] == 10.0
+
+    def test_mean_is_plausible(self):
+        """Mean angle must be within the data range."""
+        df = _make_fajr_df(10)
+        result = angle_summary(df, "fajr_angle")
+        assert df["fajr_angle"].min() <= result["mean"] <= df["fajr_angle"].max()
+
+    def test_isha_angle_col(self):
+        """Works correctly with isha_angle column."""
+        df = _make_isha_df()
+        result = angle_summary(df, "isha_angle")
+        assert result["count"] == 5.0
+
+
+class TestGeographicCoverage:
+    def test_returns_dict(self):
+        """geographic_coverage must return a dict."""
+        df = _make_fajr_df()
+        result = geographic_coverage(df)
+        assert isinstance(result, dict)
+
+    def test_required_keys_present(self):
+        """All expected keys must be in the result dict."""
+        df = _make_fajr_df()
+        result = geographic_coverage(df)
+        for key in ("lat_min", "lat_max", "unique_locs", "date_min", "date_max", "total_records"):
+            assert key in result, f"Missing key: {key}"
+
+    def test_total_records_correct(self):
+        """total_records must equal len(df)."""
+        df = _make_fajr_df(7)
+        result = geographic_coverage(df)
+        assert result["total_records"] == 7
+
+    def test_lat_range_correct(self):
+        """lat_min and lat_max must match DataFrame lat column bounds."""
+        df = _make_fajr_df(10)
+        result = geographic_coverage(df)
+        assert abs(result["lat_min"] - df["lat"].min()) < 1e-6
+        assert abs(result["lat_max"] - df["lat"].max()) < 1e-6
+
+    def test_unique_locs_counts_groups(self):
+        """unique_locs must count distinct (lat, lng) pairs."""
+        df = _make_fajr_df(10)
+        expected = len(df.groupby(["lat", "lng"]))
+        result = geographic_coverage(df)
+        assert result["unique_locs"] == expected
+
+
+class TestPrintDatasetReport:
+    def test_runs_without_error(self, capsys):
+        """print_dataset_report must run to completion without exceptions."""
+        fajr = _make_fajr_df(5)
+        isha = _make_isha_df(3)
+        print_dataset_report(fajr, isha)
+        out = capsys.readouterr().out
+        assert "Fajr dataset: 5" in out
+        assert "Isha dataset: 3" in out
+
+    def test_empty_datasets(self, capsys):
+        """print_dataset_report handles empty DataFrames without crashing."""
+        fajr = pd.DataFrame(columns=["fajr_angle"])
+        isha = pd.DataFrame(columns=["isha_angle"])
+        print_dataset_report(fajr, isha)
+        out = capsys.readouterr().out
+        assert "Fajr dataset: 0" in out
+        assert "Isha dataset: 0" in out
+
+
+class TestPlotFunctions:
+    def test_plot_angle_distribution_returns_figure(self):
+        """plot_angle_distribution must return a matplotlib Figure."""
+        import matplotlib
+        matplotlib.use("Agg")  # non-interactive backend for tests
+        df = _make_fajr_df(10)
+        fig = plot_angle_distribution(df, "fajr_angle")
+        import matplotlib.figure
+        assert isinstance(fig, matplotlib.figure.Figure)
+
+    def test_plot_angle_vs_latitude_returns_figure(self):
+        """plot_angle_vs_latitude must return a matplotlib Figure."""
+        import matplotlib
+        matplotlib.use("Agg")
+        df = _make_fajr_df(10)
+        fig = plot_angle_vs_latitude(df, "fajr_angle")
+        import matplotlib.figure
+        assert isinstance(fig, matplotlib.figure.Figure)
+
+    def test_plot_angle_vs_day_of_year_returns_figure(self):
+        """plot_angle_vs_day_of_year must return a matplotlib Figure."""
+        import matplotlib
+        matplotlib.use("Agg")
+        df = _make_fajr_df(10)
+        fig = plot_angle_vs_day_of_year(df, "fajr_angle")
+        import matplotlib.figure
+        assert isinstance(fig, matplotlib.figure.Figure)