pray-calc-ml/data/SCHEMA.md

137 lines
5.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Dataset Schema: Verified Twilight Sightings
Documents all columns in the processed verified-sightings CSV files used for ML training.
**Files covered:**
- `data/processed/fajr_angles.csv` — Fajr (pre-dawn) twilight observations
- `data/processed/isha_angles.csv` — Isha (post-dusk) twilight observations
**Row count (as of latest build):**
- fajr_angles.csv: 48,668 rows
- isha_angles.csv: 34,529 rows
---
## Column Definitions
### `date`
- **Type:** string (ISO 8601 date)
- **Format:** `YYYY-MM-DD`
- **Example:** `2024-02-02`
- **Description:** Calendar date of the observation in UTC. Used for time-aware train/test splitting. Not used directly as a model feature.
- **Null policy:** Never null. Every row has a valid date.
- **Units:** N/A (date string)
---
### `utc_dt`
- **Type:** string (ISO 8601 datetime with UTC timezone)
- **Format:** `YYYY-MM-DD HH:MM:SS+00:00`
- **Example:** `2024-02-02 00:12:00+00:00`
- **Description:** UTC timestamp of the twilight event. The time component captures the moment of observation, used when computing solar position at the exact instant. Not used directly as a model feature to avoid timestamp leakage.
- **Null policy:** Never null.
- **Units:** N/A (timestamp)
---
### `lat`
- **Type:** float
- **Range:** -90.0 to 90.0
- **Example:** `-62.59334`
- **Description:** Geographic latitude of the observation site in decimal degrees. Negative values are south of the equator.
- **Null policy:** Never null.
- **Units:** Decimal degrees
---
### `lng`
- **Type:** float
- **Range:** -180.0 to 180.0
- **Example:** `15.46875`
- **Description:** Geographic longitude of the observation site in decimal degrees. Negative values are west of the prime meridian.
- **Null policy:** Never null.
- **Units:** Decimal degrees
---
### `elevation_m`
- **Type:** float
- **Range:** Unconstrained (negative values occur over ocean; positive values over land)
- **Example:** `-5007.54` (ocean), `432.0` (land station)
- **Description:** Elevation of the observation site in meters above sea level. Negative values are artifacts from ocean-based sky brightness sensors where the bathymetric elevation was used. Elevation affects the optical path through the atmosphere, which in turn affects the apparent twilight depression angle.
- **Null policy:** Never null.
- **Units:** Meters
---
### `day_of_year`
- **Type:** integer
- **Range:** 1 to 365 (366 in leap years)
- **Example:** `33`
- **Description:** Julian day of the year for the observation date. Day 1 is January 1. Used as a cyclic feature in the ML pipeline (encoded as sin/cos pair to avoid the year-boundary discontinuity). Correlated with solar declination.
- **Null policy:** Never null.
- **Units:** Day number (dimensionless)
---
### `fajr_angle` (fajr_angles.csv only)
- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `9.508`
- **Description:** Computed solar depression angle below the horizon at the moment of Fajr (pre-dawn astronomical twilight onset). This is the target variable (y) for Fajr ML training. Positive values indicate degrees below the horizon. Computed by back-calculating the sun's altitude at the recorded twilight timestamp using NREL SPA.
- **Null policy:** Never null. Rows with uncomputable angles are excluded during cleaning.
- **Units:** Degrees (below horizon, positive)
---
### `isha_angle` (isha_angles.csv only)
- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `16.074`
- **Description:** Computed solar depression angle below the horizon at the moment of Isha (post-dusk astronomical twilight end). This is the target variable (y) for Isha ML training. Same convention as fajr_angle: positive values are degrees below the horizon.
- **Null policy:** Never null.
- **Units:** Degrees (below horizon, positive)
---
### `source`
- **Type:** string (categorical)
- **Values:** `globe_at_night`, `bsrn`, `surfrad`, `galicia_sqm`, `madrid_sqm`, `india_twilight`, `majadahonda_sqm`, `gan_mn`, `tess`, `washetdonker`
- **Example:** `globe_at_night`
- **Description:** Data source identifier. Identifies which collection pipeline produced this row. Not used as a model feature (categorical with potential label-leakage). Used for stratified analysis and debugging.
- **Null policy:** Never null.
- **Units:** N/A (categorical label)
---
### `notes`
- **Type:** string (free text)
- **Example:** `""` (empty string is most common)
- **Description:** Optional annotation attached during data cleaning or manual curation. May contain flags like `high_elevation`, `coastal`, or quality notes from the source dataset. Not used in ML training.
- **Null policy:** May be empty string. Treat as optional metadata.
- **Units:** N/A (free text)
---
## Derived Features Used in Training
The following columns are computed during training and are not stored in the CSV:
| Derived feature | Formula | Purpose |
|---|---|---|
| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic seasonal encoding |
| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic seasonal encoding pair |
| `abs_lat` | abs(lat) | Polar proximity (symmetric across equator) |
These are constructed by `src/evaluate.py` at training time. See `docs/ml-training-plan.md`
for the full feature engineering rationale.