mirror of
https://github.com/acamarata/pray-calc-ml.git
synced 2026-07-01 03:14:27 +00:00
137 lines
5.2 KiB
Markdown
# Dataset Schema: Verified Twilight Sightings

This document describes every column in the processed verified-sightings CSV files used for ML training.

**Files covered:**

- `data/processed/fajr_angles.csv`: Fajr (pre-dawn) twilight observations
- `data/processed/isha_angles.csv`: Isha (post-dusk) twilight observations

**Row count (as of latest build):**

- `fajr_angles.csv`: 48,668 rows
- `isha_angles.csv`: 34,529 rows

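As a rough illustration of how these files might be read, the sketch below parses CSV text with the standard library and converts the numeric columns to Python types. The column order, the `load_rows` helper, and the one-row sample (assembled from the per-column examples in this document) are assumptions for illustration, not taken from the repository's code. Reading from a file handle opened on `data/processed/fajr_angles.csv` would work the same way.

```python
import csv
import io

# Synthetic one-row sample built from the example values in this document.
# The column order is an assumption; check the real file header.
SAMPLE_CSV = (
    "date,utc_dt,lat,lng,elevation_m,day_of_year,fajr_angle,source,notes\n"
    "2024-02-02,2024-02-02 00:12:00+00:00,-62.59334,15.46875,"
    "-5007.54,33,9.508,globe_at_night,\n"
)

FLOAT_COLUMNS = ("lat", "lng", "elevation_m", "fajr_angle")


def load_rows(handle, float_columns=FLOAT_COLUMNS):
    """Read CSV rows and convert numeric columns (hypothetical helper)."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for name in float_columns:
            row[name] = float(row[name])
        row["day_of_year"] = int(row["day_of_year"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
```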
---

## Column Definitions

### `date`

- **Type:** string (ISO 8601 date)
- **Format:** `YYYY-MM-DD`
- **Example:** `2024-02-02`
- **Description:** Calendar date of the observation in UTC. Used for time-aware train/test splitting. Not used directly as a model feature.
- **Null policy:** Never null. Every row has a valid date.
- **Units:** N/A (date string)

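One way to picture a time-aware split on `date` is to send every row before a cutoff to the training set and the rest to the test set, so that no test observation predates a training one. The sketch below assumes rows are dictionaries keyed by column name. The `time_aware_split` helper, the cutoff date, and the sample rows are hypothetical; the project's actual split logic may differ.

```python
from datetime import date


def time_aware_split(rows, cutoff):
    """Rows dated before `cutoff` go to train, the rest to test."""
    train, test = [], []
    for row in rows:
        observed = date.fromisoformat(row["date"])
        (train if observed < cutoff else test).append(row)
    return train, test


# Illustrative rows only; real rows carry the full set of columns.
rows = [
    {"date": "2023-11-30"},
    {"date": "2024-02-02"},
    {"date": "2024-06-15"},
]
train, test = time_aware_split(rows, cutoff=date(2024, 1, 1))
```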
---

### `utc_dt`

- **Type:** string (ISO 8601 datetime with UTC timezone)
- **Format:** `YYYY-MM-DD HH:MM:SS+00:00`
- **Example:** `2024-02-02 00:12:00+00:00`
- **Description:** UTC timestamp of the twilight event. The time component records the moment of observation and is the input for computing the solar position at that instant. Not used directly as a model feature, to avoid timestamp leakage.
- **Null policy:** Never null.
- **Units:** N/A (timestamp)

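The documented format can be parsed with `datetime.fromisoformat` from the Python standard library. The following sketch parses the example value and rejects anything that is not at a zero UTC offset. The `parse_utc_dt` helper is hypothetical, and the consistency check against `date` is an assumption about how the two columns relate.

```python
from datetime import datetime, timedelta


def parse_utc_dt(value):
    """Parse a `utc_dt` string and insist on a zero UTC offset."""
    parsed = datetime.fromisoformat(value)
    if parsed.utcoffset() != timedelta(0):
        raise ValueError(f"expected a +00:00 offset, got {value!r}")
    return parsed


stamp = parse_utc_dt("2024-02-02 00:12:00+00:00")

# Assumed invariant: the date part of `utc_dt` equals the `date` column.
matches_date = stamp.date().isoformat() == "2024-02-02"
```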
---

### `lat`

- **Type:** float
- **Range:** -90.0 to 90.0
- **Example:** `-62.59334`
- **Description:** Geographic latitude of the observation site in decimal degrees. Negative values are south of the equator.
- **Null policy:** Never null.
- **Units:** Decimal degrees

---

### `lng`

- **Type:** float
- **Range:** -180.0 to 180.0
- **Example:** `15.46875`
- **Description:** Geographic longitude of the observation site in decimal degrees. Negative values are west of the prime meridian.
- **Null policy:** Never null.
- **Units:** Decimal degrees

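The ranges and sign conventions for `lat` and `lng` can be checked with a few lines of code. This is a minimal sketch; the function names are hypothetical and not part of the project.

```python
def coordinates_in_range(lat, lng):
    """True when both coordinates fall inside the documented ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0


def hemispheres(lat, lng):
    """Describe the sign convention: negative is south / west."""
    north_south = "S" if lat < 0 else "N"
    east_west = "W" if lng < 0 else "E"
    return north_south + east_west


valid = coordinates_in_range(-62.59334, 15.46875)
quadrant = hemispheres(-62.59334, 15.46875)
```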
---

### `elevation_m`

- **Type:** float
- **Range:** Unconstrained (negative values occur over ocean; positive values over land)
- **Example:** `-5007.54` (ocean), `432.0` (land station)
- **Description:** Elevation of the observation site in meters above sea level. Negative values are artifacts from ocean-based sky brightness sensors where the bathymetric elevation was used. Elevation affects the optical path through the atmosphere, which in turn affects the apparent twilight depression angle.
- **Null policy:** Never null.
- **Units:** Meters

---

### `day_of_year`

- **Type:** integer
- **Range:** 1 to 365 (366 in leap years)
- **Example:** `33`
- **Description:** Ordinal day of the year (sometimes loosely called the "Julian day") for the observation date. Day 1 is January 1. Used as a cyclic feature in the ML pipeline (encoded as a sin/cos pair to avoid the year-boundary discontinuity). Correlated with solar declination.
- **Null policy:** Never null.
- **Units:** Day number (dimensionless)

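The relationship between `date` and `day_of_year` can be reproduced with the standard library, which is handy as a consistency check. In this sketch the helper name is hypothetical; the example date `2024-02-02` maps to day 33, matching the example values in this document.

```python
from datetime import date


def day_of_year(iso_date):
    """Ordinal day of the year for a `YYYY-MM-DD` string (1 = January 1)."""
    return date.fromisoformat(iso_date).timetuple().tm_yday


example_day = day_of_year("2024-02-02")
leap_year_end = day_of_year("2024-12-31")  # 2024 is a leap year
```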
---

### `fajr_angle` (fajr_angles.csv only)

- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `9.508`
- **Description:** Computed solar depression angle below the horizon at the moment of Fajr (pre-dawn astronomical twilight onset). This is the target variable (y) for Fajr ML training. Positive values indicate degrees below the horizon. Computed by back-calculating the sun's altitude at the recorded twilight timestamp using the NREL Solar Position Algorithm (SPA).
- **Null policy:** Never null. Rows with uncomputable angles are excluded during cleaning.
- **Units:** Degrees (below horizon, positive)

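To make the back-calculation concrete, the sketch below estimates the sun's altitude from a UTC timestamp and site coordinates, then negates it to get a depression angle. It is not NREL SPA: it swaps in a common low-precision solar position approximation, ignores atmospheric refraction and elevation, and is only meant to show the sign convention and the shape of the computation. The function name is hypothetical.

```python
import math
from datetime import datetime, timezone


def solar_depression_deg(when, lat, lng):
    """Approximate solar depression (degrees below horizon, positive).

    `when` must be a timezone-aware datetime. Low-precision formulas,
    no refraction correction; NOT a substitute for NREL SPA.
    """
    # Days since the J2000.0 epoch (about 2000-01-01 12:00 UTC).
    n = when.timestamp() / 86400.0 - 10957.5

    mean_lon = (280.460 + 0.9856474 * n) % 360.0
    mean_anom = math.radians((357.528 + 0.9856003 * n) % 360.0)
    ecl_lon = math.radians(
        mean_lon + 1.915 * math.sin(mean_anom) + 0.020 * math.sin(2 * mean_anom)
    )
    obliquity = math.radians(23.439 - 0.0000004 * n)

    # Equatorial coordinates of the sun.
    right_asc = math.atan2(
        math.cos(obliquity) * math.sin(ecl_lon), math.cos(ecl_lon)
    )
    declination = math.asin(math.sin(obliquity) * math.sin(ecl_lon))

    # Local hour angle from Greenwich mean sidereal time (east-positive lng).
    gmst = math.radians((280.46061837 + 360.98564736629 * n) % 360.0)
    hour_angle = gmst + math.radians(lng) - right_asc

    phi = math.radians(lat)
    sin_alt = math.sin(phi) * math.sin(declination) + math.cos(phi) * math.cos(
        declination
    ) * math.cos(hour_angle)
    altitude = math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))
    return -altitude  # depression is the negated altitude


# Example values taken from the `utc_dt`, `lat`, and `lng` entries.
when = datetime(2024, 2, 2, 0, 12, tzinfo=timezone.utc)
depression = solar_depression_deg(when, -62.59334, 15.46875)
```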
---

### `isha_angle` (isha_angles.csv only)

- **Type:** float
- **Range:** Typically 5.0 to 25.0 degrees (outliers possible)
- **Example:** `16.074`
- **Description:** Computed solar depression angle below the horizon at the moment of Isha (post-dusk astronomical twilight end). This is the target variable (y) for Isha ML training. Same convention as `fajr_angle`: positive values are degrees below the horizon.
- **Null policy:** Never null.
- **Units:** Degrees (below horizon, positive)

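Because the 5.0 to 25.0 degree range is described as typical rather than enforced, a consumer may want to flag values outside it instead of dropping them. The sketch below shows one possible check. The constant and function names are hypothetical, and treating the bounds as inclusive is an assumption.

```python
TYPICAL_MIN_DEG = 5.0
TYPICAL_MAX_DEG = 25.0


def outside_typical_range(angle_deg, low=TYPICAL_MIN_DEG, high=TYPICAL_MAX_DEG):
    """True when an angle falls outside the documented typical range."""
    return not (low <= angle_deg <= high)


example_is_outlier = outside_typical_range(16.074)
```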
---

### `source`

- **Type:** string (categorical)
- **Values:** `globe_at_night`, `bsrn`, `surfrad`, `galicia_sqm`, `madrid_sqm`, `india_twilight`, `majadahonda_sqm`, `gan_mn`, `tess`, `washetdonker`
- **Example:** `globe_at_night`
- **Description:** Data source identifier, indicating which collection pipeline produced the row. Not used as a model feature (it is categorical and carries a risk of label leakage). Used for stratified analysis and debugging.
- **Null policy:** Never null.
- **Units:** N/A (categorical label)

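A stratified look at the data by `source` can be as simple as grouping the target angle per source and summarizing each group. The following sketch assumes rows are dictionaries keyed by column name. The helper name and the angle values are made up for illustration and are not real measurements.

```python
from collections import defaultdict
from statistics import mean


def mean_angle_by_source(rows, angle_column):
    """Average the given angle column within each `source` value."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row[angle_column])
    return {source: mean(values) for source, values in grouped.items()}


# Synthetic rows for illustration only.
rows = [
    {"source": "bsrn", "isha_angle": 16.0},
    {"source": "bsrn", "isha_angle": 18.0},
    {"source": "tess", "isha_angle": 15.0},
]
summary = mean_angle_by_source(rows, "isha_angle")
```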
---

### `notes`

- **Type:** string (free text)
- **Example:** `""` (empty string is most common)
- **Description:** Optional annotation attached during data cleaning or manual curation. May contain flags like `high_elevation`, `coastal`, or quality notes from the source dataset. Not used in ML training.
- **Null policy:** May be empty string. Treat as optional metadata.
- **Units:** N/A (free text)

---

## Derived Features Used in Training

The following columns are computed during training and are not stored in the CSV files:

| Derived feature | Formula | Purpose |
|---|---|---|
| `sin_doy` | sin(2π × day_of_year / 365) | Cyclic seasonal encoding |
| `cos_doy` | cos(2π × day_of_year / 365) | Cyclic seasonal encoding pair |
| `abs_lat` | abs(lat) | Polar proximity (symmetric across equator) |

These are constructed by `src/evaluate.py` at training time. See `docs/ml-training-plan.md` for the full feature engineering rationale.
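The three formulas translate directly into code. The sketch below is a standalone illustration with a hypothetical `derive_features` helper, not the code in `src/evaluate.py`. It also shows why the sin/cos pair is used: day 365 and day 1 end up close together in the encoded space, even though the raw day numbers are far apart.

```python
import math


def derive_features(day_of_year, lat):
    """Compute sin_doy, cos_doy, and abs_lat from the stored columns."""
    angle = 2.0 * math.pi * day_of_year / 365.0
    return {
        "sin_doy": math.sin(angle),
        "cos_doy": math.cos(angle),
        "abs_lat": abs(lat),
    }


features = derive_features(33, -62.59334)

# Distance between the last and first day of the year in the encoded space.
dec_31 = derive_features(365, 0.0)
jan_1 = derive_features(1, 0.0)
year_boundary_gap = math.hypot(
    dec_31["sin_doy"] - jan_1["sin_doy"],
    dec_31["cos_doy"] - jan_1["cos_doy"],
)
```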