pray-calc-ml/.github/wiki/Architecture.md
Aric Camarata d8471f8ca5 chore: superclean compliance pass
- Migrate .wiki/ to .github/wiki/ (GCI standard for public repos)
- Add _Sidebar.md for GitHub Wiki navigation
- Update wiki-sync.yml to reference .github/wiki/ path
- Remove .markdownlintignore (covered by .vscode/settings.json)
- Migrate .allow-ai-terms to ALLOW_AI_TERMS_REPOS in pre-commit hook
- Expand .gitignore with full IDE and AI agent directory list
- Update README project structure reference
2026-02-28 11:55:08 -05:00


# Architecture
This page explains how the pipeline works end-to-end: how raw sighting records become
training data, what each module does, and how the pieces fit together.
---
## Overview
```
Raw sighting data
  [openfajr.py]            OpenFajr iCal feed (Birmingham, UK, 2016-present)
  [verified_sightings.py]  Manually compiled records (35+ locations worldwide)
  [geocode.py]             Geocoding: city/region names → lat/lng
        ↓
Standardized records: { date, lat, lng, elevation_m, time_local, utc_offset }
        ↓
[elevation.py]   Open-Elevation API: fill missing elevation_m values
        ↓
[angle_calc.py]  PyEphem back-calculation: UTC moment → solar depression angle
        ↓
[pipeline.py]    Quality filter: drop implausible angles (< 7° Fajr / < 10° Isha)
        ↓
data/processed/fajr_angles.csv
data/processed/isha_angles.csv
        ↓
[01_exploratory_analysis.ipynb]  EDA + linear baseline + gradient boosting
```
---
## Modules
### `src/pipeline.py`
The master script. Runs all steps in sequence.
```
python -m src.pipeline [--no-elevation-lookup]
```
Responsibilities:
1. Call `openfajr.load()` and `verified_sightings.load()` to get raw records
2. Call `elevation.enrich()` to fill missing elevation values
3. Call `angle_calc.compute()` for each record
4. Drop records with implausible angles
5. Write `fajr_angles.csv` and `isha_angles.csv`
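The optional flag in the usage line above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the project's actual code: the `parse_args` helper and the `elevation_lookup` attribute name are assumptions made for illustration; only the `--no-elevation-lookup` flag itself comes from this page.

```python
import argparse


def parse_args(argv=None):
    """Parse pipeline options (hypothetical helper, for illustration only)."""
    parser = argparse.ArgumentParser(prog="python -m src.pipeline")
    # store_false: the lookup runs by default and the flag switches it off.
    parser.add_argument(
        "--no-elevation-lookup",
        dest="elevation_lookup",
        action="store_false",
        help="skip the Open-Elevation API step",
    )
    return parser.parse_args(argv)
```

With this shape, step 2 would simply be skipped when `args.elevation_lookup` is false.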
### `src/angle_calc.py`
The back-calculation engine. Takes a confirmed sighting record and returns the solar
depression angle at the observed moment.
**Method:**
1. Convert local time to UTC: `utc = local_dt - timedelta(hours=utc_offset)`
2. Set up an `ephem.Observer` with:
   - `lat` / `lon` from the record
   - `elevation` in metres
   - `pressure = 1013.25` hPa (standard atmosphere)
   - `temp = 15.0` °C (standard atmosphere)
3. Set `observer.date` to the UTC datetime
4. Call `ephem.Sun(observer)` to get the Sun's position
5. `depression_angle = -math.degrees(sun.alt)` (negated because `sun.alt` is negative while the Sun is below the horizon)

PyEphem applies atmospheric refraction automatically at the specified pressure
and temperature. This matters near the horizon, where refraction can lift the apparent
solar disk by 0.5°-1.0°.
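The method steps above can be sketched in a few lines of PyEphem. This is an illustrative sketch, not the contents of `angle_calc.py`: the function names `depression_from_altitude` and `solar_depression` are hypothetical, and it assumes the third-party `ephem` package is installed. One detail worth noting is that PyEphem interprets a bare float latitude or longitude as radians, so decimal degrees are passed as strings here.

```python
import math
from datetime import datetime


def depression_from_altitude(alt_radians: float) -> float:
    """Convert a solar altitude in radians to a depression angle in degrees."""
    return -math.degrees(alt_radians)


def solar_depression(utc_dt: datetime, lat_deg: float, lng_deg: float,
                     elevation_m: float = 0.0) -> float:
    """Solar depression angle (degrees) at a UTC moment, via PyEphem."""
    import ephem  # third-party (pip install ephem); imported lazily

    obs = ephem.Observer()
    # PyEphem reads a bare float as radians, so pass degrees as strings.
    obs.lat = str(lat_deg)
    obs.lon = str(lng_deg)
    obs.elevation = elevation_m   # metres
    obs.pressure = 1013.25        # mbar (= hPa); 0 would disable refraction
    obs.temp = 15.0               # °C
    obs.date = utc_dt             # naive datetime, interpreted as UTC
    sun = ephem.Sun(obs)          # computes the Sun for this observer
    return depression_from_altitude(sun.alt)
```

Keeping the sign flip in its own small function makes the radians-to-degrees conversion easy to check without PyEphem.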
### `src/collect/openfajr.py`
Fetches and parses the OpenFajr Birmingham iCal feed from `calendar.google.com`.
The feed contains one `VEVENT` per day. The `DTSTART` field uses a `Z` suffix indicating
UTC. The `SUMMARY` field identifies the prayer type.
Known issue: around BST transition dates (late March, late October), a small number of
records have UTC times that produce physically impossible depression angles (sun above
horizon, or angle < 7°). These are caught by the quality filter.
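To illustrate the feed structure described above, here is a minimal standard-library sketch that pulls `DTSTART` and `SUMMARY` out of each `VEVENT`. It is not the project's parser: `parse_events` and `SAMPLE_FEED` are hypothetical, the sample event is made up, and the sketch ignores iCalendar line folding and property parameters that a real feed may contain.

```python
from datetime import datetime, timezone

# Made-up sample in the shape described above; not real feed data.
SAMPLE_FEED = """BEGIN:VCALENDAR
BEGIN:VEVENT
DTSTART:20240115T061000Z
SUMMARY:Fajr
END:VEVENT
END:VCALENDAR
"""


def parse_events(ical_text: str) -> list[dict]:
    """Collect the UTC start time and summary from each VEVENT block."""
    events, current = [], None
    for raw in ical_text.splitlines():
        line = raw.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            # Keep only events that carried both fields.
            if "utc_dt" in current and "summary" in current:
                events.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            if key == "DTSTART" and value.endswith("Z"):
                # The trailing Z marks the timestamp as UTC.
                current["utc_dt"] = datetime.strptime(
                    value, "%Y%m%dT%H%M%SZ"
                ).replace(tzinfo=timezone.utc)
            elif key == "SUMMARY":
                current["summary"] = value
    return events
```

Because the timestamps are already in UTC, records from this source would not need the `utc_offset` subtraction that manually compiled records require.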
### `src/collect/verified_sightings.py`
A Python list of manually compiled sighting records. Each record is a dictionary with:
| Field | Type | Description |
| --- | --- | --- |
| `prayer` | `"fajr"` or `"isha"` | Which prayer the sighting confirms |
| `date_local` | `"YYYY-MM-DD"` | Calendar date at the sighting location |
| `time_local` | `"HH:MM"` | 24-hour local time |
| `utc_offset` | `float` | Hours from UTC |
| `lat` | `float` | Decimal degrees (north positive) |
| `lng` | `float` | Decimal degrees (east positive) |
| `elevation_m` | `float` | Metres ASL (0 = will be looked up) |
| `source` | `str` | Citation |
| `notes` | `str` | Observer notes |
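A record matching the table above might look like the following. The values are placeholders invented for illustration, not a real sighting, and `validate_record` is a hypothetical helper showing the kind of sanity checks the schema implies.

```python
from datetime import datetime

REQUIRED_FIELDS = {
    "prayer", "date_local", "time_local", "utc_offset",
    "lat", "lng", "elevation_m", "source", "notes",
}

# Made-up values for illustration; not a real sighting.
example_record = {
    "prayer": "fajr",
    "date_local": "2020-06-01",
    "time_local": "03:45",
    "utc_offset": 3.0,
    "lat": 24.0,
    "lng": 45.0,
    "elevation_m": 0.0,  # 0 = to be looked up later
    "source": "placeholder citation",
    "notes": "",
}


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not match the field table."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["prayer"] not in ("fajr", "isha"):
        raise ValueError("prayer must be 'fajr' or 'isha'")
    # strptime raises ValueError on malformed dates and times.
    datetime.strptime(record["date_local"], "%Y-%m-%d")
    datetime.strptime(record["time_local"], "%H:%M")
    if not -90.0 <= record["lat"] <= 90.0:
        raise ValueError("lat out of range")
    if not -180.0 <= record["lng"] <= 180.0:
        raise ValueError("lng out of range")
```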
### `src/geocode.py`
Geocoding module. Converts city or region names to lat/lng coordinates using the
Nominatim API (OpenStreetMap). Used during the data ingestion pipeline when records
are provided with location names rather than explicit coordinates.
Caches results in `data/raw/geocode_cache.json` to avoid redundant API calls.
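The caching behaviour can be sketched as a thin wrapper around whatever function performs the actual lookup. This is an assumption-laden illustration rather than the module's real code: `cached_geocode` is a hypothetical name, and the `lookup` callable stands in for the Nominatim request, which is deliberately not shown.

```python
import json
from pathlib import Path
from typing import Callable

Coords = tuple[float, float]


def cached_geocode(name: str, lookup: Callable[[str], Coords],
                   cache_path: Path) -> Coords:
    """Return (lat, lng) for name, calling lookup only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if name not in cache:
        lat, lng = lookup(name)  # the only place a network call would happen
        cache[name] = [lat, lng]
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, indent=2))
    lat, lng = cache[name]
    return float(lat), float(lng)
```

Passing the lookup in as a parameter keeps the cache logic testable offline and makes it easy to add rate limiting around the real API call.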
### `src/elevation.py`
Queries the Open-Elevation API for records where `elevation_m == 0`.
Batches requests (max 100 per call). Writes results back to the record dict.
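The selection and batching logic described above can be sketched without touching the network. The helper names `needs_elevation` and `batches` are hypothetical, and the HTTP request to Open-Elevation is intentionally left out; only the `elevation_m == 0` rule and the 100-per-call limit come from this page.

```python
from typing import Iterator, Sequence

MAX_BATCH = 100  # per-call limit described above


def needs_elevation(records: list[dict]) -> list[dict]:
    """Select the records whose elevation is still the 0 placeholder."""
    return [r for r in records if r["elevation_m"] == 0]


def batches(items: Sequence, size: int = MAX_BATCH) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```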
---
## Data Flow in Detail
### 1. Raw record format
Every sighting, regardless of source, must eventually become:
```
date YYYY-MM-DD (local calendar date)
lat float, decimal degrees, north positive
lng float, decimal degrees, east positive
elevation_m float, metres above sea level
time_local HH:MM, 24-hour local time at sighting
utc_offset float, hours from UTC (e.g. 1.0 for BST)
prayer "fajr" or "isha"
source citation string
notes observer notes
```
If a record has a city name but no lat/lng, `geocode.py` fills it in.
If a record has `elevation_m == 0`, `elevation.py` fills it via the Open-Elevation API.
### 2. UTC conversion
```
utc_datetime = date + time_local - utc_offset (hours)
```
This is the single most error-prone step. Common failure modes:
- Using the wrong UTC offset (e.g. forgetting summer/winter DST)
- Using the standard timezone offset when the sighting date was in the alternate season
- Using the nominal timezone when the actual location's offset differs (e.g. parts of India)

All manually compiled records in `verified_sightings.py` include an explicit `utc_offset`
value for each date rather than a timezone name. This avoids DST ambiguity.
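As a concrete illustration of the conversion, the sketch below subtracts an explicit per-record offset from the local timestamp. The function name `to_utc` is hypothetical and the sample dates are arbitrary; note that the UTC calendar date can differ from the local one.

```python
from datetime import datetime, timedelta


def to_utc(date_local: str, time_local: str, utc_offset: float) -> datetime:
    """Combine local date and time, then subtract the offset in hours."""
    local_dt = datetime.strptime(f"{date_local} {time_local}", "%Y-%m-%d %H:%M")
    return local_dt - timedelta(hours=utc_offset)


# BST (UTC+1): 03:00 local is 02:00 UTC on the same day.
print(to_utc("2023-07-01", "03:00", 1.0))   # 2023-07-01 02:00:00
# A +5.5 h offset: 04:30 local falls on the previous UTC day.
print(to_utc("2023-01-10", "04:30", 5.5))   # 2023-01-09 23:00:00
```

Because the offset is a float, half-hour and quarter-hour zones need no special handling.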
### 3. Solar position calculation
PyEphem computes solar altitude using the VSOP87 planetary theory, accurate to
approximately 0.01°. Atmospheric refraction is the main source of uncertainty:
the standard atmosphere model (1013.25 hPa, 15°C) is a good average but actual
refraction varies with local conditions. For twilight observations near -12° altitude,
refraction contributes negligibly.

**Depression angle = -altitude.** When the Sun is below the horizon, the `alt` attribute
of the computed `ephem.Sun` is negative (PyEphem reports it in radians). The depression
angle is its absolute value, converted to degrees.
### 4. Quality filter
Records are dropped if:
- `fajr_angle < 7°`: implausible (the sky would already be bright, well past first light)
- `isha_angle < 10°`: same reasoning for Isha (the sky would still be bright)
- Angle is NaN: the calculation failed

These thresholds are conservative. Genuine sighting records produce 8°-21° for Fajr
and 11°-22° for Isha. Values below 7° / 10° indicate a data entry error, most commonly
a UTC offset mistake or a DST clock-change artifact.
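The drop rules above amount to a small predicate. The sketch below is illustrative only: `MIN_ANGLE` and `is_plausible` are hypothetical names, while the 7° and 10° thresholds and the NaN rule are taken from this section.

```python
import math

MIN_ANGLE = {"fajr": 7.0, "isha": 10.0}  # degrees, thresholds from above


def is_plausible(prayer: str, angle: float) -> bool:
    """Return True if a computed depression angle passes the quality filter."""
    if math.isnan(angle):
        return False  # the calculation failed
    # Records strictly below the threshold are dropped.
    return angle >= MIN_ANGLE[prayer]
```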
---
## Output Schema
Both output CSVs share this schema:
| Column | Type | Description |
| --- | --- | --- |
| `date` | string | YYYY-MM-DD local date |
| `utc_dt` | string | ISO 8601 UTC datetime |
| `lat` | float | Decimal degrees |
| `lng` | float | Decimal degrees |
| `elevation_m` | float | Metres ASL |
| `day_of_year` | int | 1-366 |
| `fajr_angle` or `isha_angle` | float | Solar depression angle (°) |
| `source` | string | Citation |
| `notes` | string | Observer notes |
---
## Source Hierarchy
Records are ranked by data quality:
| Tier | Source type | Example |
| --- | --- | --- |
| 1 | Community astrophotography, panel-voted | OpenFajr Birmingham |
| 2 | DSLR + SQM instrumental observation | Kassim Bahali 2018 Malaysia |
| 3 | SQM photometry only | Saksono 2020 Indonesia |
| 4 | Multi-observer naked-eye, documented | Asim Yusuf UK, Hizbul Ulama UK |
| 5 | Single trained observer, per-date log | NRIAG Egypt individual nights |
| 6 | Published mean per season, time inferred | Hail Saudi Arabia (seasonal means) |

Tier 6 records (inferred times) are marked in `notes`. They contribute to geographic
diversity but carry more uncertainty than direct observations.
---
## Known Limitations
1. **Birmingham dominance.** The OpenFajr dataset provides ~4,000 records, but all from
one location at 52.5°N. Any ML model trained on this data is extrapolating whenever it
is applied at other latitudes. Geographic diversity is the primary gap.
2. **Isha data scarcity.** Only ~43 Isha records vs ~4,100 Fajr records. The Isha dataset
depends on Shafaq al-Abyad observations, which are less systematically documented.
3. **Atmospheric variability.** The standard atmosphere model (1013.25 hPa, 15°C) does
not capture day-to-day refraction variation. On cold clear nights, refraction is
higher; on hot dry nights, lower. This introduces ~0.1°-0.3° uncertainty per record.
4. **Observer skill variation.** Naked-eye observations depend on the observer's dark
adaptation, experience, and site conditions. The depression angle for a given
"true dawn" varies across observers by up to 2°.
---
*[← ML Crunching](ML-Crunching) · [Data Sources →](Data-Sources)*