# Architecture

This page explains how the pipeline works end-to-end: how raw sighting records become
training data, what each module does, and how the pieces fit together.

---

## Overview

```
Raw sighting data
        ↓
[openfajr.py]            OpenFajr iCal feed (Birmingham, UK, 2016-present)
[verified_sightings.py]  Manually compiled records (35+ locations worldwide)
[geocode.py]             Geocoding: city/region names → lat/lng
        ↓
Standardized records: { date, lat, lng, elevation_m, time_local, utc_offset }
        ↓
[elevation.py]           Open-Elevation API: fill missing elevation_m values
        ↓
[angle_calc.py]          PyEphem back-calculation: UTC moment → solar depression angle
        ↓
[pipeline.py]            Quality filter: drop implausible angles (< 7° Fajr / < 10° Isha)
        ↓
data/processed/fajr_angles.csv
data/processed/isha_angles.csv
        ↓
[01_exploratory_analysis.ipynb]  EDA + linear baseline + gradient boosting
```

---

## Modules

### `src/pipeline.py`

The master script. Runs all steps in sequence.

```
python -m src.pipeline [--no-elevation-lookup]
```

Responsibilities:

1. Call `openfajr.load()` and `verified_sightings.load()` to get raw records
2. Call `elevation.enrich()` to fill missing elevation values
3. Call `angle_calc.compute()` for each record
4. Drop records with implausible angles
5. Write `fajr_angles.csv` and `isha_angles.csv`
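
The five responsibilities above can be sketched as one orchestration function. This is a minimal illustration, not the project's actual code: `build_parser`, `run_pipeline` and the injected `enrich`, `compute_angle` and `is_plausible` callables are hypothetical names. Only the `--no-elevation-lookup` flag comes from this page, and what it skips is assumed from its name.

```python
import argparse
from typing import Callable, Dict, Iterable, List

Record = dict  # one sighting record


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m src.pipeline")
    parser.add_argument(
        "--no-elevation-lookup",
        action="store_true",
        help="assumed meaning: skip the elevation enrichment step",
    )
    return parser


def run_pipeline(
    records: Iterable[Record],
    enrich: Callable[[List[Record]], List[Record]],
    compute_angle: Callable[[Record], float],
    is_plausible: Callable[[str, float], bool],
    lookup_elevation: bool = True,
) -> Dict[str, List[Record]]:
    rows = list(records)                        # step 1: raw records, already loaded
    if lookup_elevation:
        rows = enrich(rows)                     # step 2: fill missing elevations
    by_prayer: Dict[str, List[Record]] = {"fajr": [], "isha": []}
    for rec in rows:
        angle = compute_angle(rec)              # step 3: back-calculate the angle
        if is_plausible(rec["prayer"], angle):  # step 4: quality filter
            by_prayer[rec["prayer"]].append({**rec, "angle": angle})
    return by_prayer                            # step 5: one list per output CSV
```

Passing the stages in as callables keeps the sketch runnable without network access; a real entry point would wire in the module functions listed above.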

### `src/angle_calc.py`

The back-calculation engine. Takes a confirmed sighting record and returns the solar
depression angle at the observed moment.

**Method:**

1. Convert local time to UTC: `utc = local_dt - timedelta(hours=utc_offset)`
2. Set up an `ephem.Observer` with:
   - `lat` / `lon` from the record
   - `elevation` in metres
   - `pressure = 1013.25` hPa (standard atmosphere)
   - `temp = 15.0` °C (standard atmosphere)
3. Set `observer.date` to the UTC datetime
4. Call `ephem.Sun(observer)` to get the Sun's position
5. `depression_angle = -math.degrees(sun.alt)` (`sun.alt` is in radians and is negative when the sun is below the horizon, so negating it gives a positive depression angle)

PyEphem applies atmospheric refraction automatically at the specified pressure
and temperature. This matters most near the horizon, where refraction can lift the
apparent solar disk by 0.5°-1.0°.
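
The method above can be sketched in a few lines of PyEphem. This is an illustration under stated assumptions, not the module's actual code: the function name `depression_angle` and its signature are hypothetical, and `ephem` is imported inside the function only so the snippet loads on machines without PyEphem installed.

```python
import math
from datetime import datetime, timedelta


def depression_angle(local_dt: datetime, utc_offset: float,
                     lat: float, lng: float, elevation_m: float) -> float:
    """Return the sun's depression angle in degrees (positive below the horizon)."""
    import ephem  # third-party package: pip install ephem

    observer = ephem.Observer()
    # PyEphem parses strings as degrees; bare floats would be read as radians.
    observer.lat = str(lat)
    observer.lon = str(lng)
    observer.elevation = elevation_m   # metres
    observer.pressure = 1013.25        # mbar, numerically equal to hPa
    observer.temp = 15.0               # °C
    # Naive datetimes are interpreted as UTC.
    observer.date = local_dt - timedelta(hours=utc_offset)

    sun = ephem.Sun(observer)          # computes the sun's position for this observer
    return -math.degrees(sun.alt)      # sun.alt is an angle in radians
```

Latitude and longitude are passed as strings on purpose: assigning a plain float to `observer.lat` is a common source of silently wrong angles, because PyEphem treats it as radians.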

### `src/collect/openfajr.py`

Fetches and parses the OpenFajr Birmingham iCal feed from `calendar.google.com`.

The feed contains one `VEVENT` per day. The `DTSTART` field uses a `Z` suffix indicating
UTC. The `SUMMARY` field identifies the prayer type.
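
As a rough sketch of that parsing step, the snippet below pulls `DTSTART` and `SUMMARY` out of each `VEVENT` using only the standard library. The function name `parse_vevents` is hypothetical, the exact `SUMMARY` wording in the real feed is not documented here, and folded (continued) iCal lines are deliberately not handled.

```python
from datetime import datetime, timezone


def parse_vevents(ical_text: str) -> list:
    """Extract {"utc", "summary"} dicts from the VEVENT blocks of an iCal document."""
    events, current = [], None
    for raw in ical_text.splitlines():
        line = raw.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            # Keep only events that carried both fields.
            if "utc" in current and "summary" in current:
                events.append(current)
            current = None
        elif current is not None and line.startswith("DTSTART:"):
            # The trailing "Z" marks UTC, e.g. 20240315T043000Z.
            stamp = datetime.strptime(line[len("DTSTART:"):], "%Y%m%dT%H%M%SZ")
            current["utc"] = stamp.replace(tzinfo=timezone.utc)
        elif current is not None and line.startswith("SUMMARY:"):
            current["summary"] = line[len("SUMMARY:"):]
    return events
```

Because the format string requires the literal `Z`, a `DTSTART` without it raises `ValueError` instead of being silently treated as UTC.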

Known issue: around BST transition dates (late March, late October), a small number of
records have UTC times that produce implausible depression angles (sun above the
horizon, or angle < 7°). These are caught by the quality filter.

### `src/collect/verified_sightings.py`

A Python list of manually compiled sighting records. Each record is a dictionary with:

| Field | Type | Description |
| --- | --- | --- |
| `prayer` | `"fajr"` or `"isha"` | Which prayer the sighting confirms |
| `date_local` | `"YYYY-MM-DD"` | Calendar date at the sighting location |
| `time_local` | `"HH:MM"` | 24-hour local time |
| `utc_offset` | `float` | Hours ahead of UTC (e.g. `1.0` for BST) |
| `lat` | `float` | Decimal degrees (north positive) |
| `lng` | `float` | Decimal degrees (east positive) |
| `elevation_m` | `float` | Metres above sea level (`0` = will be looked up) |
| `source` | `str` | Citation |
| `notes` | `str` | Observer notes |
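
The record layout in the table can be written down as a `TypedDict` with a small sanity check. Both `SightingRecord` and `validate` are hypothetical names for illustration, and the range limits in the check are assumptions, not rules taken from the project.

```python
from datetime import datetime
from typing import List, TypedDict


class SightingRecord(TypedDict):
    prayer: str         # "fajr" or "isha"
    date_local: str     # "YYYY-MM-DD"
    time_local: str     # "HH:MM", 24-hour
    utc_offset: float   # hours ahead of UTC
    lat: float          # decimal degrees, north positive
    lng: float          # decimal degrees, east positive
    elevation_m: float  # 0 = to be looked up later
    source: str
    notes: str


def validate(rec: SightingRecord) -> List[str]:
    """Return a list of problems found in a record (empty if it looks sane)."""
    problems = []
    if rec["prayer"] not in ("fajr", "isha"):
        problems.append("prayer must be 'fajr' or 'isha'")
    try:
        datetime.strptime(f"{rec['date_local']} {rec['time_local']}", "%Y-%m-%d %H:%M")
    except ValueError:
        problems.append("date_local/time_local do not parse")
    if not -90.0 <= rec["lat"] <= 90.0:
        problems.append("lat out of range")
    if not -180.0 <= rec["lng"] <= 180.0:
        problems.append("lng out of range")
    if not -12.0 <= rec["utc_offset"] <= 14.0:
        problems.append("utc_offset out of range")
    return problems
```

Running a check like this when the list is loaded catches typos in hand-entered records before they reach the angle calculation.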

### `src/geocode.py`

Geocoding module. Converts city or region names to lat/lng coordinates using the
Nominatim API (OpenStreetMap). Used during the data ingestion pipeline when records
are provided with location names rather than explicit coordinates.

Caches results in `data/raw/geocode_cache.json` to avoid redundant API calls.
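
A minimal sketch of the caching idea follows. The function name `geocode_cached`, the lower-cased cache key and the injected `lookup` callable (standing in for the Nominatim request) are all assumptions made for illustration; the actual cache file layout is not documented on this page.

```python
import json
from pathlib import Path
from typing import Callable, Tuple

LatLng = Tuple[float, float]


def geocode_cached(name: str, lookup: Callable[[str], LatLng], cache_path: Path) -> LatLng:
    """Return (lat, lng) for `name`, calling `lookup` only on a cache miss."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = name.strip().lower()  # normalise so "Birmingham " and "birmingham" share an entry
    if key not in cache:
        lat, lng = lookup(name)
        cache[key] = [lat, lng]
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    lat, lng = cache[key]
    return lat, lng
```

Keeping the network call behind a callable means the cache logic can be tested offline, and repeated pipeline runs send each place name to the geocoder at most once.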

### `src/elevation.py`

Queries the Open-Elevation API for records where `elevation_m == 0`.

Batches requests (max 100 per call). Writes results back to the record dict.
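
The batching can be sketched as below. The names `batches` and `enrich` and the injected `fetch` callable (which stands in for the HTTP request and returns one elevation per location) are hypothetical; the `latitude`/`longitude` keys are assumed to match the API's request format.

```python
from typing import Callable, Iterator, List

BATCH_SIZE = 100  # maximum locations per request, as stated above


def batches(items: List[dict], size: int = BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich(records: List[dict],
           fetch: Callable[[List[dict]], List[float]]) -> List[dict]:
    """Fill elevation_m in place for every record where it is 0."""
    missing = [r for r in records if r["elevation_m"] == 0]
    for chunk in batches(missing):
        locations = [{"latitude": r["lat"], "longitude": r["lng"]} for r in chunk]
        # fetch() is expected to return elevations in the same order as locations.
        for rec, elevation in zip(chunk, fetch(locations)):
            rec["elevation_m"] = float(elevation)
    return records
```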

---

## Data Flow in Detail

### 1. Raw record format

Every sighting, regardless of source, must eventually become:

```
date         YYYY-MM-DD (local calendar date)
lat          float, decimal degrees, north positive
lng          float, decimal degrees, east positive
elevation_m  float, metres above sea level
time_local   HH:MM, 24-hour local time at sighting
utc_offset   float, hours from UTC (e.g. 1.0 for BST)
prayer       "fajr" or "isha"
source       citation string
notes        observer notes
```

If a record has a city name but no lat/lng, `geocode.py` fills it in.
If a record has `elevation_m == 0`, `elevation.py` fills it via the Open-Elevation API.

### 2. UTC conversion

```
utc_datetime = date + time_local - utc_offset (hours)
```

This is the single most error-prone step. Common failure modes:

- Using the wrong UTC offset (e.g. forgetting summer/winter DST)
- Using the standard timezone offset when the sighting date was in the alternate season
- Using the nominal timezone when the actual location's offset differs (e.g. parts of India)

All manually compiled records in `verified_sightings.py` include explicit `utc_offset`
values per-date, not per-timezone-name. This avoids DST ambiguity.
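
The conversion formula above takes only a few lines with the standard library. The helper name `to_utc` is hypothetical; the field names follow the record format on this page.

```python
from datetime import datetime, timedelta


def to_utc(date_local: str, time_local: str, utc_offset: float) -> datetime:
    """Convert a local date and time plus an explicit offset to a naive UTC datetime."""
    local_dt = datetime.strptime(f"{date_local} {time_local}", "%Y-%m-%d %H:%M")
    # The offset is hours ahead of UTC, so it is subtracted from local time.
    return local_dt - timedelta(hours=utc_offset)
```

For example, `to_utc("2024-06-15", "00:30", 1.0)` gives `datetime(2024, 6, 14, 23, 30)`: a sighting shortly after local midnight in BST falls on the previous UTC date. Fractional offsets such as `5.5` work the same way.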

### 3. Solar position calculation

PyEphem computes solar altitude using the VSOP87 planetary theory, accurate to
approximately 0.01°. Atmospheric refraction is the main source of uncertainty:
the standard atmosphere model (1013.25 hPa, 15°C) is a good average but actual
refraction varies with local conditions. For twilight observations near -12° altitude,
refraction contributes negligibly.

**Depression angle = -altitude.** When the sun is below the horizon, the `alt` attribute
of the computed `ephem.Sun` body is negative. The depression angle is its absolute value.

### 4. Quality filter

Records are dropped if:

- `fajr_angle < 7°`: implausible (the sun is so close to the horizon that the sky would already be bright)
- `isha_angle < 10°`: same reasoning for Isha (the sky would still be bright)
- Angle is NaN: calculation failed

These thresholds are conservative. Genuine sighting records produce 8°-21° for Fajr
and 11°-22° for Isha. Values below 7° / 10° indicate a data entry error, most commonly
a UTC offset mistake or a DST clock-change artifact.
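
The filter rules above reduce to a single predicate. The names `MIN_ANGLE` and `is_plausible` are hypothetical; the thresholds are the ones stated in this section. For instance, `is_plausible("fajr", 6.9)` is `False` and `is_plausible("isha", 10.0)` is `True`.

```python
import math

# Minimum plausible depression angle in degrees, per prayer.
MIN_ANGLE = {"fajr": 7.0, "isha": 10.0}


def is_plausible(prayer: str, angle: float) -> bool:
    """True if the back-calculated angle passes the quality filter."""
    # NaN means the calculation failed; anything under the threshold is dropped.
    return not math.isnan(angle) and angle >= MIN_ANGLE[prayer]
```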

---

## Output Schema

Both output CSVs share this schema:

| Column | Type | Description |
| --- | --- | --- |
| `date` | string | YYYY-MM-DD local date |
| `utc_dt` | string | ISO 8601 UTC datetime |
| `lat` | float | Decimal degrees |
| `lng` | float | Decimal degrees |
| `elevation_m` | float | Metres ASL |
| `day_of_year` | int | 1-366 |
| `fajr_angle` or `isha_angle` | float | Solar depression angle (°) |
| `source` | string | Citation |
| `notes` | string | Observer notes |
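
To make the schema concrete, here is a sketch of building and writing one row with the standard `csv` module. The helpers `to_row` and `write_rows` are hypothetical, and deriving `day_of_year` from the local date (not the UTC date) is an assumption this page does not confirm.

```python
import csv
from datetime import datetime

BASE_COLUMNS = ["date", "utc_dt", "lat", "lng", "elevation_m", "day_of_year"]


def to_row(rec: dict, utc_dt: datetime, angle: float, angle_column: str) -> dict:
    """Build one CSV row; angle_column is "fajr_angle" or "isha_angle"."""
    local_date = datetime.strptime(rec["date_local"], "%Y-%m-%d")
    return {
        "date": rec["date_local"],
        "utc_dt": utc_dt.isoformat(),
        "lat": rec["lat"],
        "lng": rec["lng"],
        "elevation_m": rec["elevation_m"],
        # Assumption: day_of_year is taken from the local date.
        "day_of_year": local_date.timetuple().tm_yday,
        angle_column: angle,
        "source": rec["source"],
        "notes": rec["notes"],
    }


def write_rows(rows: list, angle_column: str, out) -> None:
    """Write rows to an open text file in the column order of the schema table."""
    columns = BASE_COLUMNS + [angle_column, "source", "notes"]
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
```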

---

## Source Hierarchy

Records are ranked by data quality:

| Tier | Source type | Example |
| --- | --- | --- |
| 1 | Community astrophotography, panel-voted | OpenFajr Birmingham |
| 2 | DSLR + SQM instrumental observation | Kassim Bahali 2018 Malaysia |
| 3 | SQM photometry only | Saksono 2020 Indonesia |
| 4 | Multi-observer naked-eye, documented | Asim Yusuf UK, Hizbul Ulama UK |
| 5 | Single trained observer, per-date log | NRIAG Egypt individual nights |
| 6 | Published mean per season, time inferred | Hail Saudi Arabia (seasonal means) |

Tier 6 records (inferred times) are marked in `notes`. They contribute to geographic
diversity but carry more uncertainty than direct observations.

---

## Known Limitations

1. **Birmingham dominance.** The OpenFajr dataset provides ~4,000 records but all from
   one location at 52.5°N. Any ML model trained on this data has to extrapolate to all
   other latitudes. Geographic diversity is the primary gap.

2. **Isha data scarcity.** Only ~43 Isha records vs ~4,100 Fajr records. The Isha records
   depend on Shafaq al-Abyad observations, which are less systematically documented.

3. **Atmospheric variability.** The standard atmosphere model (1013.25 hPa, 15°C) does
   not capture day-to-day refraction variation. On cold clear nights, refraction is
   higher; on hot dry nights, lower. This introduces ~0.1°-0.3° uncertainty per record.

4. **Observer skill variation.** Naked-eye observations depend on the observer's dark
   adaptation, experience, and site conditions. The depression angle for a given
   "true dawn" varies across observers by up to 2°.

---

*[← ML Crunching](ML-Crunching) · [Data Sources →](Data-Sources)*