Architecture
This page explains how the pipeline works end-to-end: how raw sighting records become training data, what each module does, and how the pieces fit together.
Overview
Raw sighting data
↓
[openfajr.py] OpenFajr iCal feed (Birmingham, UK, 2016-present)
[verified_sightings.py] Manually compiled records (35+ locations worldwide)
[geocode.py] Geocoding: city/region names → lat/lng
↓
Standardized records: { date, lat, lng, elevation_m, time_local, utc_offset }
↓
[elevation.py] Open-Elevation API: fill missing elevation_m values
↓
[angle_calc.py] PyEphem back-calculation: UTC moment → solar depression angle
↓
[pipeline.py] Quality filter: drop implausible angles (< 7° Fajr / < 10° Isha)
↓
data/processed/fajr_angles.csv
data/processed/isha_angles.csv
↓
[01_exploratory_analysis.ipynb] EDA + linear baseline + gradient boosting
Modules
src/pipeline.py
The master script. Runs all steps in sequence.
python -m src.pipeline [--no-elevation-lookup]
Responsibilities:
- Call openfajr.load() and verified_sightings.load() to get raw records
- Call elevation.enrich() to fill missing elevation values
- Call angle_calc.compute() for each record
- Drop records with implausible angles
- Write fajr_angles.csv and isha_angles.csv
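The sequence above can be sketched as one function. This is a minimal illustration rather than the project's implementation: `run_pipeline` and its parameters are hypothetical, and because the real signatures of `openfajr.load()`, `elevation.enrich()`, and `angle_calc.compute()` are not shown on this page, each step is passed in as a callable.

```python
def run_pipeline(loaders, enrich, compute_angle, is_plausible, lookup_elevation=True):
    """Run load -> enrich -> compute -> filter and group rows by prayer.

    All four callables are hypothetical stand-ins for the project's modules.
    """
    # 1. Gather raw records from every source.
    records = [record for load in loaders for record in load()]

    # 2. Fill missing elevations (skipped when --no-elevation-lookup is given).
    if lookup_elevation:
        records = enrich(records)

    # 3-4. Compute the angle for each record and drop implausible ones.
    output = {"fajr": [], "isha": []}
    for record in records:
        angle = compute_angle(record)
        if is_plausible(record["prayer"], angle):
            output[record["prayer"]].append({**record, "angle": angle})

    # 5. The caller writes output["fajr"] and output["isha"] to the two CSVs.
    return output
```

Passing the steps in as arguments keeps the orchestration testable without network access or PyEphem.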
src/angle_calc.py
The back-calculation engine. Takes a confirmed sighting record and returns the solar depression angle at the observed moment.
Method:
- Convert local time to UTC: utc = local_dt - timedelta(hours=utc_offset)
- Set up an ephem.Observer with:
  - lat/lon from the record
  - elevation in metres
  - pressure = 1013.25 hPa (standard atmosphere)
  - temp = 15.0 °C (standard atmosphere)
- Set observer.date to the UTC datetime
- Call ephem.Sun(observer) to get the Sun's position
- depression_angle = -math.degrees(sun.alt) (negated because the Sun's altitude is negative when it is below the horizon)
Atmospheric refraction is applied automatically by PyEphem at the specified pressure and temperature. This is important: near the horizon, refraction can lift the apparent solar disk by 0.5°-1.0°.
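The method above might look like the following sketch. It assumes the `ephem` package (PyEphem) is installed. The function names `depression_from_altitude` and `solar_depression` are illustrative, not the module's actual API. One detail to watch: PyEphem parses a string latitude or longitude as degrees but treats a float as radians, so the coordinates are passed as strings here.

```python
import math
from datetime import timedelta

try:
    import ephem  # PyEphem (third-party)
except ImportError:  # the pure helper below still works without it
    ephem = None


def depression_from_altitude(alt_radians):
    """Convert a solar altitude in radians to a depression angle in degrees."""
    return -math.degrees(alt_radians)


def solar_depression(local_dt, utc_offset, lat, lng, elevation_m):
    """Return the solar depression angle (degrees) at a local sighting time.

    local_dt is a naive datetime in local time; utc_offset is in hours.
    """
    if ephem is None:
        raise RuntimeError("PyEphem is required for this calculation")

    observer = ephem.Observer()
    observer.lat = str(lat)      # strings are read as degrees, floats as radians
    observer.lon = str(lng)
    observer.elevation = elevation_m
    observer.pressure = 1013.25  # hPa, standard atmosphere
    observer.temp = 15.0         # degrees C, standard atmosphere
    observer.date = local_dt - timedelta(hours=utc_offset)  # naive UTC datetime

    sun = ephem.Sun(observer)    # computes the Sun's position for this observer
    return depression_from_altitude(sun.alt)
```

A call such as `solar_depression(datetime(2023, 6, 15, 2, 45), 1.0, 52.48, -1.90, 140.0)` (illustrative values, with `datetime` imported from the standard library) returns the depression angle in degrees for that moment.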
src/collect/openfajr.py
Fetches and parses the OpenFajr Birmingham iCal feed from calendar.google.com.
The feed contains one VEVENT per day. The DTSTART field uses a Z suffix indicating
UTC. The SUMMARY field identifies the prayer type.
Known issue: around BST transition dates (late March, late October), a small number of records have UTC times that produce physically impossible depression angles (sun above horizon, or angle < 7°). These are caught by the quality filter.
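As a rough sketch of the parsing step, the snippet below reads DTSTART and SUMMARY from each VEVENT using only the standard library. It is not the module's actual code: `parse_vevents` is a hypothetical name, the sample event is made up, and the sketch ignores iCalendar line folding and any fields the real feed may carry beyond those described above.

```python
from datetime import datetime, timezone


def parse_vevents(ical_text):
    """Return a list of (utc_datetime, summary) pairs, one per complete VEVENT."""
    events = []
    current = None  # fields of the VEVENT being read, or None outside one
    for raw_line in ical_text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT":
            if current is not None and "DTSTART" in current and "SUMMARY" in current:
                events.append((current["DTSTART"], current["SUMMARY"]))
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            name = key.split(";", 1)[0]  # drop any property parameters
            if name == "DTSTART" and value.endswith("Z"):
                # The Z suffix marks the timestamp as UTC.
                parsed = datetime.strptime(value, "%Y%m%dT%H%M%SZ")
                current["DTSTART"] = parsed.replace(tzinfo=timezone.utc)
            elif name == "SUMMARY":
                current["SUMMARY"] = value
    return events


# Made-up sample event, for illustration only (not real feed data).
SAMPLE = "\n".join([
    "BEGIN:VCALENDAR",
    "BEGIN:VEVENT",
    "DTSTART:20240115T061200Z",
    "SUMMARY:Fajr",
    "END:VEVENT",
    "END:VCALENDAR",
])
events = parse_vevents(SAMPLE)
```

Events whose DTSTART lacks the Z suffix are skipped rather than guessed at, since treating a local time as UTC would produce exactly the kind of bad angle the quality filter has to catch.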
src/collect/verified_sightings.py
A Python list of manually compiled sighting records. Each record is a dictionary with:
| Field | Type | Description |
|---|---|---|
| prayer | "fajr" or "isha" | Which prayer the sighting confirms |
| date_local | "YYYY-MM-DD" | Calendar date at the sighting location |
| time_local | "HH:MM" | 24-hour local time |
| utc_offset | float | Hours from UTC |
| lat | float | Decimal degrees (north positive) |
| lng | float | Decimal degrees (east positive) |
| elevation_m | float | Metres ASL (0 = will be looked up) |
| source | str | Citation |
| notes | str | Observer notes |
src/geocode.py
Geocoding module. Converts city or region names to lat/lng coordinates using the Nominatim API (OpenStreetMap). Used during the data ingestion pipeline when records are provided with location names rather than explicit coordinates.
Caches results in data/raw/geocode_cache.json to avoid redundant API calls.
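The caching behaviour could be sketched as below. This is an assumption about how such a cache might work, not the module's code: `cached_geocode` is a hypothetical helper, the cache file layout (a JSON object keyed by lower-cased place name) is invented for the example, and the Nominatim request itself is left to a `lookup` callable supplied by the caller.

```python
import json
from pathlib import Path


def cached_geocode(name, lookup, cache_path="data/raw/geocode_cache.json"):
    """Return (lat, lng) for a place name, calling lookup only on a cache miss.

    lookup is any callable taking a place name and returning (lat, lng).
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    key = name.strip().lower()  # normalise so "Birmingham" and " birmingham " match
    if key not in cache:
        lat, lng = lookup(name)
        cache[key] = {"lat": lat, "lng": lng}
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8")

    entry = cache[key]
    return entry["lat"], entry["lng"]
```

Keeping the network call behind a parameter means the cache logic can be exercised offline with a stub.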
src/elevation.py
Queries the Open-Elevation API for records where elevation_m == 0.
Batches requests (max 100 per call). Writes results back to the record dict.
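A minimal sketch of the batching logic follows. The names `chunked`, `enrich`, and `fetch_elevations` are hypothetical; in particular `fetch_elevations` stands in for the actual Open-Elevation request and is assumed to return one elevation per coordinate, in order.

```python
def chunked(items, size=100):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich(records, fetch_elevations, batch_size=100):
    """Fill elevation_m in place for every record where it is 0.

    fetch_elevations takes a list of (lat, lng) pairs and returns a list of
    elevations in metres, in the same order.
    """
    missing = [record for record in records if record["elevation_m"] == 0]
    for batch in chunked(missing, batch_size):
        coords = [(record["lat"], record["lng"]) for record in batch]
        for record, elevation in zip(batch, fetch_elevations(coords)):
            record["elevation_m"] = float(elevation)
    return records
```

With 250 records missing an elevation, this makes three requests of 100, 100, and 50 locations.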
Data Flow in Detail
1. Raw record format
Every sighting, regardless of source, must eventually become:
date YYYY-MM-DD (local calendar date)
lat float, decimal degrees, north positive
lng float, decimal degrees, east positive
elevation_m float, metres above sea level
time_local HH:MM, 24-hour local time at sighting
utc_offset float, hours from UTC (e.g. 1.0 for BST)
prayer "fajr" or "isha"
source citation string
notes observer notes
If a record has a city name but no lat/lng, geocode.py fills it in.
If a record has elevation_m == 0, elevation.py fills it via the Open-Elevation API.
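One way to enforce this format before the calculation step is a small validator. The sketch below is illustrative: `validate_record` and `REQUIRED_FIELDS` are hypothetical names, and the range checks are assumptions about what counts as well-formed rather than rules stated by the project.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "date", "lat", "lng", "elevation_m", "time_local",
    "utc_offset", "prayer", "source", "notes",
)


def validate_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems  # later checks need every field to be present

    try:
        datetime.strptime(f"{record['date']} {record['time_local']}", "%Y-%m-%d %H:%M")
    except ValueError:
        problems.append("date/time_local must be YYYY-MM-DD and HH:MM")
    if not -90.0 <= record["lat"] <= 90.0:
        problems.append("lat out of range")
    if not -180.0 <= record["lng"] <= 180.0:
        problems.append("lng out of range")
    if record["prayer"] not in ("fajr", "isha"):
        problems.append("prayer must be 'fajr' or 'isha'")
    return problems
```

Running this after geocoding and elevation lookup catches malformed records before they reach the solar position calculation.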
2. UTC conversion
utc_datetime = date + time_local - utc_offset (hours)
This is the single most error-prone step. Common failure modes:
- Using the wrong UTC offset (e.g. forgetting summer/winter DST)
- Applying a zone's standard-time offset to a date that falls in its daylight-saving period, or vice versa
- Using the nominal timezone when the actual location's offset differs (e.g. parts of India)
All manually compiled records in verified_sightings.py include explicit utc_offset
values per-date, not per-timezone-name. This avoids DST ambiguity.
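The conversion can be worked through in a few lines. This is a sketch with a hypothetical helper name (`to_utc`) and made-up times, chosen to show the cases most likely to go wrong: a summer-time offset, a sighting just after local midnight, and a fractional offset.

```python
from datetime import datetime, timedelta


def to_utc(date_local, time_local, utc_offset):
    """Combine a local date and HH:MM time, then subtract the UTC offset (hours)."""
    local_dt = datetime.strptime(f"{date_local} {time_local}", "%Y-%m-%d %H:%M")
    return local_dt - timedelta(hours=utc_offset)


# BST is UTC+1, so 03:00 local is 02:00 UTC on the same date.
summer = to_utc("2023-06-15", "03:00", 1.0)

# Just after local midnight, the UTC moment falls on the previous calendar date.
after_midnight = to_utc("2023-06-15", "00:30", 1.0)

# Fractional offsets work the same way (UTC+5:30 here).
half_hour = to_utc("2023-06-15", "04:00", 5.5)
```

The second case shows why the local date and the UTC date must be stored separately: the UTC timestamp for a sighting dated 15 June can fall on 14 June.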
3. Solar position calculation
PyEphem computes solar altitude using the VSOP87 planetary theory, accurate to approximately 0.01°. Atmospheric refraction is the main source of uncertainty: the standard atmosphere model (1013.25 hPa, 15°C) is a good average but actual refraction varies with local conditions. For twilight observations near -12° altitude, refraction contributes negligibly.
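Depression angle = -altitude. PyEphem reports the Sun's alt in radians, and the value is negative when the Sun is below the horizon, so negating it (after converting to degrees) gives a positive depression angle. A Sun above the horizon yields a negative depression angle, which the quality filter rejects.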
4. Quality filter
Records are dropped if:
- fajr_angle < 7°: implausible for a genuine sighting (at so small a depression the sky is already bright with twilight)
- isha_angle < 10°: same reasoning for Isha
- Angle is NaN: calculation failed
These thresholds are conservative. Genuine sighting records produce 8°-21° for Fajr and 11°-22° for Isha. Values below 7° / 10° indicate a data entry error, most commonly a UTC offset mistake or a DST clock-change artifact.
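The filter rule can be expressed as a single predicate. This is a minimal sketch using the thresholds stated above; the names `MIN_ANGLE` and `is_plausible` are hypothetical.

```python
import math

# Minimum plausible depression angle in degrees, per prayer.
MIN_ANGLE = {"fajr": 7.0, "isha": 10.0}


def is_plausible(prayer, angle):
    """Return True if the angle passes the quality filter for this prayer."""
    if angle is None or math.isnan(angle):
        return False  # calculation failed
    return angle >= MIN_ANGLE[prayer]
```

A negative angle (Sun above the horizon) fails the same comparison, so it needs no separate check.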
Output Schema
Both output CSVs share this schema:
| Column | Type | Description |
|---|---|---|
| date | string | YYYY-MM-DD local date |
| utc_dt | string | ISO 8601 UTC datetime |
| lat | float | Decimal degrees |
| lng | float | Decimal degrees |
| elevation_m | float | Metres ASL |
| day_of_year | int | 1-366 |
| fajr_angle or isha_angle | float | Solar depression angle (°) |
| source | string | Citation |
| notes | string | Observer notes |
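Writing a file in this schema might look like the sketch below. It rests on assumptions: `write_angles_csv` is a hypothetical helper, and `day_of_year` is derived here from the local date, which this page does not confirm.

```python
import csv
from datetime import datetime


def write_angles_csv(rows, angle_column, handle):
    """Write rows to an open text handle using the shared output schema.

    angle_column is "fajr_angle" or "isha_angle". Keys outside the schema
    are ignored.
    """
    columns = [
        "date", "utc_dt", "lat", "lng", "elevation_m",
        "day_of_year", angle_column, "source", "notes",
    ]
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        local_date = datetime.strptime(row["date"], "%Y-%m-%d")
        # tm_yday runs from 1 to 366, matching the day_of_year column.
        writer.writerow({**row, "day_of_year": local_date.timetuple().tm_yday})
```

When writing to disk, open the file with `newline=""` as the `csv` module documentation recommends, for example `open("data/processed/fajr_angles.csv", "w", newline="")`.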
Source Hierarchy
Records are ranked by data quality:
| Tier | Source type | Example |
|---|---|---|
| 1 | Community astrophotography, panel-voted | OpenFajr Birmingham |
| 2 | DSLR + SQM instrumental observation | Kassim Bahali 2018 Malaysia |
| 3 | SQM photometry only | Saksono 2020 Indonesia |
| 4 | Multi-observer naked-eye, documented | Asim Yusuf UK, Hizbul Ulama UK |
| 5 | Single trained observer, per-date log | NRIAG Egypt individual nights |
| 6 | Published mean per season, time inferred | Hail Saudi Arabia (seasonal means) |
Tier 6 records (inferred times) are marked in notes. They contribute to geographic
diversity but carry more uncertainty than direct observations.
Known Limitations
- Birmingham dominance. The OpenFajr dataset provides ~4,000 records, but all of them come from one location at 52.5°N. Any ML model trained on this data must extrapolate to every other latitude. Geographic diversity is the primary gap.
- Isha data scarcity. Only ~43 Isha records vs ~4,100 Fajr records. The Isha network depends on Shafaq al-Abyad observations, which are less systematically documented.
- Atmospheric variability. The standard atmosphere model (1013.25 hPa, 15°C) does not capture day-to-day refraction variation. On cold clear nights, refraction is higher; on hot dry nights, lower. This introduces ~0.1°-0.3° uncertainty per record.
- Observer skill variation. Naked-eye observations depend on the observer's dark adaptation, experience, and site conditions. The depression angle for a given "true dawn" varies across observers by up to 2°.