# Data Collection This page explains how to collect sighting data, run the pipeline, and add new records. --- ## What data we collect Each record in the dataset represents one confirmed human sighting with: | Field | Description | | --- | --- | | Date | The calendar date of the sighting (local date) | | Location | Latitude, longitude, and elevation in metres | | Observed time | The local time at which the sighting occurred | | UTC offset | The hours offset from UTC at that date and location | The pipeline converts each record into a solar depression angle by back-calculating the sun's position at the UTC moment of the sighting using PyEphem with atmospheric refraction. **Not included:** calculated prayer times, angle guesses, or aggregate statistics. Only records where an actual human reported "I saw true dawn at this time on this date at this location." --- ## Running the pipeline ### Prerequisites ```bash # Python 3.10+ python -m venv .venv source .venv/bin/activate # on Windows: .venv\Scripts\activate pip install -r requirements.txt ``` ### Full run (recommended) ```bash python -m src.pipeline ``` This does three things in sequence: 1. **Fetches the OpenFajr iCal feed** from `calendar.google.com` — ~4,018 community-verified Fajr records from Birmingham, UK, 2016-2026. Requires network access. 2. **Loads manually compiled records** from `src/collect/verified_sightings.py` and per-source CSVs in `data/raw/raw_sightings/`. 3. **Loads pre-computed SQM angles** from `src/collect/precomputed_angles.py` (1,621 Basthoni 2022 records where depression angles were measured directly by instrument). 4. **Looks up missing elevations** via the Open-Topo-Data API (with Open-Elevation fallback) for any record where `elevation_m == 0`. Output: ``` data/processed/fajr_angles.csv — 48,668 Fajr records data/processed/isha_angles.csv — 34,529 Isha records ``` ### Without elevation lookup ```bash python -m src.pipeline --no-elevation-lookup ``` Skips the Open-Elevation API calls. Use this when: - You're offline - You want faster iteration while adding new records - All records in `verified_sightings.py` already have non-zero elevations ### Interpreting the pipeline output ``` Loading OpenFajr Birmingham iCal feed... 4018 Fajr records from OpenFajr Loading manually verified sightings... ... genuine manually compiled records (after quality filter) Loading ingested raw CSV sightings... ... records from raw CSVs Loading pre-computed angle records (SQM instrument data)... 1621 pre-computed angle records Computing solar depression angles... Dropping N record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha): ... Fajr dataset: 48668 records → data/processed/fajr_angles.csv Isha dataset: 34529 records → data/processed/isha_angles.csv ``` Records dropped with "implausible angles" are data entry or DST-transition artifacts. The quality filter (7° for Fajr, 10° for Isha) removes physically impossible values. All dropped records are logged so you can investigate them. --- ## Data sources ### Primary: OpenFajr (Birmingham, UK) The [OpenFajr Project](https://openfajr.org) runs a continuous community astrophotography program in Birmingham. A panel of scholars reviews daily sky photos and votes on the moment of true dawn. The voted times are published as a public Google Calendar iCal feed. - ~4,018 records, 2016-2026 - Location: 52.4862°N, 1.8904°W, 141m elevation - All times are UTC (Z suffix in iCal) - Fetched live by the pipeline — no local cache needed This is the highest-quality source: actual community-reviewed per-date timestamps at a single well-documented location. It provides ~68% of the Fajr training data. ### Secondary: Basthoni 2022 SQM network (Indonesia) 1,621 per-night SQM records across 46 Indonesian sites, extracted from Basthoni's 2022 PhD dissertation at UIN Walisongo. Each record is a direct instrument measurement where the Fajr depression angle was determined by linear fitting of SQM time-series data. Loaded by `src/collect/precomputed_angles.py`. ### Tertiary: Manually compiled records Located in `src/collect/verified_sightings.py` and per-source CSVs in `data/raw/raw_sightings/`. These come from: - Peer-reviewed academic papers (NRIAG Egypt, Malaysia, Indonesia, Saudi Arabia, Mauritania) - Community observation programs (Miftahi/Shaukat UK, Asim Yusuf UK, Moonsighting.com) - Institutional SQM data (BRIN Mount Timau, BRIN multistation network) See [Data Sources](Data-Sources) for the full citation table. --- ## Adding new sighting records Open `src/collect/verified_sightings.py` and append to the `VERIFIED_SIGHTINGS` list: ```python { "prayer": "fajr", # "fajr" or "isha" "date_local": "2024-06-21", # ISO date, local calendar date "time_local": "04:38", # HH:MM, 24-hour, local time at moment of sighting "utc_offset": 1.0, # hours from UTC (e.g. 1.0 for BST, -5.0 for EST, 5.5 for IST) "lat": 51.150, # decimal degrees (south = negative) "lng": -3.650, # decimal degrees (west = negative) "elevation_m": 430.0, # metres above sea level (0 = will be looked up by API) "source": "Your citation here", "notes": "Any relevant notes about conditions, method, observer count, etc.", } ``` ### UTC offset tips | Region | UTC offset | | --- | --- | | UK (BST, summer) | +1.0 | | UK (GMT, winter) | 0.0 | | Egypt / Eastern Europe (EET) | +2.0 | | Egypt / EE (summer, EEST) | +3.0 | | Saudi Arabia / Arabia Standard | +3.0 | | Iran (IRST) | +3.5 | | Iran (IRDT, summer) | +4.5 | | UAE / Oman (GST) | +4.0 | | Pakistan (PKT) | +5.0 | | India / Sri Lanka (IST) | +5.5 | | Bangladesh (BST) | +6.0 | | Malaysia / Singapore (MYT) | +8.0 | | Indonesia West (WIB) | +7.0 | | Indonesia East (WIT) | +9.0 | | Australia East (AEST, winter) | +10.0 | | Australia East (AEDT, summer) | +11.0 | | New Zealand (NZST) | +12.0 | | New Zealand (NZDT) | +13.0 | | US Eastern (EST) | -5.0 | | US Eastern (EDT) | -4.0 | | US Central (CST) | -6.0 | | US Central (CDT) | -5.0 | | West Africa (WAT) | +1.0 | | East Africa (EAT) | +3.0 | | South Africa (SAST) | +2.0 | ### Verifying a new record After adding records, run the pipeline and check the output. A correctly entered record should produce an angle between 8° and 21° for Fajr, or 11° and 22° for Isha. If the pipeline drops your record (angle below the threshold), the time is too close to sunrise/sunset — recheck the UTC offset and local time. ```bash python -m src.pipeline --no-elevation-lookup 2>&1 | grep -A5 "Dropping" ``` --- ## Priority gaps to fill The Isha dataset is the most critical gap at 46 records. Fajr has excellent Birmingham coverage but needs more geographic diversity: | Gap | What to look for | | --- | --- | | Isha (all regions) | Shafaq al-Abyad disappearance logs with explicit per-date timestamps | | South America | Any Muslim community observation records with coordinates and times | | Southeast Asia | Additional Indonesian/Malaysian per-night SQM data files | | High latitudes (55°N+) | Scandinavian or northern Canadian observation logs | | Sub-Saharan Africa | Observation records from West Africa, East Africa, Southern Africa | --- *[← Home](Home) · [ML Crunching →](ML-Crunching)*