# Data Collection
This page explains how to collect sighting data, run the pipeline, and add new records.
---
## What data we collect
Each record in the dataset represents one confirmed human sighting with:
| Field | Description |
| --- | --- |
| Date | The calendar date of the sighting (local date) |
| Location | Latitude, longitude, and elevation in metres |
| Observed time | The local time at which the sighting occurred |
| UTC offset | The hours offset from UTC at that date and location |
The pipeline converts each record into a solar depression angle by back-calculating the sun's
position at the UTC moment of the sighting using PyEphem with atmospheric refraction.
**Not included:** calculated prayer times, angle guesses, or aggregate statistics. The dataset
contains only records where an actual human reported "I saw true dawn at this time on this date
at this location."
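The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `sighting_to_utc` and `depression_angle` are hypothetical, and the PyEphem settings (default pressure, which leaves refraction modelling on) are assumptions about how the back-calculation might be configured.

```python
import math
from datetime import datetime, timedelta, timezone


def sighting_to_utc(date_local, time_local, utc_offset):
    """Combine a local date, local HH:MM time and UTC offset (hours) into a UTC datetime."""
    local_tz = timezone(timedelta(hours=utc_offset))
    local_dt = datetime.strptime(f"{date_local} {time_local}", "%Y-%m-%d %H:%M")
    return local_dt.replace(tzinfo=local_tz).astimezone(timezone.utc)


def depression_angle(utc_dt, lat, lng, elevation_m):
    """Return the sun's depression angle in degrees (positive = below the horizon).

    Hypothetical helper; requires the third-party PyEphem package.
    """
    import ephem  # imported lazily so sighting_to_utc works without PyEphem installed

    obs = ephem.Observer()
    obs.lat = str(lat)    # PyEphem parses strings as degrees (floats would be radians)
    obs.lon = str(lng)
    obs.elevation = elevation_m
    obs.date = utc_dt.replace(tzinfo=None)  # pass a naive datetime that is already UTC
    # obs.pressure is left at its default, so PyEphem applies atmospheric refraction.
    sun = ephem.Sun()
    sun.compute(obs)
    return -math.degrees(sun.alt)
```

For example, a sighting at 04:38 local time with a UTC offset of +1.0 corresponds to 03:38 UTC, and that UTC moment is what gets passed to the solar position calculation.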
---
## Running the pipeline
### Prerequisites
```bash
# Python 3.10+
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Full run (recommended)
```bash
python -m src.pipeline
```
This does four things in sequence:
1. **Fetches the OpenFajr iCal feed** from `calendar.google.com` — ~4,018 community-verified
Fajr records from Birmingham, UK, 2016-2026. Requires network access.
2. **Loads manually compiled records** from `src/collect/verified_sightings.py` and per-source
CSVs in `data/raw/raw_sightings/`.
3. **Loads pre-computed SQM angles** from `src/collect/precomputed_angles.py` (1,621 Basthoni
2022 records where depression angles were measured directly by instrument).
4. **Looks up missing elevations** via the Open-Topo-Data API (with Open-Elevation fallback)
for any record where `elevation_m == 0`.
Output:
```
data/processed/fajr_angles.csv — 48,668 Fajr records
data/processed/isha_angles.csv — 34,529 Isha records
```
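The elevation lookup in step 4 could look something like the sketch below. It is an assumption-laden illustration rather than the pipeline's implementation: the function names are hypothetical, the `srtm90m` dataset is an arbitrary choice, and the code relies only on both services returning a `results` list whose first entry has an `elevation` field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public endpoints; the "srtm90m" dataset is an assumed choice for this sketch.
OPEN_TOPO_DATA = "https://api.opentopodata.org/v1/srtm90m"
OPEN_ELEVATION = "https://api.open-elevation.com/api/v1/lookup"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def lookup_elevation(lat, lng, fetch=fetch_json):
    """Try Open-Topo-Data first, then Open-Elevation; return metres or None."""
    query = urlencode({"locations": f"{lat},{lng}"})
    for base in (OPEN_TOPO_DATA, OPEN_ELEVATION):
        try:
            payload = fetch(f"{base}?{query}")
            elevation = payload["results"][0]["elevation"]
            if elevation is not None:
                return float(elevation)
        except (OSError, KeyError, IndexError, TypeError, ValueError):
            continue  # network or parsing problem: fall through to the next service
    return None


def fill_missing_elevations(records, fetch=fetch_json):
    """Fill elevation_m in place for records where it is 0; leave the rest untouched."""
    for rec in records:
        if rec["elevation_m"] == 0:
            found = lookup_elevation(rec["lat"], rec["lng"], fetch)
            if found is not None:
                rec["elevation_m"] = found
    return records
```

Passing the fetch function as a parameter keeps the fallback logic testable offline, which matters because both services are rate-limited public APIs.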
### Without elevation lookup
```bash
python -m src.pipeline --no-elevation-lookup
```
Skips the elevation API calls (Open-Topo-Data and the Open-Elevation fallback). Use this when:
- You're offline
- You want faster iteration while adding new records
- All records in `verified_sightings.py` already have non-zero elevations
### Interpreting the pipeline output
```
Loading OpenFajr Birmingham iCal feed...
4018 Fajr records from OpenFajr
Loading manually verified sightings...
... genuine manually compiled records (after quality filter)
Loading ingested raw CSV sightings...
... records from raw CSVs
Loading pre-computed angle records (SQM instrument data)...
1621 pre-computed angle records
Computing solar depression angles...
Dropping N record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha):
...
Fajr dataset: 48668 records → data/processed/fajr_angles.csv
Isha dataset: 34529 records → data/processed/isha_angles.csv
```
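Records dropped for "implausible angles" are usually data-entry errors or DST-transition
artifacts. The quality filter removes any record whose computed depression angle falls below 7°
for Fajr or 10° for Isha, because such a time is too close to sunrise or sunset to be a genuine
sighting. All dropped records are logged so you can investigate them.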
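As a rough illustration of that filter, the sketch below splits records into kept and dropped lists using the thresholds quoted in the log line. The field name `angle_deg`, the logger name, and the function name are hypothetical; only the 7.0° and 10.0° cut-offs come from this page.

```python
import logging

logger = logging.getLogger("pipeline.quality")  # hypothetical logger name

# Minimum plausible depression angle per prayer, in degrees (from the pipeline log).
MIN_ANGLE_DEG = {"fajr": 7.0, "isha": 10.0}


def drop_implausible(records):
    """Split records into (kept, dropped) by the per-prayer minimum angle."""
    kept, dropped = [], []
    for rec in records:
        if rec["angle_deg"] < MIN_ANGLE_DEG[rec["prayer"]]:
            dropped.append(rec)
        else:
            kept.append(rec)
    if dropped:
        logger.warning("Dropping %d record(s) with implausible angles", len(dropped))
        for rec in dropped:
            # Log each dropped record so it can be investigated later.
            logger.warning("  %s angle=%.2f source=%s",
                           rec["prayer"], rec["angle_deg"], rec.get("source", "?"))
    return kept, dropped
```

Returning the dropped records alongside the kept ones makes it easy to inspect them for a wrong UTC offset or a mistyped time.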
---
## Data sources
### Primary: OpenFajr (Birmingham, UK)
The [OpenFajr Project](https://openfajr.org) runs a continuous community astrophotography
program in Birmingham. A panel of scholars reviews daily sky photos and votes on the moment of
true dawn. The voted times are published as a public Google Calendar iCal feed.
- ~4,018 records, 2016-2026
- Location: 52.4862°N, 1.8904°W, 141 m elevation
- All times are UTC (Z suffix in iCal)
- Fetched live by the pipeline — no local cache needed
This is the highest-quality source: actual community-reviewed per-date timestamps at a single
well-documented location. With ~4,018 of the 48,668 Fajr records, it provides roughly 8% of the
Fajr training data.
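Because the feed's times carry the `Z` suffix, they can be read as UTC without any offset handling. The sketch below shows one way to pull such timestamps out of iCal text using only the standard library. It assumes the voted time is stored in each event's `DTSTART` property, which this page does not confirm, and the function name is hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches DTSTART lines whose value ends in "Z" (UTC), with optional parameters.
DTSTART_UTC = re.compile(r"^DTSTART(?:;[^:]*)?:(\d{8}T\d{6})Z\s*$", re.MULTILINE)


def parse_utc_starts(ical_text):
    """Return every UTC DTSTART in the iCal text as an aware datetime."""
    return [
        datetime.strptime(stamp, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
        for stamp in DTSTART_UTC.findall(ical_text)
    ]


# Illustrative event only; the values are made up, not taken from the real feed.
example = (
    "BEGIN:VEVENT\r\n"
    "DTSTART:20240621T023800Z\r\n"
    "DTEND:20240621T024800Z\r\n"
    "SUMMARY:Fajr\r\n"
    "END:VEVENT\r\n"
)
starts = parse_utc_starts(example)
```

A full iCal parser would also handle folded lines and `TZID` parameters; this pattern deliberately ignores anything that is not a plain UTC timestamp.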
### Secondary: Basthoni 2022 SQM network (Indonesia)
1,621 per-night SQM records across 46 Indonesian sites, extracted from Basthoni's 2022 PhD
dissertation at UIN Walisongo. Each record is a direct instrument measurement where the Fajr
depression angle was determined by linear fitting of SQM time-series data. Loaded by
`src/collect/precomputed_angles.py`.
### Tertiary: Manually compiled records
Located in `src/collect/verified_sightings.py` and per-source CSVs in `data/raw/raw_sightings/`.
These come from:
- Peer-reviewed academic papers (NRIAG Egypt, Malaysia, Indonesia, Saudi Arabia, Mauritania)
- Community observation programs (Miftahi/Shaukat UK, Asim Yusuf UK, Moonsighting.com)
- Institutional SQM data (BRIN Mount Timau, BRIN multistation network)
See [Data Sources](Data-Sources) for the full citation table.
---
## Adding new sighting records
Open `src/collect/verified_sightings.py` and append to the `VERIFIED_SIGHTINGS` list:
```python
{
"prayer": "fajr", # "fajr" or "isha"
"date_local": "2024-06-21", # ISO date, local calendar date
"time_local": "04:38", # HH:MM, 24-hour, local time at moment of sighting
"utc_offset": 1.0, # hours from UTC (e.g. 1.0 for BST, -5.0 for EST, 5.5 for IST)
"lat": 51.150, # decimal degrees (south = negative)
"lng": -3.650, # decimal degrees (west = negative)
"elevation_m": 430.0, # metres above sea level (0 = will be looked up by API)
"source": "Your citation here",
"notes": "Any relevant notes about conditions, method, observer count, etc.",
}
```
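Before running the pipeline, a quick sanity check on a new entry can catch the most common mistakes. The helper below is a hypothetical sketch, not part of the repository: it checks only the field names shown above, and the accepted ranges are assumptions.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "prayer", "date_local", "time_local", "utc_offset",
    "lat", "lng", "elevation_m", "source",
)


def validate_sighting(rec):
    """Return a list of problems found in a sighting dict (empty list = looks fine)."""
    missing = [key for key in REQUIRED_FIELDS if key not in rec]
    if missing:
        return [f"missing field: {key}" for key in missing]

    errors = []
    if rec["prayer"] not in ("fajr", "isha"):
        errors.append("prayer must be 'fajr' or 'isha'")
    try:
        datetime.strptime(rec["date_local"], "%Y-%m-%d")
    except ValueError:
        errors.append("date_local must be an ISO date (YYYY-MM-DD)")
    try:
        datetime.strptime(rec["time_local"], "%H:%M")
    except ValueError:
        errors.append("time_local must be 24-hour HH:MM")
    if not -12.0 <= rec["utc_offset"] <= 14.0:
        errors.append("utc_offset must be between -12 and +14 hours")
    if not -90.0 <= rec["lat"] <= 90.0:
        errors.append("lat must be between -90 and 90")
    if not -180.0 <= rec["lng"] <= 180.0:
        errors.append("lng must be between -180 and 180")
    return errors
```

A check like this cannot tell whether the UTC offset is right for the date in question, so the angle check in the pipeline output is still the final test.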
### UTC offset tips
| Region | UTC offset |
| --- | --- |
| UK (BST, summer) | +1.0 |
| UK (GMT, winter) | 0.0 |
| Egypt / Eastern Europe (EET) | +2.0 |
| Egypt / EE (summer, EEST) | +3.0 |
| Saudi Arabia / Arabia Standard | +3.0 |
| Iran (IRST) | +3.5 |
| Iran (IRDT, summer) | +4.5 |
| UAE / Oman (GST) | +4.0 |
| Pakistan (PKT) | +5.0 |
| India / Sri Lanka (IST) | +5.5 |
| Bangladesh (BST) | +6.0 |
| Malaysia / Singapore (MYT) | +8.0 |
| Indonesia West (WIB) | +7.0 |
| Indonesia East (WIT) | +9.0 |
| Australia East (AEST, winter) | +10.0 |
| Australia East (AEDT, summer) | +11.0 |
| New Zealand (NZST) | +12.0 |
| New Zealand (NZDT) | +13.0 |
| US Eastern (EST) | -5.0 |
| US Eastern (EDT) | -4.0 |
| US Central (CST) | -6.0 |
| US Central (CDT) | -5.0 |
| West Africa (WAT) | +1.0 |
| East Africa (EAT) | +3.0 |
| South Africa (SAST) | +2.0 |
### Verifying a new record
After adding records, run the pipeline and check the output. A correctly entered record should
produce an angle between 8° and 21° for Fajr, or 11° and 22° for Isha. If the pipeline drops
your record (angle below the threshold), the time is too close to sunrise/sunset — recheck the
UTC offset and local time.
```bash
python -m src.pipeline --no-elevation-lookup 2>&1 | grep -A5 "Dropping"
```
---
## Priority gaps to fill
The Isha dataset is the most critical gap at 46 records. Fajr has excellent Birmingham coverage
but needs more geographic diversity:
| Gap | What to look for |
| --- | --- |
| Isha (all regions) | Shafaq al-Abyad disappearance logs with explicit per-date timestamps |
| South America | Any Muslim community observation records with coordinates and times |
| Southeast Asia | Additional Indonesian/Malaysian per-night SQM data files |
| High latitudes (55°N+) | Scandinavian or northern Canadian observation logs |
| Sub-Saharan Africa | Observation records from West Africa, East Africa, Southern Africa |
---
*[← Home](Home) · [ML Crunching →](ML-Crunching)*