# Data Collection
This page explains how to collect sighting data, run the pipeline, and add new records.

---
## What data we collect
Each record in the dataset represents one confirmed human sighting with:
| Field | Description |
| --- | --- |
| Date | The calendar date of the sighting (local date) |
| Location | Latitude, longitude, and elevation in metres |
| Observed time | The local time at which the sighting occurred |
| UTC offset | The hours offset from UTC at that date and location |
The pipeline converts each record into a solar depression angle by back-calculating the sun's
position at the UTC moment of the sighting using PyEphem with atmospheric refraction.
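
The conversion described above can be sketched in two steps: shift the local time to UTC, then ask PyEphem for the sun's altitude at that moment. This is a minimal illustration, not the pipeline's actual code. The function names `sighting_to_utc` and `depression_angle_deg` are hypothetical, and relying on PyEphem's default observer pressure for refraction is an assumption.

```python
import math
from datetime import datetime, timedelta, timezone


def sighting_to_utc(date_local: str, time_local: str, utc_offset: float) -> datetime:
    """Convert a local sighting date and time to an aware UTC datetime."""
    local = datetime.strptime(f"{date_local} {time_local}", "%Y-%m-%d %H:%M")
    # Local time = UTC + offset, so subtract the offset to get back to UTC.
    return (local - timedelta(hours=utc_offset)).replace(tzinfo=timezone.utc)


def depression_angle_deg(utc_dt: datetime, lat: float, lng: float,
                         elevation_m: float) -> float:
    """Return the sun's depression angle in degrees (positive = below the horizon)."""
    import ephem  # imported here so the UTC helper above runs without PyEphem installed

    observer = ephem.Observer()
    observer.lat = str(lat)   # strings are read as degrees; floats would be radians
    observer.lon = str(lng)
    observer.elevation = elevation_m
    observer.date = utc_dt.replace(tzinfo=None)  # PyEphem dates are naive UTC
    # observer.pressure keeps PyEphem's non-zero default, so refraction is applied.
    sun = ephem.Sun(observer)
    return -math.degrees(sun.alt)
```

For example, a sighting at 04:38 local time with a UTC offset of +1.0 maps to 03:38 UTC on the same date; that UTC moment is what gets passed to the ephemeris.
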
**Not included:** calculated prayer times, angle guesses, or aggregate statistics. The dataset
includes only records where an actual human reported "I saw true dawn at this time on this date
at this location."

---
## Running the pipeline
### Prerequisites
```bash
# Python 3.10+
python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
### Full run (recommended)
```bash
python -m src.pipeline
```
This does three things in sequence:
1. **Fetches the OpenFajr iCal feed** from `calendar.google.com`: ~4,018 community-verified
Fajr records from Birmingham, UK, 2016-2026. Requires network access.
2. **Loads manually compiled records** from `src/collect/verified_sightings.py`: ~141 records
from peer-reviewed studies, community observation programs, and national religious bodies
across 35 locations worldwide.
3. **Looks up missing elevations** via the [Open-Elevation API](https://open-elevation.com) for
any record where `elevation_m == 0`.
Output:
```
data/processed/fajr_angles.csv — ~4,105 Fajr records
data/processed/isha_angles.csv — ~43 Isha records
```
### Without elevation lookup
```bash
python -m src.pipeline --no-elevation-lookup
```
Skips the Open-Elevation API calls. Use this when:
- You're offline
- You want faster iteration while adding new records
- All records in `verified_sightings.py` already have non-zero elevations
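
For context, the lookup that this flag skips could look roughly like the sketch below. It assumes the public Open-Elevation endpoint and its documented `results[0].elevation` response shape. The helper names are hypothetical and the pipeline's real implementation may differ (for example, it may batch requests).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public endpoint of the Open-Elevation project (assumed; not taken from the pipeline).
LOOKUP_URL = "https://api.open-elevation.com/api/v1/lookup"


def build_lookup_url(lat: float, lng: float) -> str:
    """Build the GET URL for a single latitude/longitude pair."""
    return f"{LOOKUP_URL}?{urlencode({'locations': f'{lat},{lng}'})}"


def parse_elevation(payload: str) -> float:
    """Extract the elevation in metres from a JSON response body."""
    return float(json.loads(payload)["results"][0]["elevation"])


def fill_missing_elevation(record: dict, timeout: float = 10.0) -> dict:
    """Return the record unchanged unless elevation_m is 0, then query the API."""
    if record["elevation_m"] != 0:
        return record  # already has an elevation, so no network call is made
    url = build_lookup_url(record["lat"], record["lng"])
    with urlopen(url, timeout=timeout) as resp:
        elevation = parse_elevation(resp.read().decode("utf-8"))
    return {**record, "elevation_m": elevation}
```

Because `0` doubles as the "missing" marker, a site that genuinely sits at sea level would also trigger a lookup under this scheme.
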
### Interpreting the pipeline output
```
Loading OpenFajr Birmingham iCal feed...
4018 Fajr records from OpenFajr
Loading manually verified sightings...
141 manually compiled records
Computing solar depression angles...
Dropping 11 record(s) with implausible angles (< 7.0° Fajr / < 10.0° Isha):
FAJR 2021-03-27 ... angle=-18.71° — OpenFajr (openfajr.org)
...
Fajr dataset: 4105 records → data/processed/fajr_angles.csv
Isha dataset: 43 records → data/processed/isha_angles.csv
```
Records dropped with "implausible angles" are usually data-entry or DST-transition artifacts. The
quality filter drops any record whose computed depression angle is below 7° for Fajr or 10° for
Isha, which includes negative angles where the sun would be above the horizon. All dropped
records are logged so you can investigate them.

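
The filter logic amounts to a per-prayer minimum threshold. A minimal sketch, assuming each record is a dict with hypothetical `prayer` and `angle_deg` keys (the pipeline's own field names may differ):

```python
# Minimum plausible depression angle per prayer, taken from the log message above.
MIN_ANGLE_DEG = {"fajr": 7.0, "isha": 10.0}


def split_plausible(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, dropped) using the per-prayer minimum angle."""
    kept, dropped = [], []
    for rec in records:
        if rec["angle_deg"] >= MIN_ANGLE_DEG[rec["prayer"]]:
            kept.append(rec)
        else:
            dropped.append(rec)  # returned rather than discarded, so it can be logged
    return kept, dropped
```

Returning the dropped records alongside the kept ones mirrors the behaviour described above, where every dropped record is reported.
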
---
## Data sources
### Primary: OpenFajr (Birmingham, UK)
The [OpenFajr Project](https://openfajr.org) runs a continuous community astrophotography
program in Birmingham. A panel of scholars reviews daily sky photos and votes on the moment of
true dawn. The voted times are published as a public Google Calendar iCal feed.
- ~4,018 records, 2016-2026
- Location: 52.4862°N, 1.8904°W, 141 m elevation
- All times are UTC (Z suffix in iCal)
- Fetched live by the pipeline — no local cache needed
This is the highest-quality source: actual community-reviewed per-date timestamps at a single
well-documented location. It provides 98% of the Fajr training data.
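
Since the feed's timestamps carry the iCal `Z` suffix, they can be read as UTC without any offset arithmetic. The sketch below pulls `DTSTART` values out of raw iCal text using only the standard library. It is an illustration under assumptions: the helper name is hypothetical, line folding is ignored, and the pipeline may well use a dedicated iCal parser instead.

```python
import re
from datetime import datetime, timezone

# Matches UTC date-times such as "DTSTART:20240621T013800Z" (RFC 5545 form with Z suffix).
DTSTART_UTC = re.compile(r"^DTSTART(?:;[^:]*)?:(\d{8}T\d{6})Z\s*$", re.MULTILINE)


def parse_utc_starts(ical_text: str) -> list[datetime]:
    """Return every UTC DTSTART in the text as an aware datetime."""
    return [
        datetime.strptime(stamp, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
        for stamp in DTSTART_UTC.findall(ical_text)
    ]
```

Date-only entries (for example `DTSTART;VALUE=DATE:20240621`) have no `Z`-suffixed time and are skipped by this pattern.
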
### Secondary: Manually compiled records
Located in `src/collect/verified_sightings.py`. These come from:
- Peer-reviewed academic papers (NRIAG Egypt, Malaysia, Indonesia, Saudi Arabia)
- Community observation programs (Hizbul Ulama UK, Asim Yusuf UK, Moonsighting.com)
- National religious body publications (AFIC Australia, Jordanian Awqaf, etc.)
See [Data Sources](Data-Sources) for the full citation table.

---
## Adding new sighting records
Open `src/collect/verified_sightings.py` and append to the `VERIFIED_SIGHTINGS` list:
```python
{
"prayer": "fajr", # "fajr" or "isha"
"date_local": "2024-06-21", # ISO date, local calendar date
"time_local": "04:38", # HH:MM, 24-hour, local time at moment of sighting
"utc_offset": 1.0, # hours from UTC (e.g. 1.0 for BST, -5.0 for EST, 5.5 for IST)
"lat": 51.150, # decimal degrees (south = negative)
"lng": -3.650, # decimal degrees (west = negative)
"elevation_m": 430.0, # metres above sea level (0 = will be looked up by API)
"source": "Your citation here",
"notes": "Any relevant notes about conditions, method, observer count, etc.",
}
```
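
Before running the pipeline, a quick sanity check can catch common entry mistakes such as swapped coordinates or a malformed time. The checker below is a hypothetical helper, not part of the repository. It assumes the field names from the template above and treats `notes` as optional.

```python
from datetime import datetime

# Field names follow the record template above; "notes" is assumed to be optional.
REQUIRED_FIELDS = ("prayer", "date_local", "time_local", "utc_offset",
                   "lat", "lng", "elevation_m", "source")


def record_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record looks sane."""
    missing = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if missing:
        return missing

    problems = []
    if record["prayer"] not in ("fajr", "isha"):
        problems.append('prayer must be "fajr" or "isha"')
    try:
        datetime.strptime(f"{record['date_local']} {record['time_local']}", "%Y-%m-%d %H:%M")
    except ValueError:
        problems.append("date_local must be YYYY-MM-DD and time_local must be HH:MM")
    if not -90 <= record["lat"] <= 90:
        problems.append("lat must be within -90..90")
    if not -180 <= record["lng"] <= 180:
        problems.append("lng must be within -180..180")
    # Real-world UTC offsets currently range from -12 to +14 hours.
    if not -12 <= record["utc_offset"] <= 14:
        problems.append("utc_offset must be within -12..14 hours")
    return problems
```

A check like this only validates the shape of a record; whether the time and offset are actually right still shows up in the computed angle.
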
### UTC offset tips
| Region | UTC offset |
| --- | --- |
| UK (BST, summer) | +1.0 |
| UK (GMT, winter) | 0.0 |
| Egypt / Eastern Europe (EET) | +2.0 |
| Egypt / EE (summer, EEST) | +3.0 |
| Saudi Arabia / Arabia Standard | +3.0 |
| Iran (IRST) | +3.5 |
| Iran (IRDT, summer) | +4.5 |
| UAE / Oman (GST) | +4.0 |
| Pakistan (PKT) | +5.0 |
| India / Sri Lanka (IST) | +5.5 |
| Bangladesh (BST) | +6.0 |
| Malaysia / Singapore (MYT) | +8.0 |
| Indonesia West (WIB) | +7.0 |
| Indonesia East (WIT) | +9.0 |
| Australia East (AEST, winter) | +10.0 |
| Australia East (AEDT, summer) | +11.0 |
| New Zealand (NZST) | +12.0 |
| New Zealand (NZDT) | +13.0 |
| US Eastern (EST) | -5.0 |
| US Eastern (EDT) | -4.0 |
| US Central (CST) | -6.0 |
| US Central (CDT) | -5.0 |
| West Africa (WAT) | +1.0 |
| East Africa (EAT) | +3.0 |
| South Africa (SAST) | +2.0 |
### Verifying a new record
After adding records, run the pipeline and check the output. A correctly entered record should
produce an angle between 8° and 21° for Fajr, or 11° and 22° for Isha. If the pipeline drops
your record (angle below the 7° Fajr or 10° Isha threshold), the recorded time falls too close
to sunrise or sunset, so recheck the UTC offset and local time.
```bash
python -m src.pipeline --no-elevation-lookup 2>&1 | grep -A5 "Dropping"
```
---
## Priority gaps to fill
The Isha dataset is the most critical gap at ~43 records. Fajr has excellent Birmingham coverage
but needs more geographic diversity:
| Gap | What to look for |
| --- | --- |
| Isha (all regions) | Shafaq al-Abyad disappearance logs with explicit per-date timestamps |
| South America | Any Muslim community observation records with coordinates and times |
| Southeast Asia | Additional Indonesian/Malaysian per-night SQM data files |
| High latitudes (55°N+) | Scandinavian or northern Canadian observation logs |
| Sub-Saharan Africa | Observation records from West Africa, East Africa, Southern Africa |

---
*[← Home](Home) · [ML Crunching →](ML-Crunching)*