Tracked: BSRN/SURFRAD processors (reference, excluded from pipeline),
GaN-MN downloader, academic paper fetcher, Madrid SQM processor,
ML analysis scripts (src/analyze/), umsu_medan_2024 raw sightings.
Gitignored: global_extrapolator, instant_1m_injector/vectorized,
massive_harvest_engine, massive_sqm_downloader, global_sqm_harvester,
run_infinite_pipeline.sh, run_massive_collection.sh, search_papers.py
(agent-generated experimental scripts, not part of core pipeline).
Add 6 new data collection pipelines and their processed outputs.
Sources added:
- TESS/Stars4All photometer network: 37 months (Jun 2017-Aug 2020),
~40k raw events from 100+ European stations via Zenodo archives
- Globe at Night citizen science: 26k twilight observations (2006-2024),
selected from 308k total observations by solar depression angle (6-22°)
- GaN-MN continuous monitoring: 45 months (Jan 2022-Sep 2025),
~12.5k twilight events from 88 stations across 20+ countries
- Galicia SQM network: 14 stations, 1-min resolution, 7.5k events
- Madrid/Majadahonda SQM: multi-year continuous monitoring, 3.1k events
- washetdonker.nl Netherlands: 7 stations, 3.3k morning events
- Academic papers: Jordan (Abed 2015), Fayum Egypt, India photometer
Pipeline changes:
- ingest.py: add all new files to APPROVED_RAW_CSVS allowlist,
fix filter to use allowlist instead of hardcoded exclusions
- .gitignore: exclude bulk raw data directories (BSRN, TESS, GaN-MN,
washetdonker, Globe at Night downloads)
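The allowlist filter in ingest.py could look roughly like the sketch below. Only the name APPROVED_RAW_CSVS comes from this commit; the file names in it, the select_raw_csvs helper, and the flat directory layout are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical entries; the real allowlist lives in ingest.py.
APPROVED_RAW_CSVS = {
    "example_source_a.csv",
    "example_source_b.csv",
}


def select_raw_csvs(raw_dir, approved=APPROVED_RAW_CSVS):
    """Return the CSV files in raw_dir whose names are on the allowlist.

    Anything not explicitly approved is skipped, so a newly downloaded
    file is ignored until it is added to the allowlist.
    """
    return sorted(
        path for path in Path(raw_dir).glob("*.csv") if path.name in approved
    )
```

Compared with hardcoded exclusions, an allowlist fails closed: an unexpected file in the raw directory is ignored rather than ingested.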
Final dataset: 56,668 Fajr + 34,763 Isha = 91,431 total records
Previous: 5,871 Fajr + 46 Isha = 5,917 total records
Identify and fix three sources of cross-source duplication:
1. Kassim Bahali 2018 Pekan Pahang (9 records)
The same 9 DSLR observations from June-July 2017 existed in both
verified_sightings.py (Table 2 entries) and the raw CSV
kassim_bahali_2017_malaysia.csv. Removed them from verified_sightings.py;
the raw CSV is the canonical source, with richer cloud/conditions notes.
2. BRIN Mount Timau SQM dataset (22 records)
timau_sqm_fajr.csv contained two SQM threshold readings per night:
target=18.0° (75 records, primary) and target=16.51° (22 records,
derived from the 75-night mean). Removed the target=16.51° rows, so
each night now has exactly one Fajr time.
3. Khalifa 2018 Hail Fajr (4 records)
Original batch had times producing implausible angles: 2015-01-15
gave 12.6° and 2015-06-21 gave 19.3° (paper reports 14.014°±0.317°).
Removed the four bad-time records. Batch 16a replacements (computed
from the paper mean D0) remain and give consistent 13.9-14.1° angles.
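For the Timau cleanup in item 2, the row filter amounts to something like the sketch below. The column names "date" and "target" and the helper name are assumptions; the actual layout of timau_sqm_fajr.csv is not shown in this commit.

```python
from collections import Counter


def drop_derived_threshold(rows, derived_target=16.51, tol=1e-6):
    """Keep only rows whose target differs from the derived threshold.

    rows: iterable of dicts with assumed keys "date" and "target".
    Raises ValueError if any date still has more than one row afterwards.
    """
    kept = [r for r in rows if abs(float(r["target"]) - derived_target) > tol]
    per_night = Counter(r["date"] for r in kept)
    repeated = sorted(d for d, n in per_night.items() if n > 1)
    if repeated:
        raise ValueError(f"multiple Fajr rows remain for: {repeated}")
    return kept
```

The check at the end enforces the "one Fajr time per night" invariant instead of assuming it.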
Pipeline: add an automatic deduplication guard. After all sources are combined,
any record that repeats an earlier (prayer, date, lat, lng) key, with lat and
lng rounded to 3 decimal places, is logged and dropped (the first occurrence is
kept). This prevents future cross-source overlaps from silently inflating the
dataset or putting the same observation into training twice.
Dataset: fajr_angles.csv 4,535 records, isha_angles.csv 120 records
Zero duplicates confirmed.
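A minimal sketch of the deduplication guard described above, assuming records are dicts with keys "prayer", "date", "lat" and "lng". The function name, the logger setup, and the record layout are illustrative assumptions, not the pipeline's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def drop_duplicates(records, ndigits=3):
    """Drop records that repeat an earlier (prayer, date, lat, lng) key.

    records: iterable of dicts with assumed keys "prayer", "date", "lat", "lng".
    Coordinates are rounded to ndigits decimal places before comparison,
    and the first occurrence of each key is kept.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec["prayer"],
            rec["date"],
            round(float(rec["lat"]), ndigits),
            round(float(rec["lng"]), ndigits),
        )
        if key in seen:
            # Log rather than fail, so the overlap is visible in the run output.
            logger.warning("dropping duplicate observation %s", key)
            continue
        seen.add(key)
        unique.append(rec)
    return unique
```

Because "keep first" depends on input order, the order in which sources are combined decides which copy survives. Rounding to 3 decimal places groups coordinates within roughly 100 m in latitude, although two nearby points that fall on opposite sides of a rounding boundary will still be treated as distinct.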