Commit graph

5 commits

Author SHA1 Message Date
Aric Camarata
ada08e7ec4 data: expand dataset from 5.9k to 91k records via 6 new SQM sources
Add 6 new data collection pipelines and their processed outputs:

Sources added:
- TESS/Stars4All photometer network: 37 months (Jun 2017-Aug 2020),
  ~40k raw events from 100+ European stations via Zenodo archives
- Globe at Night citizen science: 26k twilight observations (2006-2024),
  filtered from 308k total observations for solar depression 6-22 deg
- GaN-MN continuous monitoring: 45 months (Jan 2022-Sep 2025),
  ~12.5k twilight events from 88 stations across 20+ countries
- Galicia SQM network: 14 stations, 1-min resolution, 7.5k events
- Madrid/Majadahonda SQM: multi-year continuous monitoring, 3.1k events
- washetdonker.nl Netherlands: 7 stations, 3.3k morning events
- Academic papers: Jordan (Abed 2015), Fayum Egypt, India photometer
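
The Globe at Night entry above keeps only observations taken while the Sun was 6-22 degrees below the horizon. A minimal sketch of that kind of window filter follows; the `Observation` shape, the constant names, and the inclusive bounds are assumptions for illustration and are not taken from the repository.

```python
from dataclasses import dataclass

# Bounds from the "6-22 deg" window in the commit message.
# Treating both ends as inclusive is an assumption.
MIN_DEPRESSION_DEG = 6.0
MAX_DEPRESSION_DEG = 22.0


@dataclass
class Observation:  # hypothetical record shape
    station: str
    depression_deg: float  # Sun's angle below the horizon, positive downward


def keep_twilight(observations):
    """Return only observations whose solar depression lies inside the window."""
    return [
        o for o in observations
        if MIN_DEPRESSION_DEG <= o.depression_deg <= MAX_DEPRESSION_DEG
    ]


sample = [Observation("a", 3.2), Observation("b", 14.5), Observation("c", 25.0)]
print([o.station for o in keep_twilight(sample)])
```

Only station "b" survives: 3.2 degrees is still bright twilight or daylight, and 25.0 degrees is full night.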

Pipeline changes:
- ingest.py: add all new files to APPROVED_RAW_CSVS allowlist,
  fix filter to use allowlist instead of hardcoded exclusions
- .gitignore: exclude bulk raw data directories (BSRN, TESS, GaN-MN,
  washetdonker, Globe at Night downloads)
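
Filtering by allowlist rather than by exclusion means a newly downloaded file is ignored until someone approves it. The sketch below shows the idea only; the file names inside `APPROVED_RAW_CSVS` and the `select_approved` helper are hypothetical, since the real contents of ingest.py are not shown in this log.

```python
from pathlib import Path

# Hypothetical entries; the real allowlist contents are not shown in this log.
APPROVED_RAW_CSVS = frozenset({"example_source_a.csv", "example_source_b.csv"})


def select_approved(paths, allowlist=APPROVED_RAW_CSVS):
    """Keep only paths whose file name is explicitly allowlisted, sorted."""
    return sorted(p for p in map(Path, paths) if p.name in allowlist)


candidates = [
    "data/raw/example_source_a.csv",
    "data/raw/new_unreviewed.csv",      # not approved yet, so skipped
    "data/processed/fajr_angles.csv",   # pipeline output, never re-ingested
]
print([p.name for p in select_approved(candidates)])
```

With a hardcoded exclusion list, the last two files would have been ingested unless someone remembered to exclude them.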

Final dataset: 56,668 Fajr + 34,763 Isha = 91,431 total records
Previous: 5,871 Fajr + 46 Isha = 5,917 total records
2026-03-22 16:39:29 -04:00
Aric Camarata
c1eeef53c4 Expand dataset to 5,871 Fajr / 46 Isha across 114 locations
Major additions:
- Extract all 1,621 Basthoni 2022 SQM records (46 Indonesian sites,
  Lampiran 2-5) via precomputed_angles.py
- Add 9 new raw sighting CSVs: Abdel-Hadi Malaysia, BRIN multistation,
  Kassim Bahali (2017+2019), Khalifa Saudi, Moonsighting.com,
  Shaukat 2015 Blackburn UK, Walisongo Sulawesi
- Curate aggregate D0 database (115 entries) in research/

Pipeline improvements:
- Open-Topo-Data SRTM30m primary elevation API with fallback
- APPROVED_RAW_CSVS allowlist prevents circular data ingestion
- Pre-computed angle merge path (bypasses back-calculation for SQM data)
- BAD_NOTE_MARKERS quality filter for excluded sources
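
The "primary API with fallback" item can be pictured as an ordered chain of providers. This is a sketch under stated assumptions: the providers are stand-in callables rather than real HTTP clients, and the `lookup_elevation` name, the broad exception handling, and the 0.0 m default are all illustrative choices.

```python
def lookup_elevation(lat, lng, providers, default=0.0):
    """Try each provider in order; return the first usable elevation in metres.

    A provider is any callable taking (lat, lng) and returning a number or None.
    """
    for provider in providers:
        try:
            value = provider(lat, lng)
        except Exception:
            # A failing provider (timeout, bad response) must not stop the
            # chain, so move on to the next one.
            continue
        if value is not None:
            return float(value)
    return default


def primary(lat, lng):
    # Stand-in for the primary service being unavailable.
    raise TimeoutError("primary elevation service unavailable")


def secondary(lat, lng):
    # Stand-in for a fallback service that answers.
    return 1600


print(lookup_elevation(-9.6, 123.9, [primary, secondary]))
```

Passing the providers in as a list keeps the order explicit and makes the fallback behaviour testable without network access.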

Collection tools:
- BRIN multistation SQM processors
- PDF/HTML table extractor for academic papers
- Source tracking database (collection_manifest.json)

Documentation:
- Rewrite .wiki/Data.md and .wiki/Research.md from scratch
- Expand Data-Sources.md with full Basthoni Lampiran breakdown
- Add 14 researcher outreach drafts
- Update .gitignore to exclude bulk/experimental files
2026-02-28 10:51:01 -05:00
Aric Camarata
1c8187cfc4 data: deduplicate dataset — 35 Fajr + 1 Isha duplicates removed
Identified three sources of cross-source duplication and fixed each:

1. Kassim Bahali 2018 Pekan Pahang (9 records)
   Same 9 June-July 2017 DSLR observations existed in both
   verified_sightings.py (Table 2 entries) and the raw CSV
   kassim_bahali_2017_malaysia.csv. Removed from verified_sightings;
   raw CSV is the canonical source with richer cloud/conditions notes.

2. BRIN Mount Timau SQM dataset (22 records)
   timau_sqm_fajr.csv contained two SQM threshold readings per night:
   target=18.0° (75 records, primary) and target=16.51° (22 records,
   derived from the 75-night mean). Removed target=16.51 rows.
   Each night now has exactly one Fajr time.

3. Khalifa 2018 Hail Fajr (4 records)
   Original batch had times producing implausible angles: 2015-01-15
   gave 12.6° and 2015-06-21 gave 19.3° (paper reports 14.014°±0.317°).
   Removed the four bad-time records. Batch 16a replacements (computed
   from the paper mean D0) remain and give consistent 13.9-14.1° angles.

Pipeline: add automatic deduplication guard. After combining all sources,
any record whose (prayer, date, lat rounded to 3dp, lng rounded to 3dp) key
matches an earlier record is logged and dropped (first occurrence kept). This
prevents future cross-source overlaps from silently inflating the dataset or
causing the same observation to be trained on twice.
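
The guard described above can be sketched as a single pass over the combined records. The dict field names (`prayer`, `date`, `lat`, `lng`) and the function names are assumptions for illustration; only the key definition and the keep-first rule come from the commit message.

```python
import logging

logger = logging.getLogger("dedup")


def dedup_key(rec):
    """Build the (prayer, date, lat, lng) key with coordinates rounded to 3 dp."""
    return (rec["prayer"], rec["date"], round(rec["lat"], 3), round(rec["lng"], 3))


def drop_duplicates(records):
    """Keep the first record for each key; log and drop later ones."""
    seen = set()
    kept = []
    for rec in records:
        key = dedup_key(rec)
        if key in seen:
            logger.warning("dropping duplicate %s from %s", key, rec.get("source"))
            continue
        seen.add(key)
        kept.append(rec)
    return kept


combined = [
    {"prayer": "fajr", "date": "2017-06-10", "lat": 3.4921, "lng": 103.3891, "source": "raw_csv"},
    {"prayer": "fajr", "date": "2017-06-10", "lat": 3.4923, "lng": 103.3893, "source": "verified"},
    {"prayer": "isha", "date": "2017-06-10", "lat": 3.4921, "lng": 103.3891, "source": "raw_csv"},
]
print([r["source"] for r in drop_duplicates(combined)])
```

Because the first occurrence wins, the order in which sources are combined decides which copy is treated as canonical.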

Dataset: fajr_angles.csv 4535 records, isha_angles.csv 120 records
Zero duplicates confirmed.
2026-02-26 05:13:28 -05:00
Aric Camarata
cc8d3c33d1 Expand dataset to 4,396 Fajr / 70 Isha records across 80 locations
Added sources and sites:
- Mount Timau NTT (CC0 BRIN SQM dataset): 97 individual Fajr nights
  at two target angles (16.51° and 18.0°); pristine 21.86 mpsas site,
  1,600m; data.brin.go.id hdl:20.500.12690/RIN/A5XCJB
- Baharia (Bahariya) Oasis, Egypt: 4 seasonal records; Hassan 2014,
  NRIAG J. 3:23-26; naked-eye multi-site 1984-1987, mean 14.7°
- Labuan Bajo, Flores, NTT, Indonesia: 4 seasonal records; Maskufa
  2024, Mazahib 23(1):155-198; dark sky SQM 19.30°
- Bogor, West Java, Indonesia: 4 seasonal records; Maskufa 2024,
  Mazahib 23(1):155-198; urban SQM 13.58°
- Pekan, Pahang, Malaysia: 9 individual DSLR observations Jun-Jul 2017;
  Kassim Bahali 2018, Sains Malaysiana 47(11):2877-2885; D0 range
  -15.45° to -18.06°
- Kuala Terengganu, Malaysia: 1 record; Kassim Bahali 2018 Fig 4,
  D0=-16°, time inferred via PyEphem
- Additional batch 3 aggregate sites: Tubruq Libya (3 subsets),
  Fayum Egypt, Biak Papua, Manado North Sulawesi, Lombok NTB,
  Makkah, Madinah, Karachi, Ankara, Marrakech, Kano, Johannesburg,
  Dhaka, Alexandria

Source correction: removed incorrect Setyanto 2021 Al-Hilal
attribution from Labuan Bajo and Bogor (that paper covers zodiac
light, not Fajr, at different Indonesian sites)
2026-02-25 20:44:37 -05:00
Aric Camarata
6e0f4a679c Rebuild as Python data science project
Replaces the original JS calibration library with a pure Python pipeline
for collecting and back-calculating solar depression angles from human-verified
Fajr and Isha prayer sightings.

What this does:
- src/pipeline.py: master pipeline; fetches iCal + manual records, back-calculates
  angles via PyEphem, applies quality filters, exports two clean CSVs
- src/collect/openfajr.py: parses the OpenFajr Birmingham iCal feed (~4,018 records)
- src/collect/verified_sightings.py: manually compiled records from peer-reviewed
  studies (Egypt, Saudi Arabia, Malaysia, Indonesia, UK, USA, Canada, and more)
- src/angle_calc.py: PyEphem back-calculation with atmospheric refraction
- src/elevation.py: Open-Elevation API batch lookup
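
Back-calculation means taking an observed sighting time and asking how far below the horizon the Sun was at that instant. The sketch below illustrates the idea with the standard low-precision solar position formulae and the standard library only. It is not the PyEphem-based code in src/angle_calc.py, it ignores atmospheric refraction and elevation, and the function name and signature are hypothetical.

```python
import math
from datetime import datetime, timezone


def solar_depression_deg(when_utc, lat_deg, lon_deg):
    """Approximate geometric solar depression in degrees (positive below horizon).

    when_utc must be timezone-aware; lon_deg is positive east.
    """
    # Days since the J2000.0 epoch (2000-01-01 12:00 UTC).
    n = when_utc.timestamp() / 86400.0 + 2440587.5 - 2451545.0
    mean_lon = (280.460 + 0.9856474 * n) % 360.0
    g = math.radians((357.528 + 0.9856003 * n) % 360.0)  # mean anomaly
    ecl_lon = math.radians(mean_lon + 1.915 * math.sin(g) + 0.020 * math.sin(2 * g))
    obliquity = math.radians(23.439 - 0.0000004 * n)
    # Ecliptic longitude -> right ascension and declination.
    ra = math.atan2(math.cos(obliquity) * math.sin(ecl_lon), math.cos(ecl_lon))
    dec = math.asin(math.sin(obliquity) * math.sin(ecl_lon))
    # Greenwich mean sidereal time, then the Sun's local hour angle.
    gmst_hours = (18.697374558 + 24.06570982441908 * n) % 24.0
    hour_angle = math.radians(gmst_hours * 15.0 + lon_deg) - ra
    lat = math.radians(lat_deg)
    sin_alt = (math.sin(lat) * math.sin(dec)
               + math.cos(lat) * math.cos(dec) * math.cos(hour_angle))
    sin_alt = max(-1.0, min(1.0, sin_alt))  # guard against rounding overshoot
    return -math.degrees(math.asin(sin_alt))


# Sanity check at the March 2024 equinox on the equator and prime meridian:
# the Sun should be almost straight down at midnight, almost overhead at noon.
midnight = datetime(2024, 3, 20, 0, 0, tzinfo=timezone.utc)
noon = datetime(2024, 3, 20, 12, 0, tzinfo=timezone.utc)
print(solar_depression_deg(midnight, 0.0, 0.0), solar_depression_deg(noon, 0.0, 0.0))
```

For a real sighting, the reported local time would first be converted to UTC, and the returned value would be the back-calculated angle for that record.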

Datasets generated:
- data/processed/fajr_angles.csv: 4,105 confirmed Fajr records, 35 locations,
  latitude range -37.8 to 53.7 degrees, date range 1985-2026
- data/processed/isha_angles.csv: 43 confirmed Isha records, 20+ locations

Also includes:
- notebooks/01_exploratory_analysis.ipynb: latitude, time-of-year (TOY), and elevation pattern analysis
- research/: academic paper summaries (not training data)
- data/raw/sources.md: full citation table for all data sources
2026-02-25 19:32:47 -05:00