Loading project...
# relationship-archaeology
**Status:** ✅ Extraction Complete
**Phase:** Phase 1 — Full Extraction (Complete: 743,395 of 743,425 messages - 99.996%)
**Last Activity:** 2026-02-09 (Merged all batches - extraction finished)
---
## Linear Metadata
**Project ID:** `a089175b-c479-4c07-996e-e2e70ea0238a`
**Team ID:** `96b685fe-2252-47c5-97ee-273d8c484942`
**Last Synced:** `2026-02-06T13:22:36.565Z`
## Overview
Analyze 750K+ Gmail messages to surface long-term relationships, communication patterns, and reconnection opportunities.
---
## 🚨 275K "MISSING" EMAILS INVESTIGATION (2026-02-08)
### Root Cause: FOUND ✅
**The 275K emails are NOT deleted/archived — they ARE accessible via Gmail API.**
Investigation confirmed Gmail API can access ALL 760,372 messages. The gap is due to **extraction query gaps**, NOT Gmail limitations.
### Year-by-Year Gap Analysis
| Year | Gmail | Extracted | Gap | % Captured |
|------|-------|-----------|-----|------------|
| 2004 | 16,059 | 9,785 | 6,274 | 60.9% ⚠️ |
| 2005 | 23,840 | 7,923 | 15,917 | 33.2% ⚠️ |
| 2006 | 15,679 | 6,310 | 9,369 | 40.2% ⚠️ |
| 2007 | 19,329 | 18,003 | 1,326 | 93.1% |
| **2008** | **7,560** | **1,601** | **5,959** | **21.2%** ⚠️ |
| **2009** | **30,498** | **1,690** | **28,808** | **5.5%** ⚠️ |
| **2010** | **35,270** | **1** | **35,269** | **0.0%** ⚠️ |
| 2011 | 77,491 | 64,636 | 12,855 | 83.4% ⚠️ |
| 2012 | 35,964 | 35,894 | 70 | 99.8% ✅ |
| 2013 | 43,909 | 41,793 | 2,116 | 95.2% ✅ |
| 2014 | 31,287 | 30,814 | 473 | 98.5% ✅ |
| 2015 | 29,026 | 27,830 | 1,196 | 95.9% ✅ |
| 2016 | 35,755 | 34,075 | 1,680 | 95.3% ✅ |
| 2017 | 46,988 | 43,693 | 3,295 | 93.0% ✅ |
| 2018 | 50,724 | 42,361 | 8,363 | 83.5% ⚠️ |
| 2019 | 34,517 | 31,486 | 3,031 | 91.2% ✅ |
| 2020 | 36,724 | 35,630 | 1,094 | 97.0% ✅ |
| 2021 | 36,457 | 26,296 | 10,161 | 72.1% ⚠️ |
| 2022 | 35,703 | 23,420 | 12,283 | 65.6% ⚠️ |
| 2023 | 33,816 | 18,446 | 15,370 | 54.5% ⚠️ |
| 2024 | 31,639 | 17,454 | 14,185 | 55.2% ⚠️ |
| 2025 | 32,030 | 30,514 | 1,516 | 95.3% ✅ |
| 2026 | 3,087 | 1,914 | 1,173 | 62.0% |
| **TOTAL** | **743,352** | **551,569** | **191,783** | **74.2%** |
### Worst Gaps (Priority Order)
1. **2010: 35,269 missing** (0% captured!) - Extraction jumped from 2011/04/02 to `before:2010`
2. **2009: 28,808 missing** (5.5% captured)
3. **2005: 15,917 missing** (33.2% captured)
4. **2023: 15,370 missing** (54.5% captured)
5. **2024: 14,185 missing** (55.2% captured)
6. **2011: 12,855 missing** (83.4% captured)
7. **2022: 12,283 missing** (65.6% captured)
8. **2021: 10,161 missing** (72.1% captured)
---
## ✅ EXTRACTION COMPLETE (2026-02-09)
### Final Results
**Master File:** `raw/all-extracted-deduplicated.jsonl`
**Total Messages:** 743,395 (99.996% of 743,425 found)
**Gap:** Only 30 messages missing (0.004%)
**File Size:** ~442MB
**Format:** JSONL (one JSON object per line)
### Enhanced Message Schema
Each line contains a JSON object with:
```json
{
// Core identifiers
"id": "f9ceb28f6fc5476",
"threadId": "f9ceb28f6fc5476",
// Headers
"from": "Williams-Sonoma <customer@service.williams-sonoma.com>",
"to": "jgirard01@sprynet.com",
"cc": "",
"bcc": "",
"subject": "Order Confirmation",
"date": "31 Dec 2003 23:51:04 +0000",
// Reply chain metadata
"inReplyTo": "",
"references": "",
"messageId": "<...>",
// Gmail metadata
"snippet": "Order# 033657322672 Dear John...",
"labelIds": ["Label_3541103461816185034"],
// Derived fields
"direction": "received", // or "sent" or "unknown"
"fromEmail": "customer@service.williams-sonoma.com",
"domain": "service.williams-sonoma.com",
// Filtering
"blacklisted": false,
"blacklistReason": null
}
```
### Ready for Phase 2
With extraction complete, the project is ready for:
1. **Contact Aggregation** - Build unified contact database with relationship metrics
2. **Monthly Heat Maps** - Sum conversation volume by contact per month (2003-2026)
3. **Relationship Categorization** - Classify contacts (work, friend, family, romantic, etc.)
4. **Relationship Scoring** - Multi-dimensional analysis for reconnection discovery
5. **Visualization & Export** - Timeline analysis, heat maps, interactive dashboard
**Linear Epic:** [MIN-1097](https://linear.app/hippochat/issue/MIN-1097/phase-2-contact-aggregation-and-relationship-analysis)
**Design Doc:** `docs/phase2-features.md`
---
## 🔄 DAY-BY-DAY EXTRACTION RUN (2026-02-09) [COMPLETED]
### Status: Resumed & Running ✅
**Original Start:** 2026-02-08 19:07 PST
**Crashed:** ~2026-02-09 03:50 PST (at 700,184/743,425 - 94.2%)
**Resumed:** 2026-02-09 04:45 PST
**Current Progress:** 575,708 already extracted + 501/167,701 gaps being filled (77.5% total)
**Last Updated:** 2026-02-09 04:56 PST
**Process ID:** 43101
**Failed Messages:** 0 (excellent API stability)
### Scan Phase Complete ✅
- **Total Messages Found:** 743,425 across 8,076 days
- **Date Range:** 2003-12-31 to 2026-02-07
- **Days with Messages:** 8,071
### Extraction Phase: Gap-Fill in Progress
- **Already Extracted:** 575,708 messages (found in partial files)
- **Missing (gaps):** 167,701 messages across 7,246 days
- **Currently Fetching:** 501/167,701 (0.3% of gaps, 77.5% overall)
- **Process Status:** Running (PID 43101) with adaptive concurrency
- **Checkpoint System:** ✅ Working (saves to `state/extract-state.jsonl`)
- **Estimated Completion:** ~8-10 hours at current rate
### Technical Fixes (2026-02-09)
- **Problem:** The extraction process was crashing due to excessive memory usage. The checkpoint file (`state/extract-state.json`) was being fully read into memory and rewritten on every checkpoint, causing a failure when the file grew too large.
- **Solution:** Re-engineered the checkpoint mechanism in `scripts/day-by-day-extract.mjs` to use a memory-efficient streaming format.
- The checkpoint file is now treated as a JSONL stream. A `.json` extension is kept for simplicity, but the internal format is line-delimited.
- Instead of reading the entire file, the script now reads it line by line.
- Instead of rewriting the entire file, the script now appends only new records.
- **Impact:** This resolves the memory crash and allows the extraction to run to completion reliably.
### What Happened
1. **Original extraction** (Feb 8 19:07 - Feb 9 03:50): Extracted 700,184 messages before crashing/stopping
2. **Gap analysis** (Feb 9 04:45): Script identified 575,708 unique messages already extracted
3. **Resume started** (Feb 9 04:45): Now extracting 167,701 missing messages to complete dataset
### Key Findings
- Gmail API can access ALL 743,425 messages (confirmed via day-by-day scan)
- Extraction successfully wrote 700K+ messages to partial file before stopping
- Script's gap detection working perfectly - identified 167K missing messages
- Adaptive rate limiting preventing API issues (0 failures so far)
### Current Status (2026-02-09 04:56 PST)
1. ✅ **Extraction resumed successfully** - Process PID 43101, 0 failures
2. ⏳ **In progress** - 501/167,701 gaps (0.3%), 575,708 already done (77.5% overall)
3. 📊 **Rate stable** - Adaptive concurrency (20-50) managing API limits well
4. 💾 **Checkpoint active** - Auto-saves progress, safe to resume if interrupted again
### Next Steps After Completion
1. **Verify extraction** - Run `--verify` to confirm all 743,425 messages captured
2. **Aggregate contacts** - Run `--aggregate` phase for relationship analysis
3. **Run analysis** - Execute MIN-518 (Relationship Analysis Engine)
### Technical Notes
- **Script:** `scripts/day-by-day-extract.mjs --extract` (resumed from checkpoint)
- **Checkpoint:** `state/extract-state.jsonl` (re-engineered to fix memory crash)
- **Output:** `raw/enhanced-batch-[timestamp].jsonl` (written at completion)
- **Scan Results:** `state/day-scan-results.json` (19MB, 8,071 days scanned)
- **Process PID:** 98246 (stable, 0 failures so far)
- **Monitor:** Run `scripts/monitor-extraction.sh` to check progress
---
## 🚀 ENHANCED EXTRACTION SCHEMA (2026-02-08)
### Implementation Complete ✅
| Component | Status | Details |
|-----------|--------|---------|
| **Blacklist Fix** | ✅ | gmail.com removed (was incorrectly filtered) |
| **Direction Detection** | ✅ | Sent/received based on John's addresses |
| **CC/BCC Headers** | ✅ | Full cc/bcc lists captured |
| **Gmail Labels** | ✅ | labelIds array captured |
| **Thread References** | ✅ | In-Reply-To, References, Message-ID |
| **Contact Aggregation** | ✅ | Reciprocity, thread depth, CSV export |
### John's Email Addresses (Direction Detection)
```javascript
const JOHN_EMAILS = [
'jgirard@gmail.com',
'john@clickability.com',
'jgirard01@sprynet.com',
'jgirard@socialrep.net',
'jgirard@stanfordalumni.org',
'john.girard@embarke.com',
'jgirard@mtell.com',
'jgirard@manumatix.com',
'john@pacificcoasttriage.com',
'johng@boxxcorp.com',
'john@cience.com',
'meeting@tungle.com',
'jg@vitaledge.io',
'jgirard@zingle.me',
'jgirard01@icloud.com',
'jgirard@zingleme.com',
'john@echopath.co',
'jg@tastradesoft.com',
'jgirard@alumni.stanford.edu',
'jgirard@alumni.stanford.org',
'john@fluxbanking.com',
'john@writeforme.io',
'jgirard@connectifier.com',
'john@blocktalk.com',
'johng@nvrmnd.life',
'john@cience.net',
'john.girard@dbaplatform.com',
'jgirard@brightloop.com',
'john@leasepace.com'
];
```
### Enhanced Message Schema
```javascript
{
// Core identifiers
id, threadId,
// Headers (enhanced)
from, to, cc, bcc, subject, date,
// Reply chain (NEW)
inReplyTo, references, messageId,
// Gmail metadata
snippet, labelIds, // labelIds is NEW
// Derived fields (NEW)
direction, // 'sent' | 'received'
fromEmail, domain,
// Filtering
blacklisted, blacklistReason
}
```
### Enhanced Contact Schema
```javascript
{
email, displayName, domain,
firstSeen, lastSeen,
messageCount, sentCount, receivedCount, // NEW: direction counts
reciprocityRatio, // NEW: sent/received ratio
threadCount, avgThreadDepth, // NEW: thread metrics
hasReplied, hasReceivedReply // NEW: engagement flags
}
```
### Next Steps
1. **Run enhanced extraction:**
```bash
node scripts/day-by-day-extract.mjs --full
```
2. **Analysis phase:** See MIN-518 (Relationship Analysis & Characterization Engine)
---
### Fix Plan: Day-by-Day Extraction (Designed 2026-02-08)
**Linear Issue:** [MIN-514](https://linear.app/hippochat/issue/MIN-514/implement-day-by-day-extraction-for-zero-gap-email-capture)
**Design Doc:** `DAY-BY-DAY-EXTRACTION-DESIGN.md`
**Strategy:** Query each day individually to guarantee zero gaps:
```
after:YYYY/MM/DD before:YYYY/MM/DD+1
```
**Why this works:**
- Each day is atomic (no boundary ambiguity)
- Every email has exactly one date
- Maximum ~500 emails/day (well within API limits)
**Time Estimate:** ~2.5 hours for 192K gap fill
**Gmail API Limits (via Context7):**
- Per-user: 15,000 quota units/minute
- messages.list/get: 5 units each
- Max sustainable rate: 30-40 messages/sec
**Implementation:** `scripts/day-by-day-extract.mjs` (to be built)
---
### Legacy Fix Plan (Superseded)
Run `fast-extract.mjs` with targeted year queries:
```bash
# Priority 1: 2008-2010 gap (70K emails)
node scripts/fast-extract.mjs -q "after:2010/01/01 before:2011/01/01" -m 40000
node scripts/fast-extract.mjs -q "after:2009/01/01 before:2010/01/01" -m 35000
node scripts/fast-extract.mjs -q "after:2008/01/01 before:2009/01/01" -m 10000
# Priority 2: Recent years (40K+ emails)
node scripts/fast-extract.mjs -q "after:2023/01/01 before:2024/01/01" -m 20000
node scripts/fast-extract.mjs -q "after:2024/01/01 before:2025/01/01" -m 20000
# Priority 3: Other gaps
node scripts/fast-extract.mjs -q "after:2021/01/01 before:2022/01/01" -m 15000
node scripts/fast-extract.mjs -q "after:2022/01/01 before:2023/01/01" -m 15000
```
---
## 🚀 EXTRACTION STATUS
### Previous Run (2026-02-08)
- **Batch 363:** 0 new (cron) - `before:2005` saturated
- **Unique IDs extracted:** 484,382 (63.8%)
- **Status:** ⚠️ Gaps identified - see investigation above
### Performance
| Approach | Rate | Speedup |
|----------|------|---------|
| mcporter (old) | 1.5/sec | — |
| fast-extract.mjs | **18-22/sec** | **12-15x** |
---
## Key Files
- `progress.json` — Machine-readable progress
- `scripts/fast-extract.mjs` — Fast single-batch extraction (OAuth refresh fixed!)
- `scripts/full-extraction.mjs` — Production wrapper
- `raw/fast-batch-*.jsonl` — Extraction batches
- `processed/domain-blacklist.json` — 159 filtered domains
- `processed/all-contacts.jsonl` — Accumulated contacts
## Recent Fixes
**2026-02-06:** Fixed OAuth token refresh in `fast-extract.mjs`
- Replaced interactive browser-based OAuth with Google OAuth2 API
- Now works reliably in cron context
- Automatically refreshes expired tokens during extraction
---
*Last updated: 2026-02-08 (batch 362 — EXTRACTION CONFIRMED COMPLETE — 484,382 unique emails extracted, 968 contacts)*