relationship-archaeology

Overview Tasks Crons Activity Chat Raw MD

Raw STATUS.md

# relationship-archaeology

**Status:** ✅ Extraction Complete  
**Phase:** Phase 1 — Full Extraction (Complete: 743,395 of 743,425 messages - 99.996%)  
**Last Activity:** 2026-02-09 (Merged all batches - extraction finished)

---

## Linear Metadata

**Project ID:** `a089175b-c479-4c07-996e-e2e70ea0238a`
**Team ID:** `96b685fe-2252-47c5-97ee-273d8c484942`
**Last Synced:** `2026-02-06T13:22:36.565Z`

## Overview

Analyze 750K+ Gmail messages to surface long-term relationships, communication patterns, and reconnection opportunities.

---

## 🚨 275K "MISSING" EMAILS INVESTIGATION (2026-02-08)

### Root Cause: FOUND ✅

**The 275K emails are NOT deleted/archived — they ARE accessible via Gmail API.**

Investigation confirmed Gmail API can access ALL 760,372 messages. The gap is due to **extraction query gaps**, NOT Gmail limitations.

### Year-by-Year Gap Analysis

| Year | Gmail | Extracted | Gap | % Captured |
|------|-------|-----------|-----|------------|
| 2004 | 16,059 | 9,785 | 6,274 | 60.9% ⚠️ |
| 2005 | 23,840 | 7,923 | 15,917 | 33.2% ⚠️ |
| 2006 | 15,679 | 6,310 | 9,369 | 40.2% ⚠️ |
| 2007 | 19,329 | 18,003 | 1,326 | 93.1% |
| **2008** | **7,560** | **1,601** | **5,959** | **21.2%** ⚠️ |
| **2009** | **30,498** | **1,690** | **28,808** | **5.5%** ⚠️ |
| **2010** | **35,270** | **1** | **35,269** | **0.0%** ⚠️ |
| 2011 | 77,491 | 64,636 | 12,855 | 83.4% ⚠️ |
| 2012 | 35,964 | 35,894 | 70 | 99.8% ✅ |
| 2013 | 43,909 | 41,793 | 2,116 | 95.2% ✅ |
| 2014 | 31,287 | 30,814 | 473 | 98.5% ✅ |
| 2015 | 29,026 | 27,830 | 1,196 | 95.9% ✅ |
| 2016 | 35,755 | 34,075 | 1,680 | 95.3% ✅ |
| 2017 | 46,988 | 43,693 | 3,295 | 93.0% ✅ |
| 2018 | 50,724 | 42,361 | 8,363 | 83.5% ⚠️ |
| 2019 | 34,517 | 31,486 | 3,031 | 91.2% ✅ |
| 2020 | 36,724 | 35,630 | 1,094 | 97.0% ✅ |
| 2021 | 36,457 | 26,296 | 10,161 | 72.1% ⚠️ |
| 2022 | 35,703 | 23,420 | 12,283 | 65.6% ⚠️ |
| 2023 | 33,816 | 18,446 | 15,370 | 54.5% ⚠️ |
| 2024 | 31,639 | 17,454 | 14,185 | 55.2% ⚠️ |
| 2025 | 32,030 | 30,514 | 1,516 | 95.3% ✅ |
| 2026 | 3,087 | 1,914 | 1,173 | 62.0% |
| **TOTAL** | **743,352** | **551,569** | **191,783** | **74.2%** |

### Worst Gaps (Priority Order)

1. **2010: 35,269 missing** (0% captured!) - Extraction jumped from 2011/04/02 to `before:2010`
2. **2009: 28,808 missing** (5.5% captured)
3. **2005: 15,917 missing** (33.2% captured)
4. **2023: 15,370 missing** (54.5% captured)
5. **2024: 14,185 missing** (55.2% captured)
6. **2011: 12,855 missing** (83.4% captured)
7. **2022: 12,283 missing** (65.6% captured)
8. **2021: 10,161 missing** (72.1% captured)

---

## ✅ EXTRACTION COMPLETE (2026-02-09)

### Final Results

**Master File:** `raw/all-extracted-deduplicated.jsonl`  
**Total Messages:** 743,395 (99.996% of 743,425 found)  
**Gap:** Only 30 messages missing (0.004%)  
**File Size:** ~442MB  
**Format:** JSONL (one JSON object per line)

### Enhanced Message Schema

Each line contains a JSON object with:

```json
{
  // Core identifiers
  "id": "f9ceb28f6fc5476",
  "threadId": "f9ceb28f6fc5476",
  
  // Headers
  "from": "Williams-Sonoma <customer@service.williams-sonoma.com>",
  "to": "jgirard01@sprynet.com",
  "cc": "",
  "bcc": "",
  "subject": "Order Confirmation",
  "date": "31 Dec 2003 23:51:04 +0000",
  
  // Reply chain metadata
  "inReplyTo": "",
  "references": "",
  "messageId": "<...>",
  
  // Gmail metadata
  "snippet": "Order# 033657322672 Dear John...",
  "labelIds": ["Label_3541103461816185034"],
  
  // Derived fields
  "direction": "received",  // or "sent" or "unknown"
  "fromEmail": "customer@service.williams-sonoma.com",
  "domain": "service.williams-sonoma.com",
  
  // Filtering
  "blacklisted": false,
  "blacklistReason": null
}
```

### Ready for Phase 2

With extraction complete, the project is ready for:
1. **Contact Aggregation** - Build unified contact database with relationship metrics
2. **Monthly Heat Maps** - Sum conversation volume by contact per month (2003-2026)
3. **Relationship Categorization** - Classify contacts (work, friend, family, romantic, etc.)
4. **Relationship Scoring** - Multi-dimensional analysis for reconnection discovery
5. **Visualization & Export** - Timeline analysis, heat maps, interactive dashboard

**Linear Epic:** [MIN-1097](https://linear.app/hippochat/issue/MIN-1097/phase-2-contact-aggregation-and-relationship-analysis)  
**Design Doc:** `docs/phase2-features.md`

---

## 🔄 DAY-BY-DAY EXTRACTION RUN (2026-02-09) [COMPLETED]

### Status: Resumed & Running ✅

**Original Start:** 2026-02-08 19:07 PST  
**Crashed:** ~2026-02-09 03:50 PST (at 700,184/743,425 - 94.2%)  
**Resumed:** 2026-02-09 04:45 PST  
**Current Progress:** 575,708 already extracted + 501/167,701 gaps being filled (77.5% total)  
**Last Updated:** 2026-02-09 04:56 PST  
**Process ID:** 43101  
**Failed Messages:** 0 (excellent API stability)

### Scan Phase Complete ✅
- **Total Messages Found:** 743,425 across 8,076 days
- **Date Range:** 2003-12-31 to 2026-02-07
- **Days with Messages:** 8,071

### Extraction Phase: Gap-Fill in Progress
- **Already Extracted:** 575,708 messages (found in partial files)
- **Missing (gaps):** 167,701 messages across 7,246 days
- **Currently Fetching:** 501/167,701 (0.3% of gaps, 77.5% overall)
- **Process Status:** Running (PID 43101) with adaptive concurrency
- **Checkpoint System:** ✅ Working (saves to `state/extract-state.jsonl`)
- **Estimated Completion:** ~8-10 hours at current rate

### Technical Fixes (2026-02-09)
- **Problem:** The extraction process was crashing due to excessive memory usage. The checkpoint file (`state/extract-state.json`) was being fully read into memory and rewritten on every checkpoint, causing a failure when the file grew too large.
- **Solution:** Re-engineered the checkpoint mechanism in `scripts/day-by-day-extract.mjs` to use a memory-efficient streaming format.
  - The checkpoint file is now treated as a JSONL stream. A `.json` extension is kept for simplicity, but the internal format is line-delimited.
  - Instead of reading the entire file, the script now reads it line by line.
  - Instead of rewriting the entire file, the script now appends only new records.
- **Impact:** This resolves the memory crash and allows the extraction to run to completion reliably.

### What Happened
1. **Original extraction** (Feb 8 19:07 - Feb 9 03:50): Extracted 700,184 messages before crashing/stopping
2. **Gap analysis** (Feb 9 04:45): Script identified 575,708 unique messages already extracted
3. **Resume started** (Feb 9 04:45): Now extracting 167,701 missing messages to complete dataset

### Key Findings
- Gmail API can access ALL 743,425 messages (confirmed via day-by-day scan)
- Extraction successfully wrote 700K+ messages to partial file before stopping
- Script's gap detection working perfectly - identified 167K missing messages
- Adaptive rate limiting preventing API issues (0 failures so far)

### Current Status (2026-02-09 04:56 PST)
1. ✅ **Extraction resumed successfully** - Process PID 43101, 0 failures
2. ⏳ **In progress** - 501/167,701 gaps (0.3%), 575,708 already done (77.5% overall)
3. 📊 **Rate stable** - Adaptive concurrency (20-50) managing API limits well
4. 💾 **Checkpoint active** - Auto-saves progress, safe to resume if interrupted again

### Next Steps After Completion
1. **Verify extraction** - Run `--verify` to confirm all 743,425 messages captured
2. **Aggregate contacts** - Run `--aggregate` phase for relationship analysis
3. **Run analysis** - Execute MIN-518 (Relationship Analysis Engine)

### Technical Notes
- **Script:** `scripts/day-by-day-extract.mjs --extract` (resumed from checkpoint)
- **Checkpoint:** `state/extract-state.jsonl` (re-engineered to fix memory crash)
- **Output:** `raw/enhanced-batch-[timestamp].jsonl` (written at completion)
- **Scan Results:** `state/day-scan-results.json` (19MB, 8,071 days scanned)
- **Process PID:** 98246 (stable, 0 failures so far)
- **Monitor:** Run `scripts/monitor-extraction.sh` to check progress

---

## 🚀 ENHANCED EXTRACTION SCHEMA (2026-02-08)

### Implementation Complete ✅

| Component | Status | Details |
|-----------|--------|---------|
| **Blacklist Fix** | ✅ | gmail.com removed (was incorrectly filtered) |
| **Direction Detection** | ✅ | Sent/received based on John's addresses |
| **CC/BCC Headers** | ✅ | Full cc/bcc lists captured |
| **Gmail Labels** | ✅ | labelIds array captured |
| **Thread References** | ✅ | In-Reply-To, References, Message-ID |
| **Contact Aggregation** | ✅ | Reciprocity, thread depth, CSV export |

### John's Email Addresses (Direction Detection)

```javascript
const JOHN_EMAILS = [
  'jgirard@gmail.com',
  'john@clickability.com',
  'jgirard01@sprynet.com',
  'jgirard@socialrep.net',
  'jgirard@stanfordalumni.org',
  'john.girard@embarke.com',
  'jgirard@mtell.com',
  'jgirard@manumatix.com',
  'john@pacificcoasttriage.com',
  'johng@boxxcorp.com',
  'john@cience.com',
  'meeting@tungle.com',
  'jg@vitaledge.io',
  'jgirard@zingle.me',
  'jgirard01@icloud.com',
  'jgirard@zingleme.com',
  'john@echopath.co',
  'jg@tastradesoft.com',
  'jgirard@alumni.stanford.edu',
  'jgirard@alumni.stanford.org',
  'john@fluxbanking.com',
  'john@writeforme.io',
  'jgirard@connectifier.com',
  'john@blocktalk.com',
  'johng@nvrmnd.life',
  'john@cience.net',
  'john.girard@dbaplatform.com',
  'jgirard@brightloop.com',
  'john@leasepace.com'
];
```

### Enhanced Message Schema

```javascript
{
  // Core identifiers
  id, threadId,
  
  // Headers (enhanced)
  from, to, cc, bcc, subject, date,
  
  // Reply chain (NEW)
  inReplyTo, references, messageId,
  
  // Gmail metadata
  snippet, labelIds,  // labelIds is NEW
  
  // Derived fields (NEW)
  direction,  // 'sent' | 'received'
  fromEmail, domain,
  
  // Filtering
  blacklisted, blacklistReason
}
```

### Enhanced Contact Schema

```javascript
{
  email, displayName, domain,
  firstSeen, lastSeen,
  messageCount, sentCount, receivedCount,  // NEW: direction counts
  reciprocityRatio,  // NEW: sent/received ratio
  threadCount, avgThreadDepth,  // NEW: thread metrics
  hasReplied, hasReceivedReply  // NEW: engagement flags
}
```

### Next Steps

1. **Run enhanced extraction:**
   ```bash
   node scripts/day-by-day-extract.mjs --full
   ```

2. **Analysis phase:** See MIN-518 (Relationship Analysis & Characterization Engine)

---

### Fix Plan: Day-by-Day Extraction (Designed 2026-02-08)

**Linear Issue:** [MIN-514](https://linear.app/hippochat/issue/MIN-514/implement-day-by-day-extraction-for-zero-gap-email-capture)

**Design Doc:** `DAY-BY-DAY-EXTRACTION-DESIGN.md`

**Strategy:** Query each day individually to guarantee zero gaps:
```
after:YYYY/MM/DD before:YYYY/MM/DD+1
```

**Why this works:**
- Each day is atomic (no boundary ambiguity)
- Every email has exactly one date
- Maximum ~500 emails/day (well within API limits)

**Time Estimate:** ~2.5 hours for 192K gap fill

**Gmail API Limits (via Context7):**
- Per-user: 15,000 quota units/minute
- messages.list/get: 5 units each
- Max sustainable rate: 30-40 messages/sec

**Implementation:** `scripts/day-by-day-extract.mjs` (to be built)

---

### Legacy Fix Plan (Superseded)

Run `fast-extract.mjs` with targeted year queries:
```bash
# Priority 1: 2008-2010 gap (70K emails)
node scripts/fast-extract.mjs -q "after:2010/01/01 before:2011/01/01" -m 40000
node scripts/fast-extract.mjs -q "after:2009/01/01 before:2010/01/01" -m 35000
node scripts/fast-extract.mjs -q "after:2008/01/01 before:2009/01/01" -m 10000

# Priority 2: Recent years (40K+ emails)
node scripts/fast-extract.mjs -q "after:2023/01/01 before:2024/01/01" -m 20000
node scripts/fast-extract.mjs -q "after:2024/01/01 before:2025/01/01" -m 20000

# Priority 3: Other gaps
node scripts/fast-extract.mjs -q "after:2021/01/01 before:2022/01/01" -m 15000
node scripts/fast-extract.mjs -q "after:2022/01/01 before:2023/01/01" -m 15000
```

---

## 🚀 EXTRACTION STATUS

### Previous Run (2026-02-08)
- **Batch 363:** 0 new (cron) - `before:2005` saturated
- **Unique IDs extracted:** 484,382 (63.8%)
- **Status:** ⚠️ Gaps identified - see investigation above

### Performance
| Approach | Rate | Speedup |
|----------|------|---------|
| mcporter (old) | 1.5/sec | — |
| fast-extract.mjs | **18-22/sec** | **12-15x** |

---

## Key Files

- `progress.json` — Machine-readable progress
- `scripts/fast-extract.mjs` — Fast single-batch extraction (OAuth refresh fixed!)
- `scripts/full-extraction.mjs` — Production wrapper
- `raw/fast-batch-*.jsonl` — Extraction batches
- `processed/domain-blacklist.json` — 159 filtered domains
- `processed/all-contacts.jsonl` — Accumulated contacts

## Recent Fixes

**2026-02-06:** Fixed OAuth token refresh in `fast-extract.mjs`
- Replaced interactive browser-based OAuth with Google OAuth2 API
- Now works reliably in cron context
- Automatically refreshes expired tokens during extraction

---

*Last updated: 2026-02-08 (batch 362 — EXTRACTION CONFIRMED COMPLETE — 484,382 unique emails extracted, 968 contacts)*

Last activity: 2026-04-02

Loading project...

Raw STATUS.md

# relationship-archaeology **Status:** ✅ Extraction Complete **Phase:** Phase 1 — Full Extraction (Complete: 743,395 of 743,425 messages - 99.996%) **Last Activity:** 2026-02-09 (Merged all batches - extraction finished) --- ## Linear Metadata **Project ID:** `a089175b-c479-4c07-996e-e2e70ea0238a` **Team ID:** `96b685fe-2252-47c5-97ee-273d8c484942` **Last Synced:** `2026-02-06T13:22:36.565Z` ## Overview Analyze 750K+ Gmail messages to surface long-term relationships, communication patterns, and reconnection opportunities. --- ## 🚨 275K "MISSING" EMAILS INVESTIGATION (2026-02-08) ### Root Cause: FOUND ✅ **The 275K emails are NOT deleted/archived — they ARE accessible via Gmail API.** Investigation confirmed Gmail API can access ALL 760,372 messages. The gap is due to **extraction query gaps**, NOT Gmail limitations. ### Year-by-Year Gap Analysis | Year | Gmail | Extracted | Gap | % Captured | |------|-------|-----------|-----|------------| | 2004 | 16,059 | 9,785 | 6,274 | 60.9% ⚠️ | | 2005 | 23,840 | 7,923 | 15,917 | 33.2% ⚠️ | | 2006 | 15,679 | 6,310 | 9,369 | 40.2% ⚠️ | | 2007 | 19,329 | 18,003 | 1,326 | 93.1% | | **2008** | **7,560** | **1,601** | **5,959** | **21.2%** ⚠️ | | **2009** | **30,498** | **1,690** | **28,808** | **5.5%** ⚠️ | | **2010** | **35,270** | **1** | **35,269** | **0.0%** ⚠️ | | 2011 | 77,491 | 64,636 | 12,855 | 83.4% ⚠️ | | 2012 | 35,964 | 35,894 | 70 | 99.8% ✅ | | 2013 | 43,909 | 41,793 | 2,116 | 95.2% ✅ | | 2014 | 31,287 | 30,814 | 473 | 98.5% ✅ | | 2015 | 29,026 | 27,830 | 1,196 | 95.9% ✅ | | 2016 | 35,755 | 34,075 | 1,680 | 95.3% ✅ | | 2017 | 46,988 | 43,693 | 3,295 | 93.0% ✅ | | 2018 | 50,724 | 42,361 | 8,363 | 83.5% ⚠️ | | 2019 | 34,517 | 31,486 | 3,031 | 91.2% ✅ | | 2020 | 36,724 | 35,630 | 1,094 | 97.0% ✅ | | 2021 | 36,457 | 26,296 | 10,161 | 72.1% ⚠️ | | 2022 | 35,703 | 23,420 | 12,283 | 65.6% ⚠️ | | 2023 | 33,816 | 18,446 | 15,370 | 54.5% ⚠️ | | 2024 | 31,639 | 17,454 | 14,185 | 55.2% ⚠️ | | 2025 | 32,030 | 30,514 | 1,516 | 95.3% ✅ | | 2026 | 3,087 | 1,914 | 1,173 | 62.0% | | **TOTAL** | **743,352** | **551,569** | **191,783** | **74.2%** | ### Worst Gaps (Priority Order) 1. **2010: 35,269 missing** (0% captured!) - Extraction jumped from 2011/04/02 to `before:2010` 2. **2009: 28,808 missing** (5.5% captured) 3. **2005: 15,917 missing** (33.2% captured) 4. **2023: 15,370 missing** (54.5% captured) 5. **2024: 14,185 missing** (55.2% captured) 6. **2011: 12,855 missing** (83.4% captured) 7. **2022: 12,283 missing** (65.6% captured) 8. **2021: 10,161 missing** (72.1% captured) --- ## ✅ EXTRACTION COMPLETE (2026-02-09) ### Final Results **Master File:** `raw/all-extracted-deduplicated.jsonl` **Total Messages:** 743,395 (99.996% of 743,425 found) **Gap:** Only 30 messages missing (0.004%) **File Size:** ~442MB **Format:** JSONL (one JSON object per line) ### Enhanced Message Schema Each line contains a JSON object with: ```json { // Core identifiers "id": "f9ceb28f6fc5476", "threadId": "f9ceb28f6fc5476", // Headers "from": "Williams-Sonoma <customer@service.williams-sonoma.com>", "to": "jgirard01@sprynet.com", "cc": "", "bcc": "", "subject": "Order Confirmation", "date": "31 Dec 2003 23:51:04 +0000", // Reply chain metadata "inReplyTo": "", "references": "", "messageId": "<...>", // Gmail metadata "snippet": "Order# 033657322672 Dear John...", "labelIds": ["Label_3541103461816185034"], // Derived fields "direction": "received", // or "sent" or "unknown" "fromEmail": "customer@service.williams-sonoma.com", "domain": "service.williams-sonoma.com", // Filtering "blacklisted": false, "blacklistReason": null } ``` ### Ready for Phase 2 With extraction complete, the project is ready for: 1. **Contact Aggregation** - Build unified contact database with relationship metrics 2. **Monthly Heat Maps** - Sum conversation volume by contact per month (2003-2026) 3. **Relationship Categorization** - Classify contacts (work, friend, family, romantic, etc.) 4. **Relationship Scoring** - Multi-dimensional analysis for reconnection discovery 5. **Visualization & Export** - Timeline analysis, heat maps, interactive dashboard **Linear Epic:** [MIN-1097](https://linear.app/hippochat/issue/MIN-1097/phase-2-contact-aggregation-and-relationship-analysis) **Design Doc:** `docs/phase2-features.md` --- ## 🔄 DAY-BY-DAY EXTRACTION RUN (2026-02-09) [COMPLETED] ### Status: Resumed & Running ✅ **Original Start:** 2026-02-08 19:07 PST **Crashed:** ~2026-02-09 03:50 PST (at 700,184/743,425 - 94.2%) **Resumed:** 2026-02-09 04:45 PST **Current Progress:** 575,708 already extracted + 501/167,701 gaps being filled (77.5% total) **Last Updated:** 2026-02-09 04:56 PST **Process ID:** 43101 **Failed Messages:** 0 (excellent API stability) ### Scan Phase Complete ✅ - **Total Messages Found:** 743,425 across 8,076 days - **Date Range:** 2003-12-31 to 2026-02-07 - **Days with Messages:** 8,071 ### Extraction Phase: Gap-Fill in Progress - **Already Extracted:** 575,708 messages (found in partial files) - **Missing (gaps):** 167,701 messages across 7,246 days - **Currently Fetching:** 501/167,701 (0.3% of gaps, 77.5% overall) - **Process Status:** Running (PID 43101) with adaptive concurrency - **Checkpoint System:** ✅ Working (saves to `state/extract-state.jsonl`) - **Estimated Completion:** ~8-10 hours at current rate ### Technical Fixes (2026-02-09) - **Problem:** The extraction process was crashing due to excessive memory usage. The checkpoint file (`state/extract-state.json`) was being fully read into memory and rewritten on every checkpoint, causing a failure when the file grew too large. - **Solution:** Re-engineered the checkpoint mechanism in `scripts/day-by-day-extract.mjs` to use a memory-efficient streaming format. - The checkpoint file is now treated as a JSONL stream. A `.json` extension is kept for simplicity, but the internal format is line-delimited. - Instead of reading the entire file, the script now reads it line by line. - Instead of rewriting the entire file, the script now appends only new records. - **Impact:** This resolves the memory crash and allows the extraction to run to completion reliably. ### What Happened 1. **Original extraction** (Feb 8 19:07 - Feb 9 03:50): Extracted 700,184 messages before crashing/stopping 2. **Gap analysis** (Feb 9 04:45): Script identified 575,708 unique messages already extracted 3. **Resume started** (Feb 9 04:45): Now extracting 167,701 missing messages to complete dataset ### Key Findings - Gmail API can access ALL 743,425 messages (confirmed via day-by-day scan) - Extraction successfully wrote 700K+ messages to partial file before stopping - Script's gap detection working perfectly - identified 167K missing messages - Adaptive rate limiting preventing API issues (0 failures so far) ### Current Status (2026-02-09 04:56 PST) 1. ✅ **Extraction resumed successfully** - Process PID 43101, 0 failures 2. ⏳ **In progress** - 501/167,701 gaps (0.3%), 575,708 already done (77.5% overall) 3. 📊 **Rate stable** - Adaptive concurrency (20-50) managing API limits well 4. 💾 **Checkpoint active** - Auto-saves progress, safe to resume if interrupted again ### Next Steps After Completion 1. **Verify extraction** - Run `--verify` to confirm all 743,425 messages captured 2. **Aggregate contacts** - Run `--aggregate` phase for relationship analysis 3. **Run analysis** - Execute MIN-518 (Relationship Analysis Engine) ### Technical Notes - **Script:** `scripts/day-by-day-extract.mjs --extract` (resumed from checkpoint) - **Checkpoint:** `state/extract-state.jsonl` (re-engineered to fix memory crash) - **Output:** `raw/enhanced-batch-[timestamp].jsonl` (written at completion) - **Scan Results:** `state/day-scan-results.json` (19MB, 8,071 days scanned) - **Process PID:** 98246 (stable, 0 failures so far) - **Monitor:** Run `scripts/monitor-extraction.sh` to check progress --- ## 🚀 ENHANCED EXTRACTION SCHEMA (2026-02-08) ### Implementation Complete ✅ | Component | Status | Details | |-----------|--------|---------| | **Blacklist Fix** | ✅ | gmail.com removed (was incorrectly filtered) | | **Direction Detection** | ✅ | Sent/received based on John's addresses | | **CC/BCC Headers** | ✅ | Full cc/bcc lists captured | | **Gmail Labels** | ✅ | labelIds array captured | | **Thread References** | ✅ | In-Reply-To, References, Message-ID | | **Contact Aggregation** | ✅ | Reciprocity, thread depth, CSV export | ### John's Email Addresses (Direction Detection) ```javascript const JOHN_EMAILS = [ 'jgirard@gmail.com', 'john@clickability.com', 'jgirard01@sprynet.com', 'jgirard@socialrep.net', 'jgirard@stanfordalumni.org', 'john.girard@embarke.com', 'jgirard@mtell.com', 'jgirard@manumatix.com', 'john@pacificcoasttriage.com', 'johng@boxxcorp.com', 'john@cience.com', 'meeting@tungle.com', 'jg@vitaledge.io', 'jgirard@zingle.me', 'jgirard01@icloud.com', 'jgirard@zingleme.com', 'john@echopath.co', 'jg@tastradesoft.com', 'jgirard@alumni.stanford.edu', 'jgirard@alumni.stanford.org', 'john@fluxbanking.com', 'john@writeforme.io', 'jgirard@connectifier.com', 'john@blocktalk.com', 'johng@nvrmnd.life', 'john@cience.net', 'john.girard@dbaplatform.com', 'jgirard@brightloop.com', 'john@leasepace.com' ]; ``` ### Enhanced Message Schema ```javascript { // Core identifiers id, threadId, // Headers (enhanced) from, to, cc, bcc, subject, date, // Reply chain (NEW) inReplyTo, references, messageId, // Gmail metadata snippet, labelIds, // labelIds is NEW // Derived fields (NEW) direction, // 'sent' | 'received' fromEmail, domain, // Filtering blacklisted, blacklistReason } ``` ### Enhanced Contact Schema ```javascript { email, displayName, domain, firstSeen, lastSeen, messageCount, sentCount, receivedCount, // NEW: direction counts reciprocityRatio, // NEW: sent/received ratio threadCount, avgThreadDepth, // NEW: thread metrics hasReplied, hasReceivedReply // NEW: engagement flags } ``` ### Next Steps 1. **Run enhanced extraction:** ```bash node scripts/day-by-day-extract.mjs --full ``` 2. **Analysis phase:** See MIN-518 (Relationship Analysis & Characterization Engine) --- ### Fix Plan: Day-by-Day Extraction (Designed 2026-02-08) **Linear Issue:** [MIN-514](https://linear.app/hippochat/issue/MIN-514/implement-day-by-day-extraction-for-zero-gap-email-capture) **Design Doc:** `DAY-BY-DAY-EXTRACTION-DESIGN.md` **Strategy:** Query each day individually to guarantee zero gaps: ``` after:YYYY/MM/DD before:YYYY/MM/DD+1 ``` **Why this works:** - Each day is atomic (no boundary ambiguity) - Every email has exactly one date - Maximum ~500 emails/day (well within API limits) **Time Estimate:** ~2.5 hours for 192K gap fill **Gmail API Limits (via Context7):** - Per-user: 15,000 quota units/minute - messages.list/get: 5 units each - Max sustainable rate: 30-40 messages/sec **Implementation:** `scripts/day-by-day-extract.mjs` (to be built) --- ### Legacy Fix Plan (Superseded) Run `fast-extract.mjs` with targeted year queries: ```bash # Priority 1: 2008-2010 gap (70K emails) node scripts/fast-extract.mjs -q "after:2010/01/01 before:2011/01/01" -m 40000 node scripts/fast-extract.mjs -q "after:2009/01/01 before:2010/01/01" -m 35000 node scripts/fast-extract.mjs -q "after:2008/01/01 before:2009/01/01" -m 10000 # Priority 2: Recent years (40K+ emails) node scripts/fast-extract.mjs -q "after:2023/01/01 before:2024/01/01" -m 20000 node scripts/fast-extract.mjs -q "after:2024/01/01 before:2025/01/01" -m 20000 # Priority 3: Other gaps node scripts/fast-extract.mjs -q "after:2021/01/01 before:2022/01/01" -m 15000 node scripts/fast-extract.mjs -q "after:2022/01/01 before:2023/01/01" -m 15000 ``` --- ## 🚀 EXTRACTION STATUS ### Previous Run (2026-02-08) - **Batch 363:** 0 new (cron) - `before:2005` saturated - **Unique IDs extracted:** 484,382 (63.8%) - **Status:** ⚠️ Gaps identified - see investigation above ### Performance | Approach | Rate | Speedup | |----------|------|---------| | mcporter (old) | 1.5/sec | — | | fast-extract.mjs | **18-22/sec** | **12-15x** | --- ## Key Files - `progress.json` — Machine-readable progress - `scripts/fast-extract.mjs` — Fast single-batch extraction (OAuth refresh fixed!) - `scripts/full-extraction.mjs` — Production wrapper - `raw/fast-batch-*.jsonl` — Extraction batches - `processed/domain-blacklist.json` — 159 filtered domains - `processed/all-contacts.jsonl` — Accumulated contacts ## Recent Fixes **2026-02-06:** Fixed OAuth token refresh in `fast-extract.mjs` - Replaced interactive browser-based OAuth with Google OAuth2 API - Now works reliably in cron context - Automatically refreshes expired tokens during extraction --- *Last updated: 2026-02-08 (batch 362 — EXTRACTION CONFIRMED COMPLETE — 484,382 unique emails extracted, 968 contacts)*