# nvrbot-scraper
**Status:** 🟢 Active
**Phase:** Daily Scraping + Cleanup Pipeline Implemented
**Last Activity:** 2026-03-11
---
## Linear Metadata
**Project ID:** `17c6d70c-e194-41f7-a7e9-644fd91a60b8`
**Team ID:** `96b685fe-2252-47c5-97ee-273d8c484942`
**Last Synced:** `2026-02-06T13:21:08.582Z`
## Overview
Web scraper for extracting class schedule data from competitor wellness studios (sauna, cold plunge, meditation). Used for NVRMND competitive analysis: tracking capacity, instructors, fill rates, class types.
**Source Code:** `/home/john/projects/superscaper`
---
## Scraper Status (2026-03-27)
**Last Update:** 2026-03-27 19:51 PST
**Status:** ✅ **ALL CURRENT - Daily Scraping Active**
**Mode:** Daily forward scraping (nightly at 11pm PT) with auto-merge to master.json
| Studio | Last Scraped | Status | Notes |
|--------|--------------|--------|-------|
| BD_CARLSBAD | 2026-03-27 | ✅ Current | Backfill complete Nov 2024 → Mar 2026 |
| OS_ADELAIDE | 2026-03-27 | ✅ Current | Backfill complete (split from OS_TORONTO) |
| OS_YORKVILLE | 2026-03-27 | ✅ Current | Backfill complete (split from OS_TORONTO) |
| OS_FLATIRON | 2026-03-27 | ✅ Current | Backfill complete Nov 2024 → Mar 2026 |
| BD_LIBERTY | 2026-03-27 | ✅ Current | Backfill complete Dec 2025 → Mar 2026 |
| OS_WILLIAMSBURG | 2026-03-27 | ✅ Current | Backfill complete Nov 2024 → Mar 2026 |
### Daily Scraping
**Last successful scrape:** 2026-03-27 19:51:00 PST
**Records collected:** 121 new records (100% valid, 0 errors)
**Output file:** `nvrbot_scrape_20260327.json`
**Errors:** None
**Breakdown by studio:**
- BD Carlsbad: 22 classes
- BD Liberty Station: 20 classes
- OtherShip Adelaide: 17 classes
- OtherShip Flatiron: 21 classes
- OtherShip Williamsburg: 23 classes
- OtherShip Yorkville: 18 classes
### Master Data
**Location:** `/home/john/projects/superscaper/processed/master.json`
**Total records:** ~99,729
**Date range:** 2021-07-05 → 2026-02-14
**File size:** ~78.9 MB
**Last merged:** 2026-02-14 23:02 PST
**Auto-merge:** ✅ Enabled (runs after each scrape)
**Removed:** MZ_MYRTLE (no classes found on platform), ST_YONGE, ST_FRONT (no availability data exposed)
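For reference, a minimal sketch of what the nightly auto-merge conceptually does (the real logic lives in `scripts/merge_to_master.py`; the composite key follows the dedup scheme documented below):

```python
import json

def composite_id(rec: dict) -> str:
    """Composite key per the documented scheme (company_location_date_time_class)."""
    return f"{rec['company']}_{rec['location']}_{rec['classDate']}_{rec['time']}_{rec['class']}"

def merge_to_master(master_path: str, scrape_path: str) -> None:
    """Merge a daily scrape into master.json; newest record wins per key."""
    with open(master_path) as f:
        merged = {composite_id(r): r for r in json.load(f)}
    with open(scrape_path) as f:
        for rec in json.load(f):
            merged[composite_id(rec)] = rec  # daily scrape overwrites older snapshot
    with open(master_path, "w") as f:
        json.dump(list(merged.values()), f)
```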
**Spot Check Results (Jan 20 - Feb 2, 2026):**
- 1,730 records scraped across 6 studios
- 100% validation pass rate (0 errors)
- Adelaide/Yorkville separation confirmed
- Output format verified
**Critical Bug Fixed (2026-02-03 18:20):**
- ⚠️ **Data loss bug:** Scraper was overwriting the same output file when run multiple times per day
## Data Locations (Complete Inventory)
### Master Data (Single Source of Truth)
| File | Records | Date Range | Size | Last Updated |
|------|---------|------------|------|--------------|
| **`processed/master.json`** | **99,488** | **2021-07-05 → 2026-02-13** | **78.8 MB** | **2026-02-14 09:52 PST** |
✅ **This is the canonical local dataset** – auto-updated nightly after each scrape.
### Daily Scrape Files (superscaper directory)
| File | Records | Date Range | Notes |
|------|---------|------------|-------|
| `nvrbot_scrape_20260213.json` | 392 | Feb 11-13, 2026 | Latest daily scrape |
| `nvrbot_scrape_20241106.json` | 13,838 | May 2 → Nov 4, 2024 | Historical backfill |
| `nvrbot_scrape_20260204.json` | 35,679 | Jan 1 → Feb 3, 2026 | 2025 full year backfill |
**Path:** `/home/john/projects/superscaper/`
### Legacy Files (Historical)
| File | Records | Date Range | Notes |
|------|---------|------------|-------|
| `BDfull.json` | 4,099 | May 2 → Nov 5, 2024 | BD legacy export |
| `OSfull.json` | 5,679 | May 2 → Nov 5, 2024 | OS legacy export |
**Note:** Legacy files superseded by `processed/master.json`
### Supabase (Production Database)
**Database:** `classes` table at `ootepdsivzlhqhaielor.supabase.co`
**Auto-sync:** ✅ Enabled (upsert after each nightly scrape)
**Deduplication:** Composite key (studio_id + class_date + time + class)
**Current records:** ~99,488 (auto-updated nightly)
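A hypothetical version of that nightly sync using the `supabase-py` client (not the actual `src/supabase_sync.py`; assumes records already carry the composite-key columns and the table has a matching unique constraint):

```python
import json
import os

from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

with open("nvrbot_scrape_20260327.json") as f:
    records = json.load(f)

# Upsert so re-scraped classes update in place instead of duplicating
supabase.table("classes").upsert(
    records, on_conflict="studio_id,class_date,time,class"
).execute()
```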
### Google Drive - NVRMND Central (Historical/Deprecated)
**Shared Drive:** `NVRMND Central` (ID: `0AEKvqTBXqb6XUk9PVA`)
**Location:** `Data Dump > JG > 04 Data & Exports > Active`
| File | Type | ID | Records | Notes |
|------|------|-----|---------|-------|
| **nvrbot** | Google Sheet | `1oE9MkZzYA4FqODqi50cBQ5GlqBWYRHDoGXVm9tJIEGI` | 47,832 | ⚠️ **Not auto-updated** (historical) |
| nvrbot_scrape_20241106.csv | CSV | `1hV5HUKt0M1DVyKH2qz2u3igJKLbFpKDi` | 13,838 | Raw backup |
| BDfull | Google Sheet | `1OzW03kQu6SHsVrXdJQmj4qQZzxO_5X-QUj6xdf3rU_s` | – | BD cleaned |
| OSfull.csv | CSV | `1BrFc6-1dBUe36OD-IWLffBpf7wQScaaE` | 5,679 | OS raw |
| MZfull.csv | CSV | `12y6Q3iIeCbyus-YN36amqW_Yvbx0oGID` | ~2,500 | MZ raw (Jun 2024) |
**Note:** Google Sheets are no longer auto-updated. Use Supabase or `processed/master.json` for current data.
## Existing Cleanup Schema (from nvrbot Google Sheet)
### Sheet Structure
| Tab | Rows | Purpose |
|-----|------|---------|
| **fullData** | 47,832 | Main aggregated cleaned data |
| **OS done** | 16,117 | OtherShip cleaned (filter: excludes Yorkville) |
| **BD done** | 25,825 | Breathe Degrees cleaned (filter: excludes Free Flow/Social) |
| **MZ done** | 3,076 | MindZero cleaned |
| **nvrbot inputs** | 1,007 | Raw input staging |
| **locationMap** | 7 | Location normalization lookup |
| **timeMap** | ~100 | Time → military time lookup |
| **UPDATE** | 11,044 | Recent update staging |
| **new data** | 1,903 | New data staging |
### Cleaned Schema (fullData)
| Column | Type | Example | Notes |
|--------|------|---------|-------|
| company | string | "Othership" | Normalized company name |
| classDate | string | "01-31-2022" | MM-DD-YYYY format |
| day | string | "Monday" | Day of week |
| time | string | "3:00 PM" | 12-hour format |
| duration | string | "75" | Minutes (as string) |
| class | string | "Free Flow" | Class name |
| location | string | "Adelaide" | Raw location name |
| type | string | "Free Flow" | Class type category |
| fill | string | "14/17 Open" | Combined availability string |
| classStatus | string | "" | Status field |
| open | string | "14" | Open spots (as string) |
| total | string | "17" | Total spots (as string) |
| filled | string | "3" | Filled spots (as string) |
| instructor | string | "Othership Guide" | Primary instructor |
| instructor2 | string | "" | Secondary instructor |
| instructor3 | string | "" | Third instructor |
| room | string | "Sauna" | Room name |
| url | string | "https://..." | Source URL |
| yoga | string | "" | Yoga class flag |
| saunadotcnt | string | "1" | Counter field |
### Lookup Tables
**locationMap:**
| locationID | location |
|------------|----------|
| Adelaide | OtherShip - Adelaide |
| Yorkville | OtherShip - Yorkville |
| Carlsbad Studio | Breathe Degrees - Carlsbad |
| Myrtle Beach | Mind Zero - Myrtle Beach |
| Mt. Pleasant | Mind Zero - Mt. Pleasant |
| Flatiron | OtherShip - Flatiron |
**timeMap:**
| time | milTime | hour |
|------|---------|------|
| 1:00 PM | 1300 | 13 |
| 10:00 AM | 1000 | 10 |
| ... | ... | ... |
---
## Second-Level Cleanup Spec (DRAFT)
### Goals
1. **Deduplicate** raw data (re-use scraper's deduplication logic)
2. **Normalize** raw scraper output to consistent schema
3. **Match** existing Google Sheet format for continuity
4. **Automate** what was previously done manually
5. **Add** new computed fields for analysis
**Note on Deduplication:** The cleanup pipeline should use the same `generate_record_id()` and `deduplicate_records()` functions from `src/main.py`. This ensures consistency between scraper and cleanup deduplication logic.
### Input Format (Raw Scraper Output)
```json
{
  "time": "8:00 AM",
  "duration": "75",
  "location": "Williamsburg",
  "class": "Guided Down: Senses",
  "instructor": "Becca Jacobs",
  "room": "Sauna",
  "classDate": "01-31-2026",
  "day": "Saturday",
  "open_spots": 58,
  "total_spots": 64,
  "filled_spots": 6,
  "status": "open",
  "type": "Class",
  "company": "OtherShip",
  "url": "https://..."
}
```
### Output Format (Cleaned)
**Option A: Match Existing Google Sheet Schema**
```json
{
  "company": "Othership",
  "classDate": "01-31-2026",
  "day": "Saturday",
  "time": "8:00 AM",
  "duration": "75",
  "class": "Guided Down: Senses",
  "location": "Williamsburg",
  "type": "Guided",
  "fill": "58/64 Open",
  "classStatus": "open",
  "open": "58",
  "total": "64",
  "filled": "6",
  "instructor": "Becca Jacobs",
  "instructor2": "",
  "instructor3": "",
  "room": "Sauna",
  "url": "https://...",
  "yoga": "",
  "saunadotcnt": ""
}
```
**Option B: Enhanced Schema (New)**
```json
{
  "id": "OS_WILLIAMSBURG_2026-01-31_0800",
  "studio_id": "OS_WILLIAMSBURG",
  "company": "OtherShip",
  "location_normalized": "OtherShip - Williamsburg",
  "date_iso": "2026-01-31",
  "date_display": "01-31-2026",
  "day_of_week": "Saturday",
  "time_12h": "8:00 AM",
  "time_24h": "08:00",
  "hour": 8,
  "duration_min": 75,
  "class_name": "Guided Down: Senses",
  "class_type": "Guided",
  "class_category": "Sauna",
  "instructor": "Becca Jacobs",
  "instructor_normalized": "becca_jacobs",
  "instructor2": null,
  "instructor3": null,
  "room": "Sauna",
  "open_spots": 58,
  "total_spots": 64,
  "filled_spots": 6,
  "fill_rate": 0.094,
  "fill_display": "58/64 Open",
  "status": "open",
  "is_waitlist": false,
  "is_full": false,
  "url": "https://...",
  "scraped_at": "2026-02-01T16:23:00Z"
}
```
### Transformations Required
**Pipeline order** (run in sequence):
| Step | Transformation | Complexity | Notes |
|------|----------------|------------|-------|
| **0. Deduplication** | Remove duplicate records | Low | **FIRST STEP** - Use scraper's `generate_record_id()` logic |
| **1. Company** | Normalize capitalization | Low | "OtherShip" → "Othership" |
| **2. Date** | Parse MM-DD-YYYY → ISO | Low | "01-31-2026" → "2026-01-31" |
| **3. Time** | Parse to 24h, extract hour | Low | "8:00 AM" → "08:00", hour: 8 |
| **4. Duration** | String → int | Low | "75" → 75 |
| **5. Location** | Lookup → normalized name | Medium | "Williamsburg" → "OtherShip - Williamsburg" |
| **6. Type** | Extract from class name | Medium | "Guided Down" → "Guided" |
| **7. Instructor** | Split multiple, normalize | Medium | "Arkaya \| Elly" → instructor, instructor2 |
| **8. Fill rate** | Calculate filled/total | Low | 6/64 → 0.094 |
| **9. Class category** | Infer from room/class name | Medium | room: "Sauna" → category: "Sauna" |
| **10. Yoga flag** | Pattern match class name | Low | "Yoga Flow" → yoga: "Y" |
**Critical:** Deduplication MUST run first to prevent duplicate data in cleaned output. Subsequent steps operate on unique records only.
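To make the low-complexity steps concrete, here is an illustrative sketch (not the shipped pipeline) covering steps 1–4 and 8, with field names taken from the raw and enhanced schemas above:

```python
from datetime import datetime

def transform(rec: dict) -> dict:
    """Apply the simple transformations from the table above (sketch only)."""
    out = dict(rec)
    out["company"] = rec["company"].capitalize()          # "OtherShip" -> "Othership"
    dt = datetime.strptime(rec["classDate"], "%m-%d-%Y")  # "01-31-2026"
    out["date_iso"] = dt.strftime("%Y-%m-%d")             # "2026-01-31"
    t = datetime.strptime(rec["time"], "%I:%M %p")        # "8:00 AM"
    out["time_24h"] = t.strftime("%H:%M")                 # "08:00"
    out["hour"] = t.hour                                  # 8
    out["duration_min"] = int(rec["duration"])            # "75" -> 75
    total = rec.get("total_spots") or 0                   # guard against 0/0/0 records
    out["fill_rate"] = round(rec["filled_spots"] / total, 3) if total else None
    return out
```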
### Class Type Taxonomy (Needs Validation)
Based on existing data patterns:
| Type | Pattern | Examples |
|------|---------|----------|
| **Free Flow** | "Free Flow", "Open" | Open sessions, self-guided |
| **Guided** | "Guided Down", "Guided Up", "Guided All Around" | Instructor-led sauna sessions |
| **Class** | Specific class names | Yoga, HIIT, Breathwork |
| **Social** | "Social" | Community events |
| **Private** | "Private" | Private bookings |
| **Online** | "Online", "Virtual" | Remote classes |
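As a starting point, a hypothetical classifier for this taxonomy (the pattern list would need validating against real class names before use):

```python
# Ordered patterns: first match wins; anything unmatched falls through to "Class"
TYPE_PATTERNS = [
    ("Free Flow", ("free flow", "open")),
    ("Guided", ("guided down", "guided up", "guided all around")),
    ("Social", ("social",)),
    ("Private", ("private",)),
    ("Online", ("online", "virtual")),
]

def classify_type(class_name: str) -> str:
    name = class_name.lower()
    for type_label, patterns in TYPE_PATTERNS:
        if any(p in name for p in patterns):
            return type_label
    return "Class"  # specific class names (Yoga, HIIT, Breathwork)
```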
### Instructor Normalization
**Challenges:**
- Multiple instructors in one field: "Arkaya | Elly Ball"
- Generic names: "Othership Guide", "Free Flow Guide"
- Inconsistent formatting: "BECCA JACOBS" vs "Becca Jacobs"
**Approach:**
1. Split on ` | ` delimiter
2. Title case normalization
3. Generate slug: "becca_jacobs"
4. Map generic guides to company defaults
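A sketch of that approach, assuming the ` | ` delimiter and the generic guide names listed above:

```python
import re

GENERIC_GUIDES = {"othership guide", "free flow guide"}

def normalize_instructors(raw: str, company: str) -> dict:
    """Split "Arkaya | Elly Ball" into up to three title-cased names plus a slug."""
    names = [n.strip().title() for n in raw.split("|") if n.strip()]
    # Map generic guides to the company default rather than a person
    names = [company if n.lower() in GENERIC_GUIDES else n for n in names]
    names += [""] * (3 - len(names))  # pad out instructor/instructor2/instructor3
    slug = re.sub(r"[^a-z0-9]+", "_", names[0].lower()).strip("_")
    return {
        "instructor": names[0],
        "instructor2": names[1],
        "instructor3": names[2],
        "instructor_normalized": slug,  # e.g. "becca_jacobs"
    }
```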
---
## Possible Cleanup Approaches
### Approach 1: Python Script (Recommended)
**Pros:**
- Full control over transformations
- Can run locally or in CI/CD
- Easy to version control
- Can output to multiple formats
**Cons:**
- Need to maintain code
- Separate from Google Sheets workflow
**Implementation:**
```
/projects/nvrbot-scraper/
└── cleanup/
    ├── transform.py    # Main transformation logic
    ├── lookups.py      # Location/time mappings
    ├── validators.py   # Data quality checks
    └── output.py       # JSON/CSV/Sheets export
```
### Approach 2: Google Sheets Formulas
**Pros:**
- Matches existing workflow
- Non-technical users can modify
- Real-time updates
**Cons:**
- Complex formulas hard to maintain
- Performance issues with large datasets
- Version control difficult
**Implementation:**
- Import raw CSV to "inputs" tab
- Use VLOOKUP for location mapping
- Use formulas to compute derived fields
- Copy-paste values to "done" tabs
### Approach 3: Hybrid (Script + Sheets)
**Pros:**
- Best of both worlds
- Script handles heavy lifting
- Sheets for final review/adjustments
**Cons:**
- Two systems to maintain
- Data sync complexity
**Implementation:**
1. Python script transforms raw ā cleaned JSON
2. Script uploads to staging sheet
3. Manual review in Sheets
4. Append to master sheet
### Approach 4: Database Pipeline
**Pros:**
- SQL for analysis
- Scales well
- Can power dashboards
**Cons:**
- More infrastructure
- Overkill for current volume
**Implementation:**
- SQLite or Postgres
- Raw ā staging ā clean tables
- Views for analysis
---
## Recommended Approach
**Hybrid (Approach 3)** with Python script + Google Sheets integration:
1. **Script** does:
- Load raw JSON/CSV
- **Deduplicate records first** (re-use scraper's logic from `src/main.py`)
- Apply all transformations
- Validate data quality
- Output cleaned JSON + CSV
- Optionally push to Google Sheets staging tab
2. **Google Sheets** for:
- Visual review
- Manual corrections
- Final append to master data
- Pivot tables and analysis
3. **Automation** via:
- Cron job for daily scrape
- Cron job for daily cleanup
- Alert on data quality issues
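A hypothetical driver for the script half, assuming a `transform_records()` entry point (placeholder name, not the real module API):

```python
import json
import sys

from cleanup.transform import transform_records  # assumed entry point

def run_cleanup(raw_path: str, out_path: str) -> None:
    with open(raw_path) as f:
        records = json.load(f)
    # Dedupe -> transform -> validate happens inside transform_records (assumed)
    cleaned, issues = transform_records(records)
    if issues:
        print(f"ALERT: {len(issues)} data quality issues", file=sys.stderr)
    with open(out_path, "w") as f:
        json.dump(cleaned, f, indent=2)
    # Optional: push `cleaned` to a Google Sheets staging tab for review

if __name__ == "__main__":
    run_cleanup(sys.argv[1], sys.argv[2])
```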
---
## Cleanup Pipeline Implementation ✅
**Status:** ✅ **IMPLEMENTED** (2026-02-07)
**Location:** `/home/john/projects/superscaper/cleanup/transform.py`
### Implementation Details
**Pipeline Steps:**
0. **Deduplication** (FIRST STEP - CRITICAL)
- Uses composite key: `company_location_date_time_class`
- Keeps LAST occurrence when duplicates found (most recent scrape)
- Same logic as scraper (`src/main.py` functions)
1. Company name normalization
2. Room field cleaning
3. Instructor classification (named/generic)
4. Waitlist detection
5. Integer type enforcement
6. Fill rate calculation
7. Normalized location field
8. Time parsing (24h format + hour extraction)
**Usage:**
```bash
# JSON output only
python cleanup/transform.py input.json output.json
# JSON + CSV output
python cleanup/transform.py input.json output.json --csv output.csv
# Example with actual files
cd /home/john/projects/superscaper
./venv/bin/python3 cleanup/transform.py nvrbot_scrape_20260206.json cleaned.json --csv cleaned.csv
```
**Testing Results:**
| Test | Input Records | Duplicates Found | Output Records | Errors | Status |
|------|---------------|------------------|----------------|---------|---------|
| Small dataset (Feb 6) | 269 | 0 | 269 | 0 | ✅ Pass |
| Synthetic duplicates | 279 | 10 | 269 | 0 | ✅ Pass |
| Large dataset (Feb 4) | 35,679 | 5 | 35,674 | 0 | ✅ Pass |
**Output Schema:**
The cleaned output includes all original fields plus:
- `instructor_type`: "named" or "generic"
- `instructor_normalized`: slug format (e.g., "becca_jacobs")
- `is_waitlist`: boolean flag
- `fill_rate`: decimal (filled/total)
- `location_normalized`: "Company - Location"
- `time_24h`: 24-hour format (e.g., "08:00")
- `hour`: integer 0-23
**Documentation:**
- `/home/john/projects/superscaper/cleanup/README.md` - Updated with deduplication details
- Pipeline follows spec in STATUS.md "Second-Level Cleanup Spec"
**Next Steps:**
1. ~~Implement cleanup pipeline~~ ✅ DONE
2. Test on production data → ✅ DONE
3. Integrate with daily scraper workflow (optional automation)
4. Add Google Sheets export capability (future enhancement)
5. Schedule daily cleanup cron job (future automation)
---
## Data Quality Issues to Handle
| Issue | Example | Solution |
|-------|---------|----------|
| **Duplicate records** | Same class scraped twice | **Dedupe on composite key** (company_location_date_time_class) - FIRST STEP in pipeline |
| Missing availability | S&T shows 0/0/0 | Flag as "no_data", exclude from fill rate analysis |
| Multiple instructors | "Arkaya \| Elly Ball" | Split to instructor1/2/3 |
| Date format variations | "01-31-2026" vs "2026-01-31" | Normalize to ISO internally |
| Location name changes | "Carlsbad Studio" vs "Carlsbad" | Lookup table normalization |
| Class name variations | "Free Flow" vs "FreeFlow" | Fuzzy matching + manual mapping |
| Timezone issues | EST vs PST studios | Store in local time with TZ indicator |
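For example, the missing-availability row above could be handled by a tiny validator (the `no_data` value follows the table; the function itself is hypothetical):

```python
def flag_no_data(rec: dict) -> dict:
    """Mark S&T-style 0/0/0 records so fill-rate analysis can exclude them."""
    counts = (rec.get("open_spots"), rec.get("total_spots"), rec.get("filled_spots"))
    if all(c in (0, None) for c in counts):
        rec["status"] = "no_data"
    return rec
```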
---
## Current Studios (8 total)
| Studio | Company | Location | Platform | Status |
|--------|---------|----------|----------|--------|
| BD_CARLSBAD | Breathe Degrees | Carlsbad, CA | Mariana Tek | ✅ Working |
| BD_LIBERTY | Breathe Degrees | Liberty Station, SD | Mariana Tek | ✅ Working |
| OS_TORONTO | OtherShip | Toronto (Adelaide + Yorkville) | Mariana Tek | ✅ Working |
| OS_FLATIRON | OtherShip | NYC (Flatiron) | Mariana Tek | ✅ Working |
| OS_WILLIAMSBURG | OtherShip | Brooklyn (Williamsburg) | Mariana Tek | ✅ Working |
| MZ_MYRTLE | MindZero | Myrtle Beach, SC | Mariana Tek | ✅ Working |
| ST_YONGE | Sweat and Tonic | Toronto (Yonge) | Mariana Tek | ⚠️ No avail data |
| ST_FRONT | Sweat and Tonic | Toronto (Front) | Mariana Tek | ⚠️ No avail data |
---
## Studios Not Yet Added
### Momence Platform (Requires Separate Scraper)
| Studio | Location | Platform | Notes |
|--------|----------|----------|-------|
| Soul Plunge | La Jolla, CA | Momence | host_id: 37373 |
| Conscious Body Recovery | San Diego | Momence | boardId: 85694 |
| Conscious Body Recovery | Temecula | Momence | boardId: 76949 |
**Momence Limitation:** Only exposes binary availability (open/full), not spot counts.
---
## Edge Case: Late-Day Additions
**Issue Identified:** 2026-02-04
**Status:** ⚠️ Requires Implementation
### The Problem
Current scraper starts from `SCRAPED_THROUGH + 1`, which can miss classes added after the scrape but before midnight.
**Example scenario:**
```
Feb 3, 11:00 PM: Scraper runs, captures Feb 3 classes
                 Sets SCRAPED_THROUGH = 2026-02-03
Feb 3, 11:30 PM: Studio adds new class for Feb 3 schedule
Feb 4, 11:00 PM: Scraper starts from Feb 4
                 ❌ Missed: the 11:30pm addition to Feb 3
```
### Analysis
**Scenarios:**
| Scenario | Risk Level | Impact |
|----------|-----------|--------|
| **Late additions** | ⚠️ HIGH | Studios add last-minute slots 11pm–midnight → LOST with current logic |
| **Spot count updates** | ⚠️ LOW | Minor – we have point-in-time snapshots, not tracking real-time changes |
| **Cancellations** | ✅ ACCEPTABLE | "Ghost" records actually valuable (shows schedule volatility) |
| **Reschedules** | ✅ ACCEPTABLE | Multiple time slots visible (tracks changes) |
### Proposed Solution: Re-scrape Last Date + Deduplicate
**Change:**
```python
# OLD: start_date = studio_config.scraped_through + timedelta(days=1)
# NEW: start_date = studio_config.scraped_through # Re-scrape last date
```
**Add deduplication:**
```python
def generate_record_id(record):
    """Create unique composite key."""
    return f"{record['company']}_{record['location']}_{record['classDate']}_{record['time']}_{record['class']}"

def deduplicate_records(records):
    """Keep most recent record per unique ID."""
    seen = {}
    for record in records:
        record_id = generate_record_id(record)
        seen[record_id] = record  # later record overwrites (handles spot updates)
    return list(seen.values())
```
**Unique key example:**
```
OtherShip_Williamsburg_2025-09-15_07:00 AM_Guided Down: Sound Immersion
```
### Impact Analysis
| Metric | Current | Proposed | Change |
|--------|---------|----------|--------|
| Data completeness | 95-98% | 99-100% | +2-5% |
| Scrape volume/day | ~600 classes | ~1,200 classes | +100% |
| Execution time | ~10 min | ~15 min | +50% |
| Disk usage growth | Minimal | +2-5% (dedup mitigates) | Minor |
| Late additions | ❌ Lost | ✅ Captured | ✅ Fixed |
| Spot count snapshots | One/day | Two/day | Bonus |
### Recommendation
**✅ IMPLEMENT** – Data quality justifies the overhead
**Rationale:**
1. **Real risk**: Studios DO add classes late in the day (observed behavior)
2. **Acceptable cost**: 50% more execution time, minimal storage impact
3. **Data quality > efficiency**: Complete data more important than speed
4. **Bonus benefit**: Captures spot count updates (nice for fill rate trends)
### Alternative Considered: 2-Day Overlap
**Rejected:**
- Re-scrape last 2 days (SCRAPED_THROUGH - 1)
- 200% overhead vs 100%
- Overkill – a 1-day overlap is sufficient for this use case
## Single Source of Truth (SSOT)
**File:** `data-state.json`
**Purpose:** Canonical state for all scraper data – replaces fragmented tracking across multiple files
**Auto-updated by:**
- Scraper (after each run): `./scripts/update-data-state.sh scraper`
- Cleanup pipeline (future): `./scripts/update-data-state.sh cleanup`
- Manual refresh: `./scripts/update-data-state.sh manual`
**Contains:**
- Per-studio: scrapedThrough dates, raw/processed record counts, date ranges
- File inventory: all JSON files with sizes and record counts
- Totals: aggregate stats
- Pipeline status: scraper/cleanup/merge process state
**Human-readable docs:** STATUS.md tables should be generated FROM data-state.json (not maintained separately)
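A sketch of that generation step, with assumed key names (`studios`, `scrapedThrough`, `notes`) since the exact `data-state.json` layout isn't shown here:

```python
import json

with open("data-state.json") as f:
    state = json.load(f)

# Emit the per-studio status table in STATUS.md format
print("| Studio | Last Scraped | Status | Notes |")
print("|--------|--------------|--------|-------|")
for studio_id, s in state["studios"].items():
    print(f"| {studio_id} | {s['scrapedThrough']} | ✅ Current | {s.get('notes', '')} |")
```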
---
## Key Files
### Scraper Code
- `/home/john/projects/superscaper/src/main.py` – Entry point (includes auto-merge)
- `/home/john/projects/superscaper/src/scraper.py` – Selenium scraper
- `/home/john/projects/superscaper/src/parser.py` – Data parser
- `/home/john/projects/superscaper/src/supabase_sync.py` – Supabase push logic
- `/home/john/projects/superscaper/scripts/merge_to_master.py` – Master.json merge script
### Configuration
- `/home/john/projects/superscaper/.env` – Studio configs & SCRAPED_THROUGH dates
- `/home/john/projects/superscaper/scraper.log` – Detailed run logs
### Data Files (Priority Order)
1. **`processed/master.json`** – ✅ Canonical local dataset (auto-updated)
2. **Supabase `classes` table** – Production database (auto-synced)
3. `nvrbot_scrape_YYYYMMDD.json` – Daily scrape outputs
### Project Docs
- `~/.openclaw/workspaces/main/projects/nvrbot-scraper/STATUS.md` – This file
- `~/.openclaw/workspaces/main/skills/nvrbot-scraper/SKILL.md` – Skill automation docs
- `/home/john/projects/superscaper/CLAUDE.md` – Technical scraper docs