AEO Bunny — Deployment Runbook

Last updated: 2026-03-25
Scope: Ongoing deployment operations after the system is live

This file is the authoritative deployment guide for AEO Bunny (supersedes the retired PRODUCTION_DEPLOYMENT.md).


1. Overview

AEO Bunny runs across four services:

| Component | Platform | What it does |
|---|---|---|
| Backend API | Railway | FastAPI app — pipeline orchestration, AI agents, all business logic |
| Frontend Portal | Vercel | Next.js app — admin dashboard and customer portal (same app, role-based routing) |
| Database | Supabase | PostgreSQL via Session Pooler — all persistent state, auth (JWKS/ECC P-256) |
| Asset Storage | Cloudflare R2 | Images, assembled HTML pages, ZIP bundles, avatars |

External integrations (Anthropic, OpenAI, Perplexity, GHL, DataForSEO, Gemini, PageSpeed Insights) are called from the backend only.

graph TB
    subgraph "Clients"
        Browser["Browser (Admin + Customer)"]
    end

    subgraph "Vercel"
        Frontend["Next.js Portal<br/>portal.aireadyplumber.com"]
    end

    subgraph "Railway"
        Backend["FastAPI Backend<br/>api.aireadyplumber.com"]
    end

    subgraph "Supabase"
        DB["PostgreSQL<br/>(Session Pooler)"]
        Auth["Supabase Auth<br/>(JWKS)"]
    end

    subgraph "Cloudflare"
        R2["R2 Object Storage"]
    end

    subgraph "External APIs"
        Anthropic["Anthropic Claude"]
        OpenAI["OpenAI"]
        Perplexity["Perplexity Sonar"]
        GHL["GoHighLevel"]
    end

    Browser --> Frontend
    Frontend -->|API calls| Backend
    Browser -->|JWT auth| Auth
    Backend --> DB
    Backend --> R2
    Backend --> Anthropic
    Backend --> OpenAI
    Backend --> Perplexity
    Backend --> GHL

2. Release Workflow

A standard release follows this order: Database -> Backend -> Frontend. This ordering is critical.

Why this order matters

  1. Database first — Migrations add columns/tables the backend expects. If the backend deploys before the migration runs, it will fail on missing columns.
  2. Backend second — New API endpoints must be live before the frontend tries to call them. The backend must also be able to handle both old and new DB schemas during the migration window.
  3. Frontend last — The frontend calls backend endpoints. If it deploys first and calls a new endpoint that does not exist yet, users see errors.

Step-by-step

1. Merge PR to main (or prepare the commit on main)
2. Run pre-deploy checklist (Section 10)
3. Apply database migrations (Section 5)
4. Deploy backend to Railway (Section 3)
5. Verify backend health
6. Deploy frontend to Vercel (Section 4)
7. Run post-deploy checklist (Section 11)

For changes that only touch the frontend (UI tweaks, copy changes), you can skip steps 3-5 and deploy the frontend directly.

For changes that only touch the backend with no migration, skip step 3.


3. Backend Deployment (Railway)

How it deploys

Railway deploys from the GitHub repo. The service is configured with root directory aeo_bunny. Every push to the connected branch triggers an automatic build.

The build process:

  1. Railway detects the Dockerfile in aeo_bunny/
  2. Builds the Docker image (Python 3.12-slim, installs system deps, pip requirements, Playwright Chromium)
  3. Starts the container with: uvicorn app.main:app --host 0.0.0.0 --port 8080

Connection pool settings (SQLAlchemy + asyncpg + Supabase pooler)

For production on Railway with the Supabase Session Pooler, use these SQLAlchemy engine parameters:

pool_size=5, max_overflow=10, pool_recycle=180, pool_timeout=30, pool_pre_ping=True

pool_recycle=180 (3 minutes) prevents stale connections from being reused after the pooler's idle timeout. pool_pre_ping=True issues a lightweight check before each connection is handed out. If you see Connection reset errors under load, lower pool_size first — the Supabase free tier has a hard connection cap.
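As a minimal sketch, the settings above map onto SQLAlchemy engine keyword arguments, and the asyncpg driver requirement can be guarded up front (ENGINE_KWARGS and check_database_url are illustrative names, not the app's actual code):

```python
from urllib.parse import urlsplit

# Pool settings from this runbook, as they would be passed to
# SQLAlchemy's create_async_engine (illustrative constants only).
ENGINE_KWARGS = {
    "pool_size": 5,         # connections held open per worker
    "max_overflow": 10,     # extra connections allowed under burst load
    "pool_recycle": 180,    # recycle before the pooler's idle timeout
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_pre_ping": True,  # validate a connection before handing it out
}

def check_database_url(url: str) -> None:
    # Guards against the OperationalError listed under startup failures:
    # the async backend requires the asyncpg driver in the URL scheme.
    scheme = urlsplit(url).scheme
    if scheme != "postgresql+asyncpg":
        raise ValueError(
            f"DATABASE_URL scheme is {scheme!r}; expected 'postgresql+asyncpg'"
        )
```

A wrong scheme (plain postgresql://) is one of the most common misconfigurations, so failing fast on it is cheaper than debugging a connection error at runtime.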

Zero-downtime deploys

Railway deploys with zero downtime by default. When a new deployment becomes active, the old one stays alive briefly for a configurable overlap period. Once the new deployment's health check passes, Railway routes traffic to it and sends SIGTERM to the old container.

  • Health check gating: Configure a health check endpoint (GET /health) in Railway service settings. Railway will only route traffic to the new deployment after it responds with HTTP 200.
  • Overlap period: Controlled by RAILWAY_DEPLOYMENT_OVERLAP_SECONDS service variable (default is a few seconds). Increase this if your app needs time to drain in-flight requests.
  • Graceful shutdown: On SIGTERM, the FastAPI/Uvicorn server stops accepting new connections and finishes in-flight requests before exiting. If it doesn't shut down in time, Railway sends SIGKILL.

Triggering a deploy

Automatic: Push to the connected branch (usually main). Railway picks up the commit and starts building.

Manual (Railway dashboard):

  1. Go to the Railway project dashboard
  2. Select the backend service
  3. Click "Deploy" and choose the commit to deploy

Manual (CLI):

cd aeo_bunny
railway up

Environment variable management

Add or update env vars in the Railway dashboard under the service's "Variables" tab. Changes to env vars trigger an automatic redeploy.

# View current variables (requires railway CLI + linked project)
railway variables

# Set a variable
railway variables set KEY=value

For the full list of required and optional env vars, see docs/internal/SECRETS_INVENTORY.md.

Checking deploy status and logs

Dashboard: Railway project -> service -> "Deployments" tab shows build/deploy history with status.

CLI:

# Stream live logs
railway logs

# View recent logs
railway logs --tail 200

Health check verification

After deploy, confirm the backend is alive:

curl https://api.aireadyplumber.com/health
# Expected: {"status":"ok"}

The health endpoint is GET /health (no auth required). It returns {"status": "ok"} if the FastAPI app started successfully.

Monitoring for startup failures

Watch the first 30 seconds of logs after deploy. Common startup failures:

| Symptom | Likely cause |
|---|---|
| ModuleNotFoundError | Missing dependency in requirements.txt |
| Connection refused on DB | DATABASE_URL is wrong or Supabase is down |
| sqlalchemy.exc.OperationalError | DB URL format issue (must use postgresql+asyncpg://) |
| Container exits immediately | Syntax error in Python code (check build logs) |
| Port binding error | Another process on port 8080 (unlikely on Railway) |
| Revision recovery warning | Non-fatal — the startup sweeper failed but the app is still running |
| "Scheduled scan loop starting" not in logs | SCAN_ENABLED is false, or the scan task crashed immediately |

Scan scheduler background task

The app starts a long-lived asyncio.Task for scheduled visibility scans on every boot (scheduled_scan_loop in app/visibility/scan_scheduler.py). It ticks every hour, queries deployed projects, and fires scans for any project that is due based on the configured frequencies.

Startup log: "Scheduled scan loop starting (interval=3600s)" — confirms the task launched.

Graceful shutdown: On SIGTERM, the app calls request_stop(), which sets a stop event. If the loop is sleeping, the event wakes it and it exits immediately; otherwise it exits once the current tick finishes, so shutdown takes at most one tick. In-flight scans are allowed to finish. The app waits up to 30 seconds for the loop to stop before forcibly cancelling the task.

Disabling scans: Set SCAN_ENABLED=false in Railway variables. The loop will still start but each hourly tick will exit immediately without touching the DB.

Relevant env vars (all optional — Railway Variables tab):

| Variable | Default | Description |
|---|---|---|
| SCAN_ENABLED | true | Master switch for scheduled visibility scans |
| SCAN_FREQUENCY_EARLY_DAYS | 7 | Days between scans during the first 30 days after deployment |
| SCAN_FREQUENCY_STEADY_DAYS | 30 | Days between scans after the first 30 days |

These can also be overridden at runtime via the admin Settings page (DB-overlay), without requiring a redeploy.


4. Frontend Deployment (Vercel)

How it deploys

Vercel deploys from the GitHub repo with root directory set to portal. Every push to main triggers a production deployment. Pushes to other branches create preview deployments.

The build process:

  1. Vercel detects next.config.ts
  2. Runs npm install then npm run build (which runs next build)
  3. Deploys the built Next.js app to Vercel's edge network

Preview deployments

Every pull request automatically gets a preview URL (e.g., https://portal-git-feature-branch.vercel.app). Use these to test frontend changes before merging to main.

Preview deployments use the same env vars as production by default. To use different values for preview, set them with the "Preview" environment target in Vercel's settings.

Environment variable management

Frontend env vars are set in the Vercel dashboard under the project's "Settings" -> "Environment Variables" tab.

The three required frontend variables:

NEXT_PUBLIC_API_URL=https://api.aireadyplumber.com
NEXT_PUBLIC_SUPABASE_URL=https://[ref].supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=eyJ...

All frontend env vars used in browser code must be prefixed with NEXT_PUBLIC_. Without this prefix, Next.js never inlines them into the client bundle, so they are undefined in the browser.

Changing env vars in Vercel does NOT trigger an automatic redeploy. You must redeploy manually for the new values to take effect:

# From the portal directory
vercel --prod

Or trigger a redeploy from the Vercel dashboard.

Checking deploy status

Dashboard: Vercel project -> "Deployments" tab shows build history with status, build logs, and preview URLs.

CLI:

vercel ls  # List recent deployments


5. Database Migrations (Supabase + Alembic)

The backend uses Alembic for schema migrations. Migration files live in aeo_bunny/migrations/versions/. The env.py reads DATABASE_URL from app config and uses async SQLAlchemy.

Creating a new migration

cd aeo_bunny

# Auto-generate from model changes
alembic revision --autogenerate -m "description_of_change"

# Or create an empty migration for manual SQL
alembic revision -m "description_of_change"

Review the generated file in migrations/versions/ before applying. Auto-generated migrations can miss things or generate incorrect operations.

Testing migrations locally

# Start local Postgres (docker-compose in aeo_bunny/)
docker compose up -d

# Set local DATABASE_URL (the compose file exposes port 5433)
export DATABASE_URL="postgresql+asyncpg://postgres:postgres@localhost:5433/aeo_bunny"

# Apply all migrations
cd aeo_bunny
alembic upgrade head

# Verify the migration applied
alembic current

# Test rollback
alembic downgrade -1

# Re-apply to confirm idempotency
alembic upgrade head

Previewing migration SQL before applying

Always review what Alembic will execute before running against production:

cd aeo_bunny

# Show the SQL that would be executed (does NOT apply changes)
railway run alembic upgrade head --sql

This prints the raw SQL statements without executing them. Review for correctness, especially for data migrations or column type changes.

Applying migrations to production

cd aeo_bunny

# Link to your Railway project (one-time setup)
railway link

# Run migrations through Railway's environment (uses production DATABASE_URL)
railway run alembic upgrade head

# Verify current migration state
railway run alembic current

Alternatively, if you have direct access to the Supabase connection string:

cd aeo_bunny
DATABASE_URL="postgresql+asyncpg://postgres.[ref]:[password]@aws-0-[region].pooler.supabase.com:5432/postgres" \
  alembic upgrade head

Migration safety rules

These rules prevent data loss and downtime:

  1. Never drop a column that is still in use. Remove the code reference first, deploy, then drop the column in a later migration.
  2. Always add new columns as nullable (or with a server default). Adding a NOT NULL column without a default locks the table and fails if rows exist.
  3. Never rename a column in one step. Instead: add new column -> deploy code that writes to both -> backfill -> deploy code that reads from new only -> drop old column.
  4. Keep migrations small and focused. One logical change per migration file.
  5. Never modify a migration that has already been applied to production. Create a new migration instead.
  6. Test both upgrade and downgrade locally before applying to production.
  7. Back up before destructive migrations. Supabase offers point-in-time recovery, but a manual backup before a risky migration is cheap insurance.
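To illustrate rules 2 and 6, a hypothetical Alembic migration that adds a single nullable column might look like this (the revision IDs and the example_flag column are invented for the example; this fragment runs under Alembic, not standalone):

```python
"""add example_flag to pipeline_runs (illustrative only)"""
from alembic import op
import sqlalchemy as sa

# Hypothetical revision identifiers; real values are generated
# by `alembic revision`.
revision = "abc123"
down_revision = "def456"

def upgrade() -> None:
    # Rule 2: the new column is nullable, so existing rows are
    # untouched and the table is not locked by a NOT NULL backfill.
    op.add_column(
        "pipeline_runs",
        sa.Column("example_flag", sa.Boolean(), nullable=True),
    )

def downgrade() -> None:
    # Rule 6: test this direction locally too before production.
    op.drop_column("pipeline_runs", "example_flag")
```

One logical change, a symmetric downgrade, and no destructive operation: that is the shape most routine migrations should have.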

Rolling back a migration

cd aeo_bunny

# Roll back the most recent migration
railway run alembic downgrade -1

# Roll back to a specific revision
railway run alembic downgrade <revision_id>

# Check where you are
railway run alembic current

# See migration history
railway run alembic history

Rollback limitations:

  • Data-only migrations (INSERT/UPDATE/DELETE) cannot be automatically undone if the downgrade function was not written
  • Dropped columns are gone unless you restore from backup
  • Always check the downgrade() function in the migration file before relying on it


6. Rollback Procedures

Backend rollback (Railway)

Railway keeps a history of deployments. To rollback:

  1. Dashboard: Go to the service's "Deployments" tab -> find the previous working deployment -> click the three-dot menu -> "Rollback"
  2. CLI: Push a revert commit, or use railway up with the previous commit checked out

A rollback restores both the Docker image AND custom variables from the previous deployment. This is fast (1-2 minutes) because Railway reuses the cached image — no rebuild needed.

Note: Deployments older than your plan's retention period cannot be restored via rollback.

Frontend rollback (Vercel)

Vercel makes rollback instant:

  1. Dashboard: Go to "Deployments" -> find the previous production deployment -> click the three-dot menu -> "Promote to Production"
  2. This is effectively instant (no rebuild, just re-routes traffic to the previous build)

Database rollback

Database rollbacks are the riskiest:

| Scenario | Can you roll back? | How |
|---|---|---|
| Added a new column | Yes | alembic downgrade -1 |
| Added a new table | Yes | alembic downgrade -1 |
| Dropped a column | No (data is gone) | Restore from Supabase backup |
| Data migration (UPDATE/INSERT) | Only if downgrade() was written | alembic downgrade -1 |
| Changed column type | Depends on data loss | Check the specific migration |

When to rollback vs. hotfix forward

Rollback when:

  • The deploy is completely broken (500 errors, app won't start)
  • A regression affects a critical path (login, pipeline, payments)
  • The fix is not obvious and will take more than 30 minutes

Hotfix forward when:

  • The issue is minor (cosmetic, non-blocking)
  • You know exactly what the fix is
  • Rolling back would undo other important changes in the same deploy
  • The migration cannot be safely rolled back


7. Hotfix Procedure

For critical production issues that need an immediate fix:

# 1. Branch from main
git checkout main
git pull
git checkout -b hotfix/description-of-fix

# 2. Make the fix
# ... edit files ...

# 3. Run tests
cd aeo_bunny && python -m pytest
cd ../portal && npm run lint

# 4. Commit and push
git add <changed-files>
git commit -m "fix: description of the critical issue"
git push -u origin hotfix/description-of-fix

# 5. Merge to main (fast-track: skip PR review if severity warrants it)
git checkout main
git merge hotfix/description-of-fix
git push origin main

# 6. Railway and Vercel auto-deploy from main
# 7. Monitor logs and verify the fix (Section 9)

After the hotfix is deployed:

  • Verify the fix is working in production
  • If you skipped PR review, create a follow-up PR for documentation
  • Notify the team about what happened and what was changed
  • Check if the root cause needs a deeper fix


8. Environment Management

Production vs. staging

Currently there is no staging environment. The workflow is:

  • Local development with docker compose up (local Postgres on port 5433)
  • Vercel preview deployments for frontend changes (auto-created per PR)
  • Production on Railway + Vercel

Seeding a local or staging database with test data

Two scripts in aeo_bunny/scripts/ make it fast to populate a database with realistic test data spanning all pipeline stages:

cd aeo_bunny

# Seed 8 test projects (3 admin + 8 customer users, ~1094 records across 32 tables)
python scripts/seed_test_data.py

# Clean + reseed (idempotent — safe to run repeatedly)
python scripts/seed_test_data.py --force

# Drop and recreate all tables, then seed (nuclear option — local only)
python scripts/seed_test_data.py --reset-schema

# Remove all seeded data without touching other records
python scripts/clean_test_data.py

The seed script uses deterministic UUIDs (uuid5) so every run produces the same record IDs. All seed records are tagged so clean_test_data.py can remove them without affecting real data.
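Deterministic IDs of this kind come from the standard library's uuid5, which hashes a namespace plus a name, so the same inputs always yield the same UUID. A sketch (the namespace constant and key format here are invented; the real script defines its own):

```python
import uuid

# Hypothetical seed namespace; the real script defines its own constant.
SEED_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "aeo-bunny-seed")

def seed_id(table: str, key: str) -> uuid.UUID:
    # uuid5 is a SHA-1 hash of namespace + name, so reseeding never
    # changes record IDs, and cleanup can target them precisely.
    return uuid.uuid5(SEED_NAMESPACE, f"{table}:{key}")
```

This is what makes `--force` reseeding idempotent: every run writes the same primary keys, so upserts and targeted deletes are trivial.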

Important: Always run alembic upgrade head before seeding if the schema has changed since the last seed run. seed_test_data.py uses create_all() which adds missing tables but does not add missing columns — Alembic migrations are the authoritative schema source. See TD-063.

When a staging environment is added, it should mirror production with its own:

  • Railway service (same repo, different branch or manual deploy)
  • Vercel project or preview deployment pinned as "staging"
  • Supabase project (separate from production)
  • R2 bucket (separate from production)

Adding or changing env vars

Railway:

  1. Dashboard -> Service -> Variables tab -> Add or edit
  2. Save. Railway automatically redeploys with the new values.
  3. Or via CLI: railway variables set KEY=value

Vercel:

  1. Dashboard -> Project -> Settings -> Environment Variables -> Add or edit
  2. Save. This does NOT trigger a redeploy. You must redeploy manually.
  3. Or via CLI: vercel env add KEY (interactive) or set in dashboard

Secret rotation procedure

When rotating a secret (API key, DB password, webhook secret):

  1. Generate the new secret from the provider (Anthropic, Supabase, etc.)
  2. Update in Railway (and/or Vercel if it is a NEXT_PUBLIC_ key)
  3. Wait for the redeploy to complete and verify the app is healthy
  4. Revoke the old secret from the provider only after confirming the new one works
  5. Never revoke-then-update — this creates downtime

For DATABASE_URL rotation (Supabase password change):

  1. Change the password in the Supabase dashboard
  2. Update DATABASE_URL in Railway with the new password
  3. Wait for redeploy and verify the health endpoint
  4. The old password is automatically invalidated by Supabase

For GHL_WEBHOOK_SECRET rotation (legacy fallback only):

The Ed25519 migration is already complete. Inbound GHL webhooks are now authenticated via X-GHL-Signature (Ed25519) as the primary method, with GHL's public key hardcoded in app/api/ghl_inbound.py. The legacy X-GHL-Secret HMAC path is retained as a fallback for any webhooks that still arrive without the Ed25519 header. Secret rotation only applies to this legacy path:

  1. Set the new secret in both GHL and Railway simultaneously
  2. There may be a brief window where in-flight webhooks use the old secret — these will fail with 401 and GHL will retry
  3. Once all webhooks arrive with X-GHL-Signature, the GHL_WEBHOOK_SECRET env var can be removed entirely

9. Monitoring After Deploy

What to watch in Railway logs (first 5 minutes)

After every deploy, watch the logs for:

# Good signs:
"AEO Bunny started"              # App booted successfully
"Uvicorn running on 0.0.0.0:8080" # Server is listening

# Warning signs (non-fatal):
"Revision recovery sweeper failed on startup" # Startup task failed, app still works

# Bad signs (investigate immediately):
"Connection refused"               # DB connection failed
"ModuleNotFoundError"              # Missing dependency
Repeated 500 errors in logs        # Something is broken in request handling
"RateLimitExceeded" from providers # API keys may be invalid or exhausted

Health check endpoints to hit

# Backend health (no auth)
curl https://api.aireadyplumber.com/health
# Expected: {"status":"ok"}

# Frontend — just load these pages in a browser
https://portal.aireadyplumber.com/login
https://portal.aireadyplumber.com/admin

Common post-deploy failures

| Symptom | Cause | Fix |
|---|---|---|
| 502 from Railway | App crashed on startup | Check Railway build/deploy logs |
| Health returns 200 but API calls fail | Missing env var for a specific feature | Check Railway logs for the specific error |
| Frontend loads but API calls fail | CORS issue — ALLOWED_ORIGINS does not include the portal domain | Update ALLOWED_ORIGINS in Railway |
| Login fails | Supabase auth config mismatch (URL, keys, redirect URLs) | Verify all three Supabase vars match between frontend and backend |
| Pipeline triggers but immediately fails | ANTHROPIC_API_KEY invalid or missing | Check the key in Railway variables |
| GHL inbound webhooks rejected (401) | Ed25519 signature verification failing, or legacy GHL_WEBHOOK_SECRET mismatch | Check Railway logs for "Ed25519 signature verification failed" (logged at DEBUG level — set LOG_LEVEL=DEBUG to see it). Primary inbound auth is X-GHL-Signature (Ed25519, public key in ghl_inbound.py); the legacy fallback uses GHL_WEBHOOK_SECRET. For outbound, verify GHL_WEBHOOK_URL in Railway variables |
| Images/HTML not loading | R2 credentials or R2_PUBLIC_URL wrong | Check R2 vars in Railway |
| "Connection reset" from DB | Supabase pooler overloaded or DATABASE_URL wrong | Verify the URL uses postgresql+asyncpg:// and the Session Pooler |
| ProgrammingError: relation "system_broadcasts" does not exist | Migration not applied | Run railway run alembic upgrade head |
| Photo upload returns 500 | photo_gate_passed column missing on pipeline_runs | Run railway run alembic upgrade head |

How to verify the pipeline still works

If the deploy touched pipeline code, trigger a test run:

  1. Log into the admin portal
  2. Find a test project (or create one via the onboarding endpoint)
  3. Trigger the pipeline
  4. Watch Railway logs for the pipeline progression through Phase A (BI + Strategy)
  5. Confirm it pauses at the expected operator gate

For a faster smoke test, check that the API responds to authenticated requests:

# Replace TOKEN with a valid JWT
curl -H "Authorization: Bearer TOKEN" \
  https://api.aireadyplumber.com/api/v1/admin/projects

10. Pre-Deploy Checklist

Copy this checklist before every deploy:

- [ ] All tests pass locally (`cd aeo_bunny && python -m pytest`)
- [ ] Frontend lints clean (`cd portal && npm run lint`)
- [ ] Migration tested on local DB (if applicable)
- [ ] Migration downgrade tested locally (if applicable)
- [ ] No new required env vars (or they have been added to Railway/Vercel BEFORE deploying)
- [ ] ALLOWED_ORIGINS updated if domain changed
- [ ] No breaking API changes (or frontend handles both old and new response shapes)
- [ ] CORS config unchanged (or updated in both backend and frontend)
- [ ] requirements.txt updated if new Python dependencies added
- [ ] portal/package.json updated if new JS dependencies added

11. Post-Deploy Checklist

Run through this after every production deploy:

- [ ] Backend health endpoint returns 200 (`curl https://api.aireadyplumber.com/health`)
- [ ] Admin dashboard loads (`/admin`)
- [ ] Customer portal loads (`/login`)
- [ ] Login works for admin role
- [ ] Login works for customer role
- [ ] Test pipeline trigger works (if pipeline code changed)
- [ ] GHL webhooks still delivering (check GHL dashboard for recent deliveries)
- [ ] Visibility score page loads (if visibility code changed)
- [ ] Readiness score page loads (if readiness code changed)
- [ ] File uploads work — R2 connectivity (if storage code changed)
- [ ] "Scheduled scan loop starting" appears in Railway logs after boot (if scan scheduler code changed)
- [ ] Broadcast banner: create a test broadcast from admin settings, verify it shows on customer dashboard, then deactivate (if broadcast code changed)
- [ ] Photo upload page loads at `/portal/photos` (if photo collection code changed)
- [ ] Responsive layout correct at 375px, 768px, and 1280px viewports (if frontend code changed)
- [ ] No `"Revision recovery sweeper failed on startup"` warning in Railway logs (if revision code changed — absence of any sweeper log is normal, it only logs when stale records are found or on failure)
- [ ] Readiness check triggers correctly via API (if readiness code changed — ReadinessEngine is instantiated on-demand, not at startup)
- [ ] No new errors in Railway logs (watch for 5 minutes)

12. Feature-Specific Deployment Notes

System Broadcast Banner

The system_broadcasts table backs the admin banner shown on all customer dashboards. The migration must be applied before deploying; the backend will not start cleanly if the table is absent.

Migration required: system_broadcasts table with columns id, message, is_active, created_by (FK → users.id), created_at, deactivated_at. Two constraints: a partial unique index enforcing at most one active broadcast at a time (uq_broadcast_active WHERE is_active = true) and a CHECK constraint (LENGTH(message) <= 500).

Broadcast API endpoints (super_admin only):

| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/admin/broadcast | Create a new broadcast (atomically deactivates any existing one) |
| POST | /api/v1/admin/broadcast/deactivate | Deactivate the current broadcast |
| GET | /api/v1/admin/broadcast | Get current broadcast (admin view — all fields) |
| GET | /api/v1/portal/broadcast | Get current broadcast (customer view — message, id, created_at) |

The customer GET /portal/broadcast endpoint is authenticated (any role). The admin endpoints require super_admin.

Post-deploy check: Create a test broadcast from the admin settings card and verify it appears on the customer dashboard. Deactivate it afterwards.


Photo Collection System

The photo upload system introduces a hard gate (GATE_PHOTO_UPLOAD) that blocks batch 1 HTML assembly until 100 photos are uploaded. A photo_gate_passed flag on PipelineRun skips the gate for batches 2-5.

Migration required: photo_gate_passed (Boolean) column on pipeline_runs. No new tables — photos are stored in the existing images table with a UniqueConstraint(location_id, original_filename) dedup guard.

Photo API endpoints:

| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | /api/v1/portal/my-project/photos | Customer | Upload photos (all multipart files in a single request, 300-photo cap) |
| GET | /api/v1/portal/my-project/photos | Customer | Paginated photo list with quality badges |
| GET | /api/v1/portal/my-project/photo-status | Customer | Total count, gate status, quality breakdown |
| GET | /api/v1/admin/projects/{location_id}/photo-status | Admin | Admin view of photo count, gate status, quality breakdown |

Photo upload validates: magic bytes (JPEG/PNG/WEBP/GIF/AVIF), max 20 MB per file, minimum 300×300px. Duplicate filenames per location are silently skipped (counted as skipped_duplicates in the response, not rejected with 409). There is no per-request file count limit — the endpoint accepts all multipart files and processes them sequentially until the 300-photo cap is reached.

Gate auto-resume: When a photo upload pushes the total to 100 or above, the endpoint automatically resumes a pipeline paused at GATE_PHOTO_UPLOAD (CAS with SELECT FOR UPDATE). A GHL webhook photos_complete is dispatched.

Gate auto-resume details: When an upload pushes the cumulative count to 100 or above, the endpoint issues a SELECT FOR UPDATE on the pipeline_runs row and performs a CAS check: if status is "paused" or "queued" and current_step = GATE_PHOTO_UPLOAD and photo_gate_passed = false, it sets status = "queued" and photo_gate_passed = true in a single transaction, then fires run_pipeline as a background task via asyncio.create_task. If the CAS check fails (wrong status, wrong step, or flag already set), the upload still succeeds but no resume is triggered — preventing double-resume races.
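The CAS check reduces to a predicate over the locked row. A stripped-down, in-memory illustration (the real code evaluates this inside the SELECT FOR UPDATE transaction and then schedules run_pipeline; the dict stands in for the pipeline_runs row):

```python
GATE_PHOTO_UPLOAD = "GATE_PHOTO_UPLOAD"

def try_resume(run: dict, total_photos: int, threshold: int = 100) -> bool:
    # Compare-and-set: flip the run to "queued" only when every condition
    # holds; otherwise leave it untouched so a concurrent upload cannot
    # trigger a second resume.
    if total_photos < threshold:
        return False
    if run["status"] not in ("paused", "queued"):
        return False
    if run["current_step"] != GATE_PHOTO_UPLOAD:
        return False
    if run["photo_gate_passed"]:
        return False
    run["status"] = "queued"
    run["photo_gate_passed"] = True  # batches 2-5 skip the gate
    return True
```

The final flag check is the piece that makes the operation idempotent: once one upload has resumed the pipeline, every later call returns False without side effects.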

Post-deploy check: Navigate to /portal/photos as a test customer and confirm the upload zone loads. Verify the photo count card appears on the customer dashboard.


Revision Foundation

The Revision Foundation adds per-article edit/revision tracking and a Haiku-powered dossier extractor that runs when a customer finishes reviewing a batch.

Migrations required:

| Table | What was added |
|---|---|
| batches | operator_reviewed (Boolean, default false) |
| review_messages | processed_in_round (Integer, nullable) — tracks which dossier extraction round consumed each message |
| articles | review_status column (approved / edited / pending) |
| pipeline_runs | revision_round (Integer, default 0) |

If you are deploying with the batch pipeline already live, apply the migration and confirm all four columns exist before starting the backend.

Required env vars (Railway Variables):

| Variable | Default | Description |
|---|---|---|
| REVISION_COST_THRESHOLD | 1.0 | USD cost ceiling for auto-approval. When AUTO_APPROVE_REVISIONS is true and the estimated cost is at or below this value, revisions are auto-approved; above it, the revision orchestrator rejects the plan with a cost-exceeded error and operator review is required |
| AUTO_APPROVE_REVISIONS | false | If true, skip operator approval of revision plans |
| MAX_REVISION_ROUNDS | 3 | Maximum number of revision rounds allowed per batch |

These are also overridable at runtime via the admin Settings page (DB-overlay), without requiring a redeploy.
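The approval decision described in the table reduces to a small predicate. An illustrative sketch (the function and parameter names are invented; defaults mirror the table):

```python
def should_auto_approve(estimated_cost_usd: float,
                        auto_approve: bool = False,
                        threshold_usd: float = 1.0) -> bool:
    # Auto-approve only when the feature flag is on AND the estimated
    # cost is at or below the threshold; everything else goes to an
    # operator (or is rejected with a cost-exceeded error).
    return auto_approve and estimated_cost_usd <= threshold_usd
```

Note the boundary: a plan costing exactly the threshold is still auto-approved ("at or below").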

Prompt template: The Haiku dossier extractor requires prompts/v1/dossier_extractor.md.j2 to be present in the deployed image. This file is committed to the repo and is included automatically at build time. If the Railway logs show FileNotFoundError: dossier_extractor.md.j2, the image was built from a stale working tree — force a clean rebuild.

GHL webhook events dispatched:

| Event | Trigger |
|---|---|
| revision_started | When a revision is approved and execution begins (dispatched in revisions.py to both operator and customer audiences) |
| revision_complete | After the revision executor finishes all article rewrites successfully (dispatched in revision_executor.py) |
| revision_failed | After the revision executor completes with one or more article failures (dispatched in revision_executor.py) |

Post-deploy check: Confirm processed_in_round column exists on review_messages and operator_reviewed exists on batches via railway run alembic current. The revision recovery sweeper runs on startup but only logs when it finds stale records ("Revision recovery: cleaned up N stale records") or fails ("Revision recovery sweeper failed on startup"). Absence of any sweeper log line is normal and means no stale records were found.


Readiness Score

The Readiness Score system measures website crawlability, schema presence, page speed, and structured data correctness. It runs automatically after onboarding and after each deployment confirmation.

Migrations required:

| Table | Purpose |
|---|---|
| readiness_scores | One composite score record per location per check run (weighted 0-100) |
| readiness_checks | Per-category sub-scores (crawlability, schema, speed, structured_data) with detail JSON |

Both tables must be present before the backend starts — the API endpoints require them. There is no readiness scheduler; ReadinessEngine is instantiated on-demand when a check is triggered (via API, onboarding, or post-deployment).

Required env var:

| Variable | Notes |
|---|---|
| PAGESPEED_API_KEY | Google PageSpeed Insights API key. If absent, the speed checker still runs but uses unauthenticated PSI API requests (lower rate limits, same functionality). Get a key from the Google Cloud Console under the "PageSpeed Insights API" product for higher rate limits |

GHL webhook events dispatched:

| Event | Trigger |
|---|---|
| readiness_intake_complete | After the fire-and-forget intake check completes during onboarding |
| readiness_critical | On intake trigger only: when the composite score is below 40 OR the crawlability grade is "fail" |
| readiness_post_deploy_complete | After the post-deployment readiness check completes |
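The readiness_critical condition is a plain disjunction. Expressed as a sketch (the function name is invented for illustration):

```python
def is_readiness_critical(composite_score: float, crawlability_grade: str) -> bool:
    # Fires only on the intake trigger: a composite score below 40, or an
    # outright crawlability failure, marks the site as critical.
    return composite_score < 40 or crawlability_grade == "fail"
```

Note the strict inequality: a composite score of exactly 40 with a passing crawlability grade does not fire the event.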

Readiness API endpoints:

| Method | Path | Auth | Purpose |
| --- | --- | --- | --- |
| POST | /api/v1/readiness/{location_id}/check | Admin (require_admin) | Trigger an on-demand check |
| GET | /api/v1/readiness/{location_id}/scores/latest | Authenticated + location access | Latest composite + category scores |
| GET | /api/v1/readiness/{location_id}/scores/{score_id}/checks | Authenticated + location access | Per-category sub-checks for a specific score |
| GET | /api/v1/readiness/{location_id}/scores | Authenticated + location access | Score trend over time |
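
A minimal client-side sketch of triggering an on-demand check. The path matches the table above and the host comes from the overview; the helper name and token value are placeholders, and the request is built but not sent here:

```python
from urllib.request import Request

API_BASE = "https://api.aireadyplumber.com"  # production backend host

def readiness_check_request(location_id: str, admin_jwt: str) -> Request:
    """Build (but do not send) the POST that triggers an on-demand readiness check."""
    return Request(
        url=f"{API_BASE}/api/v1/readiness/{location_id}/check",
        method="POST",
        headers={"Authorization": f"Bearer {admin_jwt}"},
    )
```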

Post-deploy check: Confirm both readiness_scores and readiness_checks tables exist. If PAGESPEED_API_KEY is not set, the speed checker still makes API calls (PageSpeed Insights allows unauthenticated requests at lower rate limits) — it does not skip the category entirely.


Visibility Engine Weighting

Engine weights control how much each visibility adapter contributes to the composite score (0-100). By default all four engines have equal weight; operators can tune this to match the platforms that matter most for their customers.

Env var:

| Variable | Default | Format |
| --- | --- | --- |
| ENGINE_WEIGHTS | "" (empty string: equal weights across all active engines) | Optional JSON object. Keys are engine names (chatgpt, perplexity, google_aio, gemini); values must be positive numbers and are renormalized to sum to 1.0 automatically. When empty or absent, all active engines receive equal weight. |

Example custom configuration that emphasizes ChatGPT and Perplexity:

```json
{"chatgpt": 0.35, "perplexity": 0.35, "google_aio": 0.15, "gemini": 0.15}
```

How dormant engines work: An engine is dormant when its credential env vars are absent or empty. Specifically: ChatGPT requires OPENAI_API_KEY, Perplexity requires PERPLEXITY_API_KEY, Google AIO requires DATAFORSEO_LOGIN + DATAFORSEO_PASSWORD, and Gemini requires GOOGLE_GEMINI_API_KEY. Dormant engines are excluded from build_adapters() and their weights are redistributed proportionally across active engines by _weighted_composite(). You do not need to update ENGINE_WEIGHTS when activating a new engine — its configured weight (or equal share if not explicitly set) is automatically included once the credentials are present.

Bad weight rejection: If ENGINE_WEIGHTS contains invalid JSON or a non-positive value, parse_engine_weights() logs a warning and falls back to equal weights. The app will not crash.
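
A minimal sketch of the parse-and-renormalize contract described above. The real parse_engine_weights() and _weighted_composite() live in the backend, so treat this as an approximation of their behavior, not the implementation:

```python
import json

ALL_ENGINES = ("chatgpt", "perplexity", "google_aio", "gemini")

def parse_engine_weights(raw: str) -> dict[str, float]:
    """Invalid JSON or non-positive values fall back to equal weights;
    whatever survives is renormalized to sum to 1.0."""
    try:
        weights = json.loads(raw) if raw else {}
        if not all(isinstance(v, (int, float)) and v > 0 for v in weights.values()):
            raise ValueError("non-positive weight")
    except (ValueError, AttributeError):
        weights = {}  # the real code logs a warning here
    if not weights:
        weights = {e: 1.0 for e in ALL_ENGINES}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}

def active_weights(weights: dict[str, float], active: set[str]) -> dict[str, float]:
    """Drop dormant engines and redistribute their weight proportionally."""
    kept = {e: w for e, w in weights.items() if e in active}
    total = sum(kept.values()) or 1.0
    return {e: w / total for e, w in kept.items()}
```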

Overriding at runtime: ENGINE_WEIGHTS can also be set via the admin Settings page (DB-overlay, category: visibility), which takes effect without a redeploy.


Redis (Upstash) — Planned, Not Yet Wired

Upstash Redis (free tier, serverless) was decided in Phase 9b as the target Redis provider, but Redis is not yet wired into the application. Currently:

  • Rate limiting (slowapi) uses in-memory storage (Limiter(key_func=get_remote_address) with no storage_uri). Limits reset on each deploy and are not shared across multiple Railway instances.
  • Background work uses asyncio.create_task for pipeline runs, revision execution, visibility scans, and readiness checks. There is no Redis-backed task queue or pipeline state cache.

When Redis is wired (future): Set REDIS_URL to an Upstash connection string (rediss://default:[password]@[host].upstash.io:6379) and update the Limiter initialization to use Redis-backed storage. Until then, REDIS_URL is not read by any application code.

For production with multiple Railway instances: Be aware that in-memory rate limits are per-instance. A user hitting different instances can exceed the intended rate limit. This is acceptable for low-traffic launch but should be addressed before scaling.
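
A toy illustration (unrelated to slowapi's internals) of why per-instance in-memory counters under-enforce the limit: two "instances" each allow 5 requests per window, so a client whose traffic alternates between them gets 10 through.

```python
class InMemoryLimiter:
    """Toy fixed-window limiter; the counter lives and dies with the process."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts: dict[str, int] = {}

    def allow(self, client_ip: str) -> bool:
        n = self.counts.get(client_ip, 0)
        if n >= self.limit:
            return False
        self.counts[client_ip] = n + 1
        return True

instance_a, instance_b = InMemoryLimiter(5), InMemoryLimiter(5)
allowed = sum(
    (instance_a if i % 2 == 0 else instance_b).allow("203.0.113.7")
    for i in range(10)
)
print(allowed)  # 10: double the intended 5, because each instance counts separately
```

A shared Redis backend would give both instances one counter, restoring the intended limit.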


Quick Command Reference

```bash
# === Railway (Backend) ===
railway logs                        # Stream live logs
railway logs --tail 200             # Recent logs
railway up                          # Manual deploy from local
railway variables                   # List env vars
railway variables set KEY=value     # Set env var (triggers redeploy)
railway run alembic upgrade head    # Apply migrations via Railway env
railway run alembic current         # Check current migration state
railway run alembic history         # View migration history
railway run alembic downgrade -1    # Rollback last migration

# === Vercel (Frontend) ===
vercel                              # Deploy to preview
vercel --prod                       # Deploy to production
vercel ls                           # List deployments
vercel env add KEY                  # Add env var (interactive)

# === Alembic (Local) ===
cd aeo_bunny
alembic revision --autogenerate -m "description"  # Generate migration
alembic revision -m "description"                  # Empty migration
alembic upgrade head                               # Apply all
alembic downgrade -1                               # Rollback one
alembic current                                    # Show current
alembic history                                    # Show history

# === Local Dev ===
cd aeo_bunny
docker compose up -d                # Start local Postgres (port 5433)
docker compose down                 # Stop local Postgres
python -m pytest                    # Run backend tests
cd ../portal && npm run dev         # Start frontend dev server
cd ../portal && npm run lint        # Lint frontend
```