Incident Runbook

Troubleshooting recipes for the most common production issues. Organized by symptom so you can find the right fix fast.

Audience: Operators (non-technical) and the founder (technical escalation).


How to Access Railway Logs

Most incidents require checking logs. Here is how to get to them:

  1. Go to railway.com and open the AEO Bunny project.
  2. Click the backend service (the FastAPI app).
  3. Go to Deployments and click the active (green) deployment.
  4. Click View Logs to see live output. Scroll up for recent history.

Or from the terminal:

railway logs --tail 200

Logs are your first stop for any issue marked "Technical."


Incident 1: Customer Can't Log In

Severity: Medium
Who can fix: Operator (first checks) / Technical (if Supabase issue)

Steps:

  1. Ask the customer to confirm they are using the correct email address.
  2. Check the Supabase dashboard (Authentication > Users). Search for the customer's email.
  3. If the user does not exist in Supabase Auth, the purchase webhook may not have fired. Check Incident 9 (GHL Webhook Not Firing).
  4. If the user exists in Supabase Auth, check the app users table in the Supabase SQL Editor:
    SELECT id, supabase_id, email, role, is_active
    FROM users
    WHERE email = 'customer@example.com';
    
  5. If no row exists, the onboarding step was not completed. The customer needs to complete the onboarding form at /onboard?email=....
  6. If is_active is false, the account was deactivated. Set it to true if appropriate.
  7. If the supabase_id does not match the Supabase Auth UUID, the records are out of sync. Update the supabase_id column to match.
  8. Check that Supabase redirect URLs include the production portal domain. In the Supabase dashboard, go to Authentication > URL Configuration and confirm these are listed:
    - https://portal.aireadyplumber.com/**
    - https://portal.aireadyplumber.com/login
    - https://portal.aireadyplumber.com/onboard
    - https://portal.aireadyplumber.com/reset-password
  9. If the customer sees "Invalid authentication token," their JWT may be expired. Ask them to log out and log back in. If it persists, check Railway logs for "JWKS lookup failed" errors, which indicate the backend cannot reach Supabase's JWKS endpoint (network issue).
  10. If still broken, escalate to the founder with the customer's email and the error message they see.
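
The record checks in this incident follow a fixed order, which can be sketched as a small decision function. This is a minimal illustration only: AppUserRow and diagnose_login are hypothetical names, not part of the backend, and the returned strings are just labels for the runbook steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppUserRow:
    """Hypothetical mirror of the columns selected from the users table."""
    supabase_id: str
    is_active: bool


def diagnose_login(auth_user_id: Optional[str], app_row: Optional[AppUserRow]) -> str:
    """Return the next runbook action, checking in the same order as the steps."""
    if auth_user_id is None:
        # No Supabase Auth user: the purchase webhook may not have fired.
        return "no_auth_user: check the purchase webhook (Incident 9)"
    if app_row is None:
        # Auth user exists but onboarding never created the app row.
        return "no_app_row: customer must complete onboarding"
    if not app_row.is_active:
        return "inactive: set is_active to true if appropriate"
    if app_row.supabase_id != auth_user_id:
        return "out_of_sync: update supabase_id to the Supabase Auth UUID"
    # Records look fine, so continue with redirect URLs and token checks.
    return "records_ok: check redirect URLs, JWT expiry, and JWKS errors"
```

For example, diagnose_login("abc", AppUserRow("xyz", True)) reports the out-of-sync case.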

Incident 2: Customer Didn't Receive an Email

Severity: Low
Who can fix: Operator

Steps:

  1. Ask the customer to check their spam/junk folder. GHL-routed emails often land there on first contact.
  2. Check the GHL webhook delivery logs in GoHighLevel:
    - Go to the automation that handles the relevant event (e.g., password_reset_requested, approval_needed).
    - Confirm the automation is active (not paused or draft).
    - Check the execution history for the customer's contact. Look for failed deliveries.
  3. Check Railway logs for the specific webhook event:
    - Search for "GHL webhook sent: password_reset_requested" (or whatever event the customer expected).
    - If you see "GHL webhook failed for event ... (non-fatal)", the backend tried to send but GHL rejected it. Check GHL_WEBHOOK_URL in Railway env vars.
    - If you see "GHL webhook URL not configured, skipping event", the GHL_WEBHOOK_URL environment variable is not set.
  4. For password reset specifically:
    - Confirm PORTAL_BASE_URL is set correctly in Railway (should be https://portal.aireadyplumber.com).
    - The reset link is generated by the backend and sent via GHL webhook, not via Supabase email templates.
  5. If GHL automations are correctly configured and the backend shows successful webhook sends, the issue is on the GHL email delivery side. Check the customer's GHL contact record for email delivery status.

Incident 3: Pipeline Stuck at a Quality Gate

Severity: Medium
Who can fix: Operator

Steps:

  1. Open the admin dashboard and go to the Approvals page. Look for projects showing "Paused" status.
  2. Identify which gate the pipeline is paused at. Common gates:
    - gate_bi_review -- BI research needs operator review (Phase A).
    - gate_matrix_review -- Content strategy matrix needs operator review (Phase A).
    - gate_batch_article_review -- Batch articles need review. Batch 1 goes to operator first; batches 2-5 go to customer.
    - gate_batch_html_review -- Assembled HTML pages need review. Same routing as article review.
    - gate_batch_deploy_confirm -- Batch is ready to ship and awaiting deployment confirmation.
  3. Click into the project and review the content at the gate.
  4. Click Approve to resume the pipeline, or Reject with a note explaining what needs to change.
  5. If the gate is a customer gate (batches 2-5), the customer must take action from their portal. Contact them via GHL to remind them there is content awaiting their review.
  6. This is expected behavior. The pipeline is designed to pause at quality gates for human review. A project sitting at a gate is not broken -- it is waiting for someone to act.
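
The gate routing described above (who needs to act first at each gate) can be written down as a lookup. This is a sketch built only from the gate names and routing stated in this runbook; reviewer_for_gate is a hypothetical helper, and the deploy-confirm gate is left as "unknown" because its reviewer is not stated here.

```python
from typing import Optional

# Gates that always wait on the operator (Phase A).
OPERATOR_ONLY_GATES = {"gate_bi_review", "gate_matrix_review"}
# Gates whose reviewer depends on the batch number.
BATCH_REVIEW_GATES = {"gate_batch_article_review", "gate_batch_html_review"}


def reviewer_for_gate(gate: str, batch_number: Optional[int] = None) -> str:
    """Return who must act first at a gate: 'operator', 'customer', or 'unknown'."""
    if gate in OPERATOR_ONLY_GATES:
        return "operator"
    if gate in BATCH_REVIEW_GATES:
        if batch_number is None:
            raise ValueError("batch review gates need a batch number")
        # Batch 1 goes to the operator first; batches 2-5 go to the customer.
        return "operator" if batch_number == 1 else "customer"
    # Routing for other gates is not specified in this runbook.
    return "unknown"
```

If the result is "customer", the fix is a reminder via GHL rather than an operator approval.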

Incident 4: Pipeline Failed Mid-Run

Severity: High
Who can fix: Technical (founder)

Steps:

  1. Check Railway logs for the error. Search for "Pipeline failed for location". The log will include the traceback.
  2. Check the pipeline_runs table for the error message:
    SELECT id, location_id, status, current_step, current_batch_number,
           error_message, attempts
    FROM pipeline_runs
    WHERE location_id = 'LOCATION-UUID-HERE';
    
  3. Identify the failure cause from the error message. Common causes:
    | Error pattern | Cause | Fix |
    |---|---|---|
    | anthropic.RateLimitError | Anthropic API rate limit hit | Wait 1-2 minutes, then retry. If persistent, check API tier/usage. |
    | asyncpg.ConnectionDoesNotExist or connection is closed | Supabase pooler dropped the connection | Retry. If frequent, check Supabase pooler health and connection limits. |
    | ValidationError / pydantic | Agent returned unexpected output format | Check the prompt template. May need a retry or prompt fix. |
    | Article writing failed for categories | One or more parallel article writes failed | Check which category failed in the traceback. Usually a rate limit or validation error. |
    | No DataCard found or No BI brief | Data prerequisites missing | The BI step may not have completed. Check if the DataCard exists. |
    | httpx.ConnectTimeout | External API (Anthropic, OpenAI, etc.) unreachable | Check service status pages. Retry when service recovers. |

  4. To retry the pipeline via the admin API:
    curl -X POST https://api.aireadyplumber.com/api/v1/projects/LOCATION-UUID/retry \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"

    Or click the Retry button on the project detail page in the admin dashboard.

    The pipeline will resume from where it left off (crash recovery). It will not re-do completed steps. The maximum number of retry attempts is controlled by MAX_PIPELINE_ATTEMPTS (default: 3).

  5. If the batch itself failed, the batches table will show status = "failed" with failed_at_status indicating where it was when the failure occurred. The retry remaps the batch status for recovery: writing is restored to pending (re-runs from scratch), and building_html is restored to article_review (re-runs from the review gate). Other statuses are restored as-is.

  6. If the error is a code bug or prompt issue that cannot be fixed by retrying, escalate and fix the root cause before retrying.
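
To predict where a failed batch will restart, the status remapping described above can be sketched as a dictionary lookup. This is an illustration of the stated rule, not the backend's code; RETRY_STATUS_REMAP and restore_batch_status are hypothetical names.

```python
# Remapping as described in this runbook; any status not listed is restored unchanged.
RETRY_STATUS_REMAP = {
    "writing": "pending",               # re-runs the batch from scratch
    "building_html": "article_review",  # re-runs from the review gate
}


def restore_batch_status(failed_at_status: str) -> str:
    """Return the status a failed batch is expected to have after a retry."""
    return RETRY_STATUS_REMAP.get(failed_at_status, failed_at_status)
```

Reading failed_at_status from the batches table and passing it through this mapping tells you which step the retry will repeat.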


Incident 5: Visibility Scan Shows 0 After Deployment

Severity: Medium
Who can fix: Technical (founder)

Steps:

  1. Confirm the deployment record reached "confirmed" status:
    SELECT id, status, hub_page_url, visibility_score_id
    FROM deployments
    WHERE location_id = 'LOCATION-UUID-HERE'
    ORDER BY created_at DESC
    LIMIT 1;

    - If status is "pending", the customer has not confirmed deployment yet.
    - If status is "measurement_failed", the scan ran but errored (see step 3).
    - If status is "confirmed" with no visibility_score_id, the background task may not have started.

  2. Verify the API keys are valid in Railway environment variables:
    - OPENAI_API_KEY -- required for ChatGPT visibility checks.
    - PERPLEXITY_API_KEY -- required for Perplexity visibility checks.

    Test them quickly:

    # OpenAI
    curl https://api.openai.com/v1/models -H "Authorization: Bearer sk-..."
    # Perplexity
    curl https://api.perplexity.ai/chat/completions \
      -H "Authorization: Bearer pplx-..." \
      -H "Content-Type: application/json" \
      -d '{"model":"sonar","messages":[{"role":"user","content":"test"}]}'

  3. Check Railway logs for "Visibility scan failed for deployment". The traceback will indicate the specific failure:
    - No visibility adapters configured -- API keys are missing or empty.
    - No BI brief found -- The DataCard has no BI brief; the BI step may not have completed.
    - No test prompts in BI brief -- The BI Agent did not generate test prompts.
    - Rate limit or timeout errors from OpenAI/Perplexity -- wait and re-trigger.

  4. Verify the hub_page_url is publicly accessible. The visibility engines need to be able to find the business online. Try loading the URL in a browser.

  5. To re-trigger measurement (rate-limited to 1/hour per customer):

    curl -X POST https://api.aireadyplumber.com/api/v1/portal/my-project/measure-visibility \
      -H "Authorization: Bearer CUSTOMER-TOKEN"

    Or the customer can click "Measure Visibility" in their portal.


Incident 6: Visibility Scan Stuck in "Measuring"

Severity: Low
Who can fix: Technical (founder)

Steps:

  1. Check the deployment status:

    SELECT id, status, created_at, updated_at
    FROM deployments
    WHERE location_id = 'LOCATION-UUID-HERE'
    ORDER BY created_at DESC
    LIMIT 1;
    
    If status = "measuring" and updated_at is more than 10 minutes ago, the background task likely timed out or crashed silently.

  2. Check Railway logs for "Visibility scan failed" or "asyncio.TimeoutError". The scan has a 10-minute timeout (asyncio.wait_for(timeout=600.0)).

  3. Manual resolution: Reset the deployment status and re-trigger:

    UPDATE deployments
    SET status = 'measurement_failed'
    WHERE id = 'DEPLOYMENT-UUID-HERE'
    AND status = 'measuring';
    
    Then trigger a new measurement via the customer portal or API (see Incident 5, step 5).

  4. If this happens repeatedly, check:
    - Are the OpenAI/Perplexity APIs responding slowly? The per-adapter timeout may need adjustment.
    - Is the backend under heavy load? Multiple concurrent visibility scans are guarded by a semaphore, but extreme load could cause queuing.

Incident 7: Customer Sees Wrong Content or Missing Images

Severity: Medium
Who can fix: Technical (founder)

Steps:

  1. Identify which article or page has the issue. Get the article slug and location ID from the customer or admin dashboard.

  2. Check the R2 bucket for the correct files:
    - HTML files are stored at: {location_id}/html/{slug}.html
    - Images are stored at: {location_id}/images/{filename}
    - Use the R2 dashboard (Cloudflare) or the AWS CLI with R2 credentials to list/view files:

    aws s3 ls s3://aeo-bunny-images/{location_id}/html/ \
      --endpoint-url https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com

  3. Check the R2_PUBLIC_URL environment variable in Railway. The HTML output references images using this base URL. If it is wrong or expired, images will 404.

  4. Open the HTML file and check the image src attributes. They should point to {R2_PUBLIC_URL}/{location_id}/images/{filename}. If the paths are wrong, the HTML assembly step may have used incorrect configuration.

  5. To fix: You may need to re-run the HTML assembly step for the affected batch. This requires:
    - Setting the batch status back to the pre-HTML-assembly state.
    - Re-triggering the pipeline to rebuild HTML for that batch.
    - This is a technical operation -- work with the codebase directly.

  6. If a single image is missing from R2 but the HTML is correct, re-upload the image to the correct key in R2.
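
Checking image src attributes by eye is error-prone on long pages. The sketch below builds the expected key and URL from the layouts given above and lists any img src that does not start with the expected prefix. The helper names are hypothetical, and it assumes you have already downloaded the HTML file as a string.

```python
from html.parser import HTMLParser
from typing import List


def html_key(location_id: str, slug: str) -> str:
    """R2 object key for an assembled page, per the layout in this runbook."""
    return f"{location_id}/html/{slug}.html"


def image_url(r2_public_url: str, location_id: str, filename: str) -> str:
    """Expected public URL for an image referenced from the HTML."""
    return f"{r2_public_url.rstrip('/')}/{location_id}/images/{filename}"


class _ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.srcs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def bad_image_srcs(html: str, r2_public_url: str, location_id: str) -> List[str]:
    """Return the img src values that do not point under the expected prefix."""
    prefix = f"{r2_public_url.rstrip('/')}/{location_id}/images/"
    collector = _ImgSrcCollector()
    collector.feed(html)
    return [src for src in collector.srcs if not src.startswith(prefix)]
```

An empty list from bad_image_srcs means the paths are right and a 404 is more likely a missing object in R2 than an assembly problem.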


Incident 8: ZIP Download Fails or Is Empty

Severity: Medium
Who can fix: Technical (founder)

Steps:

  1. Check which batch the customer is trying to download. Get the batch number and location ID.

  2. Verify articles exist for that batch:

    SELECT id, title, batch_number, body_html IS NOT NULL AS has_html
    FROM articles
    WHERE location_id = 'LOCATION-UUID-HERE'
    AND batch_number = BATCH_NUMBER;
    

    - If no articles exist for the batch, the writing step did not complete.
    - If has_html is false for some articles, the HTML assembly step did not finish.

  3. Check R2 for the ZIP file:
    - ZIP files are stored at: {location_id}/batch-{batch_number}.zip
    - Verify it exists and is not 0 bytes.

  4. Check Railway logs for errors during ZIP generation. Search for the location ID around the time the batch was built.

  5. Check R2 connectivity from the backend:
    - Look for R2 or CloudStorageClient errors in Railway logs.
    - Verify R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, and R2_SECRET_ACCESS_KEY are set and valid.

  6. To re-trigger ZIP generation: The ZIP is built as part of the batch pipeline flow. If the batch shows status = "ready_to_ship" but the ZIP is missing or broken, you may need to manually rebuild it. Check the batches table for the zip_url value.

  7. If R2 is unreachable entirely (all downloads failing), check:
    - Cloudflare R2 service status.
    - Whether the R2 API token has expired or been revoked.
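
Once the ZIP has been downloaded from R2, "exists and is not 0 bytes" can be extended to a basic integrity check with Python's standard zipfile module. This is a sketch under the assumption that the archive bytes are already in memory; zip_key and zip_problem are hypothetical helper names.

```python
import io
import zipfile
from typing import Optional


def zip_key(location_id: str, batch_number: int) -> str:
    """R2 object key for a batch ZIP, per the layout in this runbook."""
    return f"{location_id}/batch-{batch_number}.zip"


def zip_problem(data: bytes) -> Optional[str]:
    """Return a short description of what is wrong with the archive, or None."""
    if len(data) == 0:
        return "empty file (0 bytes)"
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "not a valid ZIP archive"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if not archive.namelist():
            return "archive contains no files"
        # testzip() returns the first member with a bad CRC, or None if all pass.
        bad_member = archive.testzip()
        if bad_member is not None:
            return f"corrupt member: {bad_member}"
    return None
```

A result of None only says the archive is structurally sound; whether it contains every article in the batch still needs the articles query from step 2.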

Incident 9: GHL Webhook Not Firing

Severity: Low
Who can fix: Operator (first checks) / Technical (if config issue)

Steps:

  1. Check Railway logs for GHL webhook messages:
    - "GHL webhook sent: EVENT_NAME (audience=..., status=200)" -- webhook sent successfully.
    - "GHL webhook failed for event EVENT_NAME (non-fatal)" -- webhook attempted but failed. The pipeline continues regardless.
    - "GHL webhook URL not configured, skipping event" -- GHL_WEBHOOK_URL is not set.

  2. Verify environment variables in Railway:
    - GHL_WEBHOOK_URL -- must be set to the GHL webhook endpoint (e.g., https://services.leadconnectorhq.com/hooks/...).
    - Inbound webhook auth now uses Ed25519 signature verification via X-GHL-Signature (primary). GHL_WEBHOOK_SECRET is legacy fallback only.

  3. Check the GHL side:
    - Is the webhook endpoint still active in GHL? Endpoints can be disabled or deleted.
    - Is the automation attached to the webhook active?
    - Check GHL's webhook logs for incoming payloads.

  4. GHL webhook failures are non-fatal. The pipeline, deployments, and all other operations continue normally even if webhooks fail. Webhooks are notification-only. The backend catches all webhook exceptions and logs them as warnings.

  5. If webhooks were not firing for a period and you need to resend notifications, you can check the notifications table for the events that should have been sent and manually trigger the corresponding GHL automations.
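
When scanning a large log export, the three GHL messages listed in step 1 can be classified with a couple of regular expressions. This is a sketch: the patterns are copied from the message shapes quoted in this runbook, the exact audience format is an assumption (a value without commas), and classify_ghl_line is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Patterns derived from the log messages quoted in this runbook.
SENT = re.compile(
    r"GHL webhook sent: (?P<event>\S+) "
    r"\(audience=(?P<audience>[^,]*), status=(?P<status>\d+)\)"
)
FAILED = re.compile(r"GHL webhook failed for event (?P<event>\S+) \(non-fatal\)")
NOT_CONFIGURED = "GHL webhook URL not configured, skipping event"


def classify_ghl_line(line: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (state, event_name) for a GHL log line, or None if unrelated."""
    match = SENT.search(line)
    if match:
        return ("sent", match.group("event"))
    match = FAILED.search(line)
    if match:
        return ("failed", match.group("event"))
    if NOT_CONFIGURED in line:
        # This message carries no event name.
        return ("not_configured", None)
    return None
```

Counting the states over the output of railway logs gives a quick picture of whether the problem is missing configuration, failed sends, or something on the GHL side.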


Incident 10: Admin Dashboard Not Loading

Severity: Medium
Who can fix: Operator (first checks) / Technical (if backend issue)

Steps:

  1. Check Vercel deployment status:
    - Go to vercel.com and open the portal project.
    - Check if the latest deployment succeeded (green checkmark) or failed (red X).
    - If the deployment failed, check the build logs for errors.

  2. Open the browser developer console (F12 or right-click > Inspect > Console tab). Look for:
    - Red error messages, especially CORS errors or "Failed to fetch".
    - 401 Unauthorized -- your session may have expired. Log out and log back in.
    - net::ERR_CONNECTION_REFUSED -- the backend is down.

  3. Verify NEXT_PUBLIC_API_URL in Vercel environment variables points to the correct backend:
    - Should be: https://api.aireadyplumber.com
    - If it points to localhost or a staging URL, the frontend cannot reach the production backend.

  4. Check if the backend health endpoint responds:

    curl https://api.aireadyplumber.com/health

    - If it returns {"status":"ok"}, the backend is alive and the issue is frontend-only.
    - If it times out or returns an error, see Incident 11 (502 Bad Gateway).

  5. Check that NEXT_PUBLIC_SUPABASE_URL and NEXT_PUBLIC_SUPABASE_ANON_KEY in Vercel match the Supabase project. Mismatched keys will cause auth failures.

  6. Try loading the page in an incognito/private window to rule out browser cache or extension issues.

  7. If the dashboard loads but shows no data, the API calls may be failing silently. Check the browser Network tab for failed requests (red entries).


Incident 11: "502 Bad Gateway" on the API

Severity: High
Who can fix: Technical (founder)

Steps:

  1. Check Railway deployment status:
    - Go to railway.com and open the backend service.
    - Is the deployment active (green) or crashed (red)?
    - If crashed, click into the deployment and check the logs for the crash reason.

  2. Check Railway logs for startup errors. Common causes:

    | Error pattern | Cause | Fix |
    |---|---|---|
    | ModuleNotFoundError | Missing dependency | Check requirements.txt or Docker build. Redeploy. |
    | sqlalchemy.exc.ArgumentError: Could not parse | DATABASE_URL is malformed | The code accepts postgres://, postgresql://, and postgresql+asyncpg:// schemes and normalizes them automatically. If this error appears, the URL has a deeper format issue (missing host, bad encoding, etc.). |
    | Connection refused on port 5432 | Cannot reach Supabase DB | Check Supabase status, verify DATABASE_URL. |
    | alembic.util.exc.CommandError | Database schema out of date | Run railway run alembic upgrade head. |
    | RuntimeError: no running event loop | Async setup issue | Usually a code bug. Check recent commits. |
    | Address/port binding error | Railway port configuration | Ensure the app binds to 0.0.0.0:$PORT. |

  3. Run database migrations if needed:

    cd aeo_bunny
    railway run alembic upgrade head

  4. Verify the DATABASE_URL format. The code accepts postgres://, postgresql://, and postgresql+asyncpg:// schemes and normalizes them automatically to postgresql+asyncpg://. Example:

    postgresql+asyncpg://postgres.[ref]:[password]@aws-0-[region].pooler.supabase.com:5432/postgres

    Common mistakes:
    - Using the direct connection string instead of the Session Pooler string.
    - Password containing special characters that are not URL-encoded.

  5. If the service crashed and won't restart:
    - Try a manual redeploy from the Railway dashboard.
    - Check if Railway has a resource limit issue (memory/CPU).
    - Check if a recent code push introduced a breaking change.

  6. If the backend starts but immediately crashes under load, check for:
    - Database connection pool exhaustion (too many concurrent requests).
    - Memory issues from large pipeline runs.
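
The scheme normalization and the password-encoding pitfall mentioned above can be illustrated in a few lines. This is a sketch of the described behavior, not the backend's implementation; normalize_database_url and encode_password are hypothetical names.

```python
from urllib.parse import quote

ASYNC_SCHEME = "postgresql+asyncpg://"
# Schemes the runbook says are accepted.
ACCEPTED_SCHEMES = ("postgresql+asyncpg://", "postgresql://", "postgres://")


def normalize_database_url(url: str) -> str:
    """Rewrite any accepted scheme to postgresql+asyncpg://, leaving the rest untouched."""
    for scheme in ACCEPTED_SCHEMES:
        if url.startswith(scheme):
            return ASYNC_SCHEME + url[len(scheme):]
    raise ValueError("unsupported DATABASE_URL scheme")


def encode_password(password: str) -> str:
    """Percent-encode every reserved character so the password is safe inside a URL."""
    return quote(password, safe="")
```

For instance, encode_password("p@ss:w/rd#1") yields "p%40ss%3Aw%2Frd%231"; an unencoded "@" or ":" in the password is a typical reason the URL fails to parse even though the scheme is fine.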

Incident 12: Pipeline Stuck at Photo Gate

Severity: Medium
Who can fix: Operator (first checks) / Technical (if auto-resume failed)

Steps:

  1. Check the photo count for the customer's location:
    SELECT COUNT(*) FROM images WHERE location_id = 'LOCATION-UUID-HERE';
    
    - If count < 100, the customer has not uploaded enough photos. The pipeline is correctly waiting.
    - If count >= 100, the auto-resume logic should have fired but did not (see step 3).

  2. If count < 100: Contact the customer via GHL and remind them to upload at least 100 photos at /portal/photos. Verify the photos_upload_needed webhook was delivered (check GHL automation history for that contact).

  3. If count >= 100 but pipeline has not resumed: Check the photo_gate_passed flag:

    SELECT photo_gate_passed, status, current_step
    FROM pipeline_runs
    WHERE location_id = 'LOCATION-UUID-HERE';

    If photo_gate_passed = FALSE despite enough photos, the compare-and-set (CAS) auto-resume failed (race condition or backend error during the upload that hit the threshold). Manually set the flag and re-queue:
    UPDATE pipeline_runs
    SET photo_gate_passed = TRUE, status = 'queued', current_step = 'gate_photo_upload'
    WHERE location_id = 'LOCATION-UUID-HERE';

    Then trigger the pipeline via the admin API:
    # Pipeline resumes via the gate approval endpoint:
    curl -X POST https://api.aireadyplumber.com/api/v1/admin/projects/LOCATION-UUID/gate \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"action": "approve"}'

  4. Note: photo_gate_passed = TRUE on the pipeline run means the gate has been cleared and will not re-block batches 2-5. Only batch 1 HTML assembly is held by this gate.
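
The two queries in this incident lead to one of three outcomes, which can be sketched as follows. The function name and return labels are hypothetical; the 100-photo minimum is the gate value stated in this runbook.

```python
PHOTO_GATE_MINIMUM = 100  # minimum photo count stated in this runbook


def photo_gate_action(photo_count: int, photo_gate_passed: bool) -> str:
    """Map the photo count and the photo_gate_passed flag to the runbook outcome."""
    if photo_count < PHOTO_GATE_MINIMUM:
        # Pipeline is correctly waiting: remind the customer via GHL.
        return "waiting_for_photos"
    if not photo_gate_passed:
        # Enough photos but the flag was never set: auto-resume failed, escalate.
        return "auto_resume_failed"
    return "gate_cleared"
```

Only the "auto_resume_failed" outcome calls for the manual UPDATE and the gate approval request.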


Incident 13: Revision Execution Failed

Severity: High
Who can fix: Technical (founder) / Operator (manual approval/rejection)

Steps:

  1. Identify the stuck revision. Check for BatchRevision records in executing status with a stale updated_at:

    SELECT id, batch_id, status, estimated_cost, updated_at
    FROM batch_revisions
    WHERE status = 'executing'
    AND updated_at < NOW() - INTERVAL '30 minutes';
    

  2. Check the revision_instructions table for the failed revision to see individual article-level failures:

    SELECT ri.id, ri.article_id, ri.execution_status, ri.error_detail
    FROM revision_instructions ri
    WHERE ri.batch_revision_id = 'BATCH-REVISION-UUID-HERE'
    AND ri.execution_status = 'failed';
    
    The error_detail field will contain the specific error (rate limit, validation error, DB error).

  3. Check Railway logs around the time updated_at was last updated. Search for "revision" or the batch revision ID to find the traceback.

  4. To retry: Use the admin retry endpoint:

    curl -X POST https://api.aireadyplumber.com/api/v1/admin/revisions/REVISION-UUID/retry \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"
    

  5. If retry is not available or keeps failing: Manually mark the revision as failed to unblock the batch, then let the operator re-approve when ready:

    UPDATE batch_revisions
    SET status = 'failed'
    WHERE id = 'BATCH-REVISION-UUID-HERE'
    AND status = 'executing';
    
    The customer can then re-submit feedback and a new revision cycle can begin.

  6. If a single revision_instruction record failed but others succeeded, the partial failure may leave some articles unrevised. Review those articles individually and decide whether to re-trigger or manually update.


Incident 14: Revision Auto-Approval Bypassed (Cost Over Threshold)

Severity: Low
Who can fix: Operator

Steps:

  1. A BatchRevision enters status = 'pending_approval' even when AUTO_APPROVE_REVISIONS=true. This is expected behavior when the estimated_cost exceeds REVISION_COST_THRESHOLD: it is a cost-control gate, not a bug.

  2. Verify the cost threshold is triggering correctly:

    SELECT id, batch_id, status, estimated_cost, created_at
    FROM batch_revisions
    WHERE status = 'pending_approval'
    ORDER BY created_at DESC
    LIMIT 5;
    
    Compare estimated_cost against the REVISION_COST_THRESHOLD setting (check Railway env vars or the Settings page in the admin dashboard).

  3. Review the revision instructions to understand why the cost estimate is high. A large batch of articles with extensive revision notes will produce a higher estimate.

  4. Operator action: Go to the admin dashboard, open the project, find the pending revision, and either:
    - Approve -- the revision proceeds and articles are rewritten.
    - Reject -- the revision is cancelled. The customer can submit new feedback with a narrower scope.

  5. If the threshold is consistently too low for your typical revision scope, adjust REVISION_COST_THRESHOLD in Railway environment variables or via the Settings page (super_admin only).
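
The cost gate described in this incident amounts to a single comparison. The sketch below is an assumption-laden illustration: initial_revision_status is a hypothetical name, "exceeds" is read as strictly greater than, and the auto-approved case is labeled "approved" here even though the backend may move it straight to another status.

```python
from decimal import Decimal


def initial_revision_status(estimated_cost: Decimal,
                            cost_threshold: Decimal,
                            auto_approve: bool) -> str:
    """Decide whether a new revision waits for an operator or proceeds automatically."""
    if not auto_approve:
        # Assumption: with auto-approval off, every revision waits for an operator.
        return "pending_approval"
    if estimated_cost > cost_threshold:
        # Cost-control gate: auto-approval is bypassed above the threshold.
        return "pending_approval"
    return "approved"
```

Comparing the estimated_cost from the query in step 2 with the configured threshold in this way confirms whether a pending revision is the gate working as designed.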


Incident 15: Readiness Check Stuck in "Running"

Severity: Medium
Who can fix: Technical (founder)

Steps:

  1. Check for readiness scores stuck in running status:

    SELECT id, location_id, trigger, status, created_at, updated_at
    FROM readiness_scores
    WHERE status = 'running'
    AND updated_at < NOW() - INTERVAL '10 minutes';
    

  2. Check Railway logs for "Readiness engine failed for location" or "httpx.ReadTimeout". The engine has per-checker HTTP timeouts. If the Google PageSpeed Insights API is slow or rate-limiting, the speed checker can block progress.

  3. Verify the PAGESPEED_API_KEY environment variable is set and valid in Railway. The speed checker still runs without an API key (using the unauthenticated PageSpeed Insights endpoint), but unauthenticated requests have stricter rate limits and may produce quota errors or timeouts under load.

  4. Check if the target website is blocking the readiness engine's HTTP requests. Some websites block automated HTTP clients, causing crawlability checker timeouts.

  5. Manual resolution: Reset the stuck score to failed and re-trigger:

    UPDATE readiness_scores
    SET status = 'failed', error_message = 'Manual reset: was stuck in running'
    WHERE id = 'SCORE-UUID-HERE'
    AND status = 'running';
    
    Then re-trigger via the admin API:
    curl -X POST https://api.aireadyplumber.com/api/v1/readiness/LOCATION-UUID/check \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"trigger": "intake"}'
    

  6. If readiness checks consistently fail for a specific location, check:
    - Is the website_url on the business record correct and publicly accessible?
    - Is the SSRF validator blocking the URL (e.g., the website resolves to an internal/private IP)?
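
To judge whether an SSRF validator would plausibly reject a site, resolve the hostname yourself (for example with dig or nslookup) and test the resulting IP. The sketch below uses Python's standard ipaddress module; is_blocked_address is a hypothetical helper and the exact rules of the backend's validator are an assumption.

```python
import ipaddress


def is_blocked_address(ip: str) -> bool:
    """True for addresses an SSRF validator would typically refuse to fetch."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private        # e.g. 10.0.0.0/8, 192.168.0.0/16
        or addr.is_loopback    # 127.0.0.0/8, ::1
        or addr.is_link_local  # 169.254.0.0/16, includes cloud metadata endpoints
        or addr.is_reserved
    )
```

If the customer's domain resolves to an address for which this returns True, the website_url is almost certainly misconfigured (staging host, internal DNS) rather than the readiness engine being at fault.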

Incident 16: Readiness Score Shows 0 or Is Unexpectedly Low

Severity: Low
Who can fix: Technical (founder)

Steps:

  1. Check the readiness_checks records for the affected score:
    SELECT rc.category, rc.check_name, rc.status, rc.value, rc.message
    FROM readiness_checks rc
    WHERE rc.score_id = 'SCORE-UUID-HERE';
    
    - If status is 'failed' for all checks, no useful scoring data was collected. The score will be low by design.
    - Look at the message field to understand what the checker flagged.

  2. A score of 0 for a category usually means the checker encountered an exception that was caught and marked the check as failed. Check Railway logs around the time the score was created for errors from crawlability, schema, speed, or structured_data checkers.

  3. Common causes for low scores:
    - Crawlability 0: Website has a robots.txt blocking all crawlers, or the site returned a non-200 status.
    - Schema 0: No JSON-LD or structured data found on the homepage.
    - Speed 0: PageSpeed Insights returned no data (quota exceeded, API key missing, or the URL returned an error).
    - Structured data 0: JSON-LD present but failed the NAP consistency check (business name/address/phone in schema does not match onboarding data).

  4. If the score seems incorrect and the website has been recently updated, re-trigger a fresh check:

    curl -X POST https://api.aireadyplumber.com/api/v1/readiness/LOCATION-UUID/check \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"trigger": "post_deployment"}'


Incident 17: Visibility Adapter Error (Google AIO / Gemini)

Severity: Medium
Who can fix: Technical (founder)

Steps:

  1. Identify which adapter(s) are failing. Check visibility_checks for incomplete results:

    SELECT vc.id, vc.engine, vc.created_at,
           vc.analysis_status, vc.analysis_error
    FROM visibility_checks vc
    LEFT JOIN visibility_check_analyses vca ON vca.check_id = vc.id
    WHERE vc.score_id = 'SCORE-UUID-HERE'
    ORDER BY vc.engine;
    
    Look for analysis_status = 'failed' and read the analysis_error field.

  2. Check Railway logs for adapter-specific errors. Search for "Google AIO query failed" or "Gemini query failed":
    - 401 Unauthorized from DataForSEO -- DATAFORSEO_LOGIN or DATAFORSEO_PASSWORD is wrong or the account is suspended.
    - 403 Forbidden from Google -- GOOGLE_GEMINI_API_KEY is invalid or the Generative Language API is not enabled in the GCP project.
    - 429 Too Many Requests -- rate limit hit. Wait and re-trigger.
    - asyncio.TimeoutError -- adapter timed out. The DataForSEO and Gemini adapters each have a timeout parameter (default 30s). Under high load, this may need increasing.
    - Tasks returned status_code != 20000 -- DataForSEO task-level error. The query may be malformed or the SERP endpoint quota is exhausted.

  3. Verify the relevant API keys are set and non-empty in Railway:
    - DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD -- required for Google AIO adapter.
    - GOOGLE_GEMINI_API_KEY -- required for Gemini adapter.
    - Note: These adapters are dormant by default. They only activate when credentials are present. If you see them producing errors, the keys were recently added but are invalid.

  4. Check the engine weights configuration. If an adapter is activated by accident (credentials set but weights left at 0), it will still run and generate errors. Verify ENGINE_WEIGHTS in Railway matches the intended active engines:

    {"chatgpt": 0.5, "perplexity": 0.5, "google_aio": 0.0, "gemini": 0.0}

    Set an engine's weight to 0.0 to effectively disable it from scoring (it will still run but won't affect the composite score).

  5. To re-trigger a visibility scan after fixing credentials:

    curl -X POST https://api.aireadyplumber.com/api/v1/portal/my-project/measure-visibility \
      -H "Authorization: Bearer CUSTOMER-TOKEN"

    Or use the admin reanalyze endpoint:
    curl -X POST https://api.aireadyplumber.com/api/v1/admin/visibility/SCORE-UUID/reanalyze \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"

  6. If adapters are consistently failing and you need to exclude them temporarily, set their weight to 0.0 in ENGINE_WEIGHTS and deploy. The remaining active adapters will continue scoring normally with renormalized weights.
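
To see what "renormalized weights" means in practice, the sketch below drops zero-weight engines and rescales the rest so they sum to 1. It assumes simple proportional rescaling, which the runbook does not spell out; renormalize_weights is a hypothetical helper, and the example weights are illustrative rather than the production values.

```python
import json
from typing import Dict


def renormalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Keep engines with a positive weight and rescale them to sum to 1.0."""
    active = {engine: w for engine, w in weights.items() if w > 0}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one engine needs a positive weight")
    return {engine: w / total for engine, w in active.items()}


# Illustrative configuration: google_aio is being excluded by setting it to 0.0.
example = json.loads(
    '{"chatgpt": 0.5, "perplexity": 0.3, "google_aio": 0.0, "gemini": 0.0}'
)
effective = renormalize_weights(example)
```

With these illustrative numbers, chatgpt ends up at 0.5 / 0.8 = 0.625 and perplexity at 0.375, so the composite score still uses a full weight of 1.0.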


Quick Reference: Escalation Paths

| Incident | Operator can handle? | Escalate to founder when? |
|---|---|---|
| 1. Customer can't log in | Yes (first 3 steps) | Supabase config or JWKS issues |
| 2. Missing email | Yes | GHL webhook URL misconfigured |
| 3. Stuck at gate | Yes | Never -- this is the operator's job |
| 4. Pipeline failed | No | Always |
| 5. Visibility shows 0 | No | Always |
| 6. Scan stuck measuring | No | Always |
| 7. Wrong content/images | No | Always |
| 8. ZIP download fails | No | Always |
| 9. GHL webhook not firing | Yes (check GHL side) | Backend config issues |
| 10. Dashboard not loading | Yes (first 2 steps) | Backend down or config wrong |
| 11. 502 Bad Gateway | No | Always |
| 12. Stuck at photo gate | Yes (check count + GHL) | Auto-resume failed (>= 100 photos but not resumed) |
| 13. Revision execution failed | Yes (approve/reject) | Retry endpoint fails, code-level error |
| 14. Revision cost over threshold | Yes (approve or reject) | Never -- this is the operator's job |
| 15. Readiness check stuck | No | Always |
| 16. Readiness score low/zero | No | Always |
| 17. Visibility adapter error | No | Always |

Quick Reference: Key Database Tables for Debugging

| Table | What it tells you |
|---|---|
| users | User accounts, roles, active status, Supabase ID linkage |
| pipeline_runs | Pipeline status, current step, current batch, error messages, attempt count, photo_gate_passed |
| batches | Per-batch status, failed_at_status for recovery, zip_url |
| articles | Article content, batch assignment, review status, HTML output |
| article_versions | Snapshot history for each article; created on every revision cycle |
| images | Customer-uploaded photos per location, quality scores, upload status |
| deployments | Deployment status machine, hub_page_url, visibility score linkage |
| data_cards | BI brief, client dossier, test prompts |
| pending_leads | Pre-purchase records from GHL webhook |
| notifications | In-app notification history |
| batch_revisions | Per-batch revision requests, status (pending_approval/approved/executing/failed), estimated_cost |
| revision_instructions | Per-article revision tasks within a batch revision; execution_status and error_detail |
| readiness_scores | Per-location readiness score runs (intake or post_deployment), composite score, status |
| readiness_checks | Individual category checks (crawlability, schema, speed, structured_data) within a score run |
| visibility_check_analyses | Deep analysis results for individual visibility checks (coverage_score, quality_score, sentiment, citation_context) |

Quick Reference: Key API Endpoints for Debugging

| Endpoint | Method | Auth | Purpose |
|---|---|---|---|
| /health | GET | None | Backend health check |
| /api/v1/me | GET | Any | Current user info and role routing |
| /api/v1/admin/projects/{id}/gate | POST | Admin | Approve or reject a quality gate (also resumes pipeline) |
| /api/v1/projects/{location_id}/retry | POST | Admin | Retry a failed pipeline |
| /api/v1/admin/projects/{id}/batches | GET | Admin | List all batches and their statuses for a project |
| /api/v1/admin/revisions/{revision_id}/retry | POST | Admin | Retry a stuck revision execution |
| /api/v1/admin/visibility/{score_id}/reanalyze | POST | Admin | Re-run deep analysis on a visibility score |
| /api/v1/portal/my-project/measure-visibility | POST | Customer | Trigger on-demand visibility scan |
| /api/v1/portal/my-project/batches/{batch_number}/confirm-deployment | POST | Customer | Confirm batch deployment with checklist |
| /api/v1/readiness/{location_id}/check | POST | Admin | Trigger a readiness check (intake or post_deployment) |
| /api/v1/readiness/{location_id}/scores/latest | GET | Authenticated (location access) | Latest readiness scores for a location |
| /api/v1/portal/my-project/photos | GET | Customer | Paginated photo list for a location |
| /api/v1/portal/my-project/photos | POST | Customer | Upload photos (up to 300 total, 100 minimum gate) |
| /api/v1/admin/metrics | GET | Admin | System metrics overview |