Incident Runbook
Troubleshooting recipes for the most common production issues. Organized by symptom so you can find the right fix fast.
Audience: Operators (non-technical) and the founder (technical escalation).
How to Access Railway Logs
Most incidents require checking logs. Here is how to get to them:
- Go to railway.com and open the AEO Bunny project.
- Click the backend service (the FastAPI app).
- Go to Deployments and click the active (green) deployment.
- Click View Logs to see live output. Scroll up for recent history.
Or from the terminal:
railway logs --tail 200
Logs are your first stop for any issue marked "Technical."
Incident 1: Customer Can't Log In
Severity: Medium Who can fix: Operator (first checks) / Technical (if Supabase issue)
Steps:
- Ask the customer to confirm they are using the correct email address.
- Check the Supabase dashboard (Authentication > Users). Search for the customer's email.
- If the user does not exist in Supabase Auth, the purchase webhook may not have fired. Check Incident 9 (GHL Webhook Not Firing).
- If the user exists in Supabase Auth, check the app users table in the Supabase SQL Editor:
    SELECT id, supabase_id, email, role, is_active FROM users WHERE email = 'customer@example.com';
  - If no row exists, the onboarding step was not completed. The customer needs to complete the onboarding form at /onboard?email=....
  - If is_active is false, the account was deactivated. Set it to true if appropriate.
  - If the supabase_id does not match the Supabase Auth UUID, the records are out of sync. Update the supabase_id column to match.
- Check that Supabase redirect URLs include the production portal domain. In the Supabase dashboard, go to Authentication > URL Configuration and confirm these are listed:
  - https://portal.aireadyplumber.com/**
  - https://portal.aireadyplumber.com/login
  - https://portal.aireadyplumber.com/onboard
  - https://portal.aireadyplumber.com/reset-password
- If the customer sees "Invalid authentication token," their JWT may be expired. Ask them to log out and log back in. If it persists, check Railway logs for "JWKS lookup failed" errors, which indicate the backend cannot reach Supabase's JWKS endpoint (network issue).
- If still broken, escalate to the founder with the customer's email and the error message they see.
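The decision order in the steps above can be sketched as a small helper. This is only an illustration: the function name, its arguments, and the returned strings are hypothetical, and the row dictionary simply mirrors the columns selected in the SQL query.

```python
from typing import Optional


def diagnose_login(auth_user_id: Optional[str], app_row: Optional[dict]) -> str:
    """Return the runbook's next action for a login problem.

    auth_user_id: UUID from Supabase Auth, or None if the user is not there.
    app_row: the matching row from the users table, or None if it is missing.
    """
    if auth_user_id is None:
        return "no auth user: see Incident 9 (purchase webhook may not have fired)"
    if app_row is None:
        return "no users row: customer must complete the onboarding form"
    if not app_row["is_active"]:
        return "account deactivated: set is_active to true if appropriate"
    if app_row["supabase_id"] != auth_user_id:
        return "records out of sync: update supabase_id to match the Auth UUID"
    return "records consistent: check redirect URLs, then JWT/JWKS errors"
```

The checks run in the same order as the steps, so the first mismatch found is the one to act on.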
Incident 2: Customer Didn't Receive an Email
Severity: Low Who can fix: Operator
Steps:
- Ask the customer to check their spam/junk folder. GHL-routed emails often land there on first contact.
- Check the GHL webhook delivery logs in GoHighLevel:
  - Go to the automation that handles the relevant event (e.g., password_reset_requested, approval_needed).
  - Confirm the automation is active (not paused or draft).
  - Check the execution history for the customer's contact. Look for failed deliveries.
- Check Railway logs for the specific webhook event:
  - Search for "GHL webhook sent: password_reset_requested" (or whatever event the customer expected).
  - If you see "GHL webhook failed for event ... (non-fatal)", the backend tried to send but GHL rejected it. Check GHL_WEBHOOK_URL in Railway env vars.
  - If you see "GHL webhook URL not configured, skipping event", the GHL_WEBHOOK_URL environment variable is not set.
- For password reset specifically:
  - Confirm PORTAL_BASE_URL is set correctly in Railway (should be https://portal.aireadyplumber.com).
  - The reset link is generated by the backend and sent via GHL webhook, not via Supabase email templates.
- If GHL automations are correctly configured and the backend shows successful webhook sends, the issue is on the GHL email delivery side. Check the customer's GHL contact record for email delivery status.
Incident 3: Pipeline Stuck at a Quality Gate
Severity: Medium Who can fix: Operator
Steps:
- Open the admin dashboard and go to the Approvals page. Look for projects showing "Paused" status.
- Identify which gate the pipeline is paused at. Common gates:
  - gate_bi_review -- BI research needs operator review (Phase A).
  - gate_matrix_review -- Content strategy matrix needs operator review (Phase A).
  - gate_batch_article_review -- Batch articles need review. Batch 1 goes to operator first; batches 2-5 go to customer.
  - gate_batch_html_review -- Assembled HTML pages need review. Same routing as article review.
  - gate_batch_deploy_confirm -- Batch is ready to ship and awaiting deployment confirmation.
- Click into the project and review the content at the gate.
- Click Approve to resume the pipeline, or Reject with a note explaining what needs to change.
- If the gate is a customer gate (batches 2-5), the customer must take action from their portal. Contact them via GHL to remind them there is content awaiting their review.
- This is expected behavior. The pipeline is designed to pause at quality gates for human review. A project sitting at a gate is not broken -- it is waiting for someone to act.
Incident 4: Pipeline Failed Mid-Run
Severity: High Who can fix: Technical (founder)
Steps:
- Check Railway logs for the error. Search for "Pipeline failed for location". The log will include the traceback.
- Check the pipeline_runs table for the error message:
    SELECT id, location_id, status, current_step, current_batch_number, error_message, attempts
    FROM pipeline_runs WHERE location_id = 'LOCATION-UUID-HERE';
- Identify the failure cause from the error message. Common causes:
| Error pattern | Cause | Fix |
|---|---|---|
| anthropic.RateLimitError | Anthropic API rate limit hit | Wait 1-2 minutes, then retry. If persistent, check API tier/usage. |
| asyncpg.ConnectionDoesNotExist or connection is closed | Supabase pooler dropped the connection | Retry. If frequent, check Supabase pooler health and connection limits. |
| ValidationError / pydantic | Agent returned unexpected output format | Check the prompt template. May need a retry or prompt fix. |
| Article writing failed for categories | One or more parallel article writes failed | Check which category failed in the traceback. Usually a rate limit or validation error. |
| No DataCard found or No BI brief | Data prerequisites missing | The BI step may not have completed. Check if the DataCard exists. |
| httpx.ConnectTimeout | External API (Anthropic, OpenAI, etc.) unreachable | Check service status pages. Retry when service recovers. |
- To retry the pipeline via the admin API:
    curl -X POST https://api.aireadyplumber.com/api/v1/projects/LOCATION-UUID/retry \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"
  Or click the Retry button on the project detail page in the admin dashboard.
  The pipeline will resume from where it left off (crash recovery). It will not re-do completed steps. The maximum number of retry attempts is controlled by MAX_PIPELINE_ATTEMPTS (default: 3).
- If the batch itself failed, the batches table will show status = "failed" with failed_at_status indicating where it was when the failure occurred. The retry remaps the batch status for recovery: writing is restored to pending (re-runs from scratch), and building_html is restored to article_review (re-runs from the review gate). Other statuses are restored as-is.
- If the error is a code bug or prompt issue that cannot be fixed by retrying, escalate and fix the root cause before retrying.
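For quick triage, the error-pattern table and the batch status remap described above can be sketched as two lookups. This is a hedged illustration, not the backend's code: the names ERROR_CAUSES, RECOVERY_REMAP, classify_pipeline_error, and recovered_batch_status are hypothetical, and the substring matching is an assumption about how error_message text looks.

```python
# Substring -> likely cause, transcribed from the table above.
ERROR_CAUSES = [
    ("anthropic.RateLimitError", "Anthropic API rate limit hit"),
    ("asyncpg.ConnectionDoesNotExist", "Supabase pooler dropped the connection"),
    ("connection is closed", "Supabase pooler dropped the connection"),
    ("ValidationError", "Agent returned unexpected output format"),
    ("Article writing failed for categories", "One or more parallel article writes failed"),
    ("No DataCard found", "Data prerequisites missing"),
    ("No BI brief", "Data prerequisites missing"),
    ("httpx.ConnectTimeout", "External API unreachable"),
]

# Recovery remap from the runbook; statuses not listed are restored as-is.
RECOVERY_REMAP = {"writing": "pending", "building_html": "article_review"}


def classify_pipeline_error(error_message: str) -> str:
    """Return the first matching cause, or a fallback for unknown errors."""
    for pattern, cause in ERROR_CAUSES:
        if pattern in error_message:
            return cause
    return "Unknown cause: read the full traceback"


def recovered_batch_status(failed_at_status: str) -> str:
    """Return the status a failed batch is restored to on retry."""
    return RECOVERY_REMAP.get(failed_at_status, failed_at_status)
```

Pasting the error_message value from pipeline_runs into classify_pipeline_error gives a first guess at which table row applies.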
Incident 5: Visibility Scan Shows 0 After Deployment
Severity: Medium Who can fix: Technical (founder)
Steps:
- Confirm the deployment record reached "confirmed" status:
    SELECT id, status, hub_page_url, visibility_score_id FROM deployments
    WHERE location_id = 'LOCATION-UUID-HERE' ORDER BY created_at DESC LIMIT 1;
  - If status is "pending", the customer has not confirmed deployment yet.
  - If status is "measurement_failed", the scan ran but errored (see step 3).
  - If status is "confirmed" with no visibility_score_id, the background task may not have started.
- Verify the API keys are valid in Railway environment variables:
  - OPENAI_API_KEY -- required for ChatGPT visibility checks.
  - PERPLEXITY_API_KEY -- required for Perplexity visibility checks.
  - Test them quickly:
      # OpenAI
      curl https://api.openai.com/v1/models -H "Authorization: Bearer sk-..."
      # Perplexity
      curl https://api.perplexity.ai/chat/completions \
        -H "Authorization: Bearer pplx-..." \
        -H "Content-Type: application/json" \
        -d '{"model":"sonar","messages":[{"role":"user","content":"test"}]}'
- Check Railway logs for "Visibility scan failed for deployment". The traceback will indicate the specific failure:
  - "No visibility adapters configured" -- API keys are missing or empty.
  - "No BI brief found" -- The DataCard has no BI brief; the BI step may not have completed.
  - "No test prompts in BI brief" -- The BI Agent did not generate test prompts.
  - Rate limit or timeout errors from OpenAI/Perplexity -- wait and re-trigger.
- Verify the hub_page_url is publicly accessible. The visibility engines need to be able to find the business online. Try loading the URL in a browser.
- To re-trigger measurement (rate-limited to 1/hour per customer):
    curl -X POST https://api.aireadyplumber.com/api/v1/portal/my-project/measure-visibility \
      -H "Authorization: Bearer CUSTOMER-TOKEN"
  Or the customer can click "Measure Visibility" in their portal.
Incident 6: Visibility Scan Stuck in "Measuring"
Severity: Low Who can fix: Technical (founder)
Steps:
- Check the deployment status:
    SELECT id, status, created_at, updated_at FROM deployments
    WHERE location_id = 'LOCATION-UUID-HERE' ORDER BY created_at DESC LIMIT 1;
  If status = "measuring" and updated_at is more than 10 minutes ago, the background task likely timed out or crashed silently.
- Check Railway logs for "Visibility scan failed" or asyncio.TimeoutError. The scan has a 10-minute timeout (asyncio.wait_for(timeout=600.0)).
- Manual resolution: Reset the deployment status and re-trigger:
    UPDATE deployments SET status = 'measurement_failed'
    WHERE id = 'DEPLOYMENT-UUID-HERE' AND status = 'measuring';
  Then trigger a new measurement via the customer portal or API (see Incident 5, step 5).
- If this happens repeatedly, check:
  - Are the OpenAI/Perplexity APIs responding slowly? The per-adapter timeout may need adjustment.
  - Is the backend under heavy load? Multiple concurrent visibility scans are guarded by a semaphore, but extreme load could cause queuing.
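The "stuck" test from the first step (status is measuring and updated_at is older than the 10-minute scan timeout) can be sketched as follows. The function and constant names are hypothetical, and the sketch assumes updated_at is a timezone-aware UTC timestamp.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Matches the 10-minute scan timeout quoted in the runbook.
SCAN_TIMEOUT = timedelta(minutes=10)


def is_stuck_measuring(status: str, updated_at: datetime,
                       now: Optional[datetime] = None) -> bool:
    """True when a deployment has sat in 'measuring' past the scan timeout."""
    if now is None:
        now = datetime.now(timezone.utc)
    return status == "measuring" and (now - updated_at) > SCAN_TIMEOUT
```

Only rows for which this returns True are candidates for the manual status reset.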
Incident 7: Customer Sees Wrong Content or Missing Images
Severity: Medium Who can fix: Technical (founder)
Steps:
- Identify which article or page has the issue. Get the article slug and location ID from the customer or admin dashboard.
- Check the R2 bucket for the correct files:
  - HTML files are stored at: {location_id}/html/{slug}.html
  - Images are stored at: {location_id}/images/{filename}
  - Use the R2 dashboard (Cloudflare) or the AWS CLI with R2 credentials to list/view files:
      aws s3 ls s3://aeo-bunny-images/{location_id}/html/ \
        --endpoint-url https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com
- Check the R2_PUBLIC_URL environment variable in Railway. The HTML output references images using this base URL. If it is wrong or expired, images will 404.
- Open the HTML file and check the image src attributes. They should point to {R2_PUBLIC_URL}/{location_id}/images/{filename}. If the paths are wrong, the HTML assembly step may have used incorrect configuration.
- To fix: You may need to re-run the HTML assembly step for the affected batch. This requires:
  - Setting the batch status back to the pre-HTML-assembly state.
  - Re-triggering the pipeline to rebuild HTML for that batch.
  - This is a technical operation -- work with the codebase directly.
- If a single image is missing from R2 but the HTML is correct, re-upload the image to the correct key in R2.
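Checking the image src attributes by hand gets tedious on long pages. The sketch below lists every img src that does not start with the expected {R2_PUBLIC_URL}/{location_id}/images/ prefix. It uses only the standard library; the function name and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_bad_image_sources(html: str, r2_public_url: str, location_id: str) -> List[str]:
    """Return image URLs that do not use the expected R2 prefix."""
    prefix = f"{r2_public_url.rstrip('/')}/{location_id}/images/"
    collector = _ImgSrcCollector()
    collector.feed(html)
    return [src for src in collector.sources if not src.startswith(prefix)]
```

An empty result means the paths are consistent with R2_PUBLIC_URL, so a 404 is more likely a missing object in the bucket than a bad HTML build.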
Incident 8: ZIP Download Fails or Is Empty
Severity: Medium Who can fix: Technical (founder)
Steps:
- Check which batch the customer is trying to download. Get the batch number and location ID.
- Verify articles exist for that batch:
    SELECT id, title, batch_number, body_html IS NOT NULL AS has_html FROM articles
    WHERE location_id = 'LOCATION-UUID-HERE' AND batch_number = BATCH_NUMBER;
  - If no articles exist for the batch, the writing step did not complete.
  - If has_html is false for some articles, the HTML assembly step did not finish.
- Check R2 for the ZIP file:
  - ZIP files are stored at: {location_id}/batch-{batch_number}.zip.
  - Verify it exists and is not 0 bytes.
- Check Railway logs for errors during ZIP generation. Search for the location ID around the time the batch was built.
- Check R2 connectivity from the backend:
  - Look for R2 or CloudStorageClient errors in Railway logs.
  - Verify R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, and R2_SECRET_ACCESS_KEY are set and valid.
- To re-trigger ZIP generation: The ZIP is built as part of the batch pipeline flow. If the batch shows status = "ready_to_ship" but the ZIP is missing or broken, you may need to manually rebuild it. Check the batches table for the zip_url value.
- If R2 is unreachable entirely (all downloads failing), check:
  - Cloudflare R2 service status.
  - Whether the R2 API token has expired or been revoked.
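The first two checks can be sketched as helpers: one builds the R2 key where the ZIP should live, the other reads the has_html column from the articles query. Both function names are hypothetical, and the key layout is taken directly from the path given in the steps.

```python
from typing import List, Optional


def zip_key(location_id: str, batch_number: int) -> str:
    """R2 object key for a batch ZIP, per the layout in the runbook."""
    return f"{location_id}/batch-{batch_number}.zip"


def zip_blocker(has_html_flags: List[bool]) -> Optional[str]:
    """Explain why a ZIP could not be complete, or None if articles look ready.

    has_html_flags: one has_html value per article row returned by the query.
    """
    if not has_html_flags:
        return "no articles for the batch: the writing step did not complete"
    if not all(has_html_flags):
        return "some articles lack HTML: the HTML assembly step did not finish"
    return None
```

A None result means the articles are in place, so the next things to check are the ZIP object itself and R2 connectivity.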
Incident 9: GHL Webhook Not Firing
Severity: Low Who can fix: Operator (first checks) / Technical (if config issue)
Steps:
- Check Railway logs for GHL webhook messages:
  - "GHL webhook sent: EVENT_NAME (audience=..., status=200)" -- webhook sent successfully.
  - "GHL webhook failed for event EVENT_NAME (non-fatal)" -- webhook attempted but failed. The pipeline continues regardless.
  - "GHL webhook URL not configured, skipping event" -- GHL_WEBHOOK_URL is not set.
- Verify environment variables in Railway:
  - GHL_WEBHOOK_URL -- must be set to the GHL webhook endpoint (e.g., https://services.leadconnectorhq.com/hooks/...).
  - Inbound webhook auth now uses Ed25519 signature verification via X-GHL-Signature (primary). GHL_WEBHOOK_SECRET is legacy fallback only.
- Check the GHL side:
  - Is the webhook endpoint still active in GHL? Endpoints can be disabled or deleted.
  - Is the automation attached to the webhook active?
  - Check GHL's webhook logs for incoming payloads.
- GHL webhook failures are non-fatal. The pipeline, deployments, and all other operations continue normally even if webhooks fail. Webhooks are notification-only. The backend catches all webhook exceptions and logs them as warnings.
- If webhooks were not firing for a period and you need to resend notifications, you can check the notifications table for the events that should have been sent and manually trigger the corresponding GHL automations.
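When scanning a long log export, the three message shapes listed in the first step can be told apart with a small classifier. This is a sketch under the assumption that log lines contain those messages verbatim; the function name, the state labels, and the audience value in the usage example are hypothetical.

```python
import re
from typing import Optional, Tuple

_SENT = re.compile(r"GHL webhook sent: (\S+)")
_FAILED = re.compile(r"GHL webhook failed for event (\S+)")
_NOT_CONFIGURED = "GHL webhook URL not configured"


def classify_ghl_log_line(line: str) -> Tuple[str, Optional[str]]:
    """Return (state, event_name) for a Railway log line.

    state is one of "sent", "failed", "not_configured", or "other".
    """
    match = _SENT.search(line)
    if match:
        return "sent", match.group(1)
    match = _FAILED.search(line)
    if match:
        return "failed", match.group(1)
    if _NOT_CONFIGURED in line:
        return "not_configured", None
    return "other", None
```

For example, a line containing "GHL webhook sent: password_reset_requested (audience=customer, status=200)" is classified as ("sent", "password_reset_requested").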
Incident 10: Admin Dashboard Not Loading
Severity: Medium Who can fix: Operator (first checks) / Technical (if backend issue)
Steps:
- Check Vercel deployment status:
  - Go to vercel.com and open the portal project.
  - Check if the latest deployment succeeded (green checkmark) or failed (red X).
  - If the deployment failed, check the build logs for errors.
- Open the browser developer console (F12 or right-click > Inspect > Console tab). Look for:
  - Red error messages, especially CORS errors or "Failed to fetch".
  - 401 Unauthorized -- your session may have expired. Log out and log back in.
  - net::ERR_CONNECTION_REFUSED -- the backend is down.
- Verify NEXT_PUBLIC_API_URL in Vercel environment variables points to the correct backend:
  - Should be: https://api.aireadyplumber.com
  - If it points to localhost or a staging URL, the frontend cannot reach the production backend.
- Check if the backend health endpoint responds:
    curl https://api.aireadyplumber.com/health
  - If it returns {"status":"ok"}, the backend is alive and the issue is frontend-only.
  - If it times out or returns an error, see Incident 11 (502 Bad Gateway).
- Check that NEXT_PUBLIC_SUPABASE_URL and NEXT_PUBLIC_SUPABASE_ANON_KEY in Vercel match the Supabase project. Mismatched keys will cause auth failures.
- Try loading the page in an incognito/private window to rule out browser cache or extension issues.
- If the dashboard loads but shows no data, the API calls may be failing silently. Check the browser Network tab for failed requests (red entries).
Incident 11: "502 Bad Gateway" on the API
Severity: High Who can fix: Technical (founder)
Steps:
- Check Railway deployment status:
  - Go to railway.com and open the backend service.
  - Is the deployment active (green) or crashed (red)?
  - If crashed, click into the deployment and check the logs for the crash reason.
- Check Railway logs for startup errors. Common causes:
| Error pattern | Cause | Fix |
|---|---|---|
| ModuleNotFoundError | Missing dependency | Check requirements.txt or Docker build. Redeploy. |
| sqlalchemy.exc.ArgumentError: Could not parse | DATABASE_URL is malformed | The code accepts postgres://, postgresql://, and postgresql+asyncpg:// schemes and normalizes them automatically. If this error appears, the URL has a deeper format issue (missing host, bad encoding, etc.). |
| Connection refused on port 5432 | Cannot reach Supabase DB | Check Supabase status, verify DATABASE_URL. |
| alembic.util.exc.CommandError | Database schema out of date | Run railway run alembic upgrade head. |
| RuntimeError: no running event loop | Async setup issue | Usually a code bug. Check recent commits. |
| Address/port binding error | Railway port configuration | Ensure the app binds to 0.0.0.0:$PORT. |
- Run database migrations if needed:
    cd aeo_bunny
    railway run alembic upgrade head
- Verify the DATABASE_URL format. The code accepts postgres://, postgresql://, and postgresql+asyncpg:// schemes and normalizes them automatically to postgresql+asyncpg://. Example:
    postgresql+asyncpg://postgres.[ref]:[password]@aws-0-[region].pooler.supabase.com:5432/postgres
  Common mistakes:
  - Using the direct connection string instead of the Session Pooler string.
  - Password containing special characters that are not URL-encoded.
- If the service crashed and won't restart:
  - Try a manual redeploy from the Railway dashboard.
  - Check if Railway has a resource limit issue (memory/CPU).
  - Check if a recent code push introduced a breaking change.
- If the backend starts but immediately crashes under load, check for:
  - Database connection pool exhaustion (too many concurrent requests).
  - Memory issues from large pipeline runs.
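The two DATABASE_URL points above (scheme normalization and URL-encoding the password) can be illustrated with the standard library. This is a sketch of the idea only, not the backend's actual normalization code; both function names and the example host are hypothetical.

```python
from urllib.parse import quote

ASYNC_SCHEME = "postgresql+asyncpg://"
ACCEPTED_SCHEMES = ("postgresql+asyncpg://", "postgresql://", "postgres://")


def normalize_database_url(url: str) -> str:
    """Rewrite any accepted scheme to postgresql+asyncpg://."""
    for scheme in ACCEPTED_SCHEMES:
        if url.startswith(scheme):
            return ASYNC_SCHEME + url[len(scheme):]
    raise ValueError("DATABASE_URL does not start with a supported scheme")


def build_database_url(user: str, password: str, host: str,
                       port: int = 5432, database: str = "postgres") -> str:
    """Assemble a URL with the user and password percent-encoded."""
    # safe="" forces characters such as @, :, and / to be encoded.
    return (f"{ASYNC_SCHEME}{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{database}")
```

A password such as p@ss/w:rd becomes p%40ss%2Fw%3Ard, which keeps the @ and : inside the password from being read as URL delimiters.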
Incident 12: Pipeline Stuck at Photo Gate
Severity: Medium Who can fix: Operator (first checks) / Technical (if auto-resume failed)
Steps:
- Check the photo count for the customer's location:
    SELECT COUNT(*) FROM images WHERE location_id = 'LOCATION-UUID-HERE';
  - If count < 100, the customer has not uploaded enough photos. The pipeline is correctly waiting.
  - If count >= 100, the auto-resume logic should have fired but did not (see step 3).
- If count < 100: Contact the customer via GHL and remind them to upload at least 100 photos at /portal/photos. Verify the photos_upload_needed webhook was delivered (check GHL automation history for that contact).
- If count >= 100 but pipeline has not resumed: Check the photo_gate_passed flag:
    SELECT photo_gate_passed, status, current_step FROM pipeline_runs
    WHERE location_id = 'LOCATION-UUID-HERE';
  If photo_gate_passed = FALSE despite enough photos, the CAS-based auto-resume failed (race condition or backend error during the upload that hit the threshold). Manually set the flag and re-queue:
    UPDATE pipeline_runs
    SET photo_gate_passed = TRUE, status = 'queued', current_step = 'gate_photo_upload'
    WHERE location_id = 'LOCATION-UUID-HERE';
  Then trigger the pipeline via the admin API:
    # Pipeline resumes via the gate approval endpoint:
    curl -X POST https://api.aireadyplumber.com/api/v1/admin/projects/LOCATION-UUID/gate \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"action": "approve"}'
- Note: photo_gate_passed = TRUE on the pipeline run means the gate has been cleared and will not re-block batches 2-5. Only batch 1 HTML assembly is held by this gate.
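The branching in these steps comes down to two values: the photo count and the photo_gate_passed flag. A minimal sketch, with a hypothetical function name and hypothetical return labels:

```python
# Minimum photo count required to clear the gate, per the runbook.
PHOTO_MINIMUM = 100


def photo_gate_action(photo_count: int, photo_gate_passed: bool) -> str:
    """Map the two query results to the runbook's next action."""
    if photo_count < PHOTO_MINIMUM:
        # Pipeline is correctly waiting; remind the customer via GHL.
        return "waiting_for_customer"
    if not photo_gate_passed:
        # Enough photos but the flag was never set: escalate to the founder.
        return "auto_resume_failed"
    return "gate_cleared"
```

Only the auto_resume_failed outcome calls for the manual UPDATE and gate approval request.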
Incident 13: Revision Execution Failed
Severity: High Who can fix: Technical (founder) / Operator (manual approval/rejection)
Steps:
- Identify the stuck revision. Check for BatchRevision records in executing status with a stale updated_at:
    SELECT id, batch_id, status, estimated_cost, updated_at FROM batch_revisions
    WHERE status = 'executing' AND updated_at < NOW() - INTERVAL '30 minutes';
- Check the revision_instructions table for the failed revision to see individual article-level failures:
    SELECT ri.id, ri.article_id, ri.execution_status, ri.error_detail FROM revision_instructions ri
    WHERE ri.batch_revision_id = 'BATCH-REVISION-UUID-HERE' AND ri.execution_status = 'failed';
  The error_detail field will contain the specific error (rate limit, validation error, DB error).
- Check Railway logs around the time updated_at was last updated. Search for "revision" or the batch revision ID to find the traceback.
- To retry: Use the admin retry endpoint:
    curl -X POST https://api.aireadyplumber.com/api/v1/admin/revisions/REVISION-UUID/retry \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"
- If retry is not available or keeps failing: Manually mark the revision as failed to unblock the batch, then let the operator re-approve when ready:
    UPDATE batch_revisions SET status = 'failed'
    WHERE id = 'BATCH-REVISION-UUID-HERE' AND status = 'executing';
  The customer can then re-submit feedback and a new revision cycle can begin.
- If a single revision_instruction record failed but others succeeded, the partial failure may leave some articles unrevised. Review those articles individually and decide whether to re-trigger or manually update.
Incident 14: Revision Auto-Approval Bypassed (Cost Over Threshold)
Severity: Low Who can fix: Operator
Steps:
- A BatchRevision enters status = 'pending_approval' even when AUTO_APPROVE_REVISIONS=true. This is expected behavior when the estimated_cost exceeds REVISION_COST_THRESHOLD. It is not a bug -- it is a cost-control gate.
- Verify the cost threshold is triggering correctly:
    SELECT id, batch_id, status, estimated_cost, created_at FROM batch_revisions
    WHERE status = 'pending_approval' ORDER BY created_at DESC LIMIT 5;
  Compare estimated_cost against the REVISION_COST_THRESHOLD setting (check Railway env vars or the Settings page in the admin dashboard).
- Review the revision instructions to understand why the cost estimate is high. A large batch of articles with extensive revision notes will produce a higher estimate.
- Operator action: Go to the admin dashboard, open the project, find the pending revision, and either:
  - Approve -- the revision proceeds and articles are rewritten.
  - Reject -- the revision is cancelled. The customer can submit new feedback with a narrower scope.
- If the threshold is consistently too low for your typical revision scope, adjust REVISION_COST_THRESHOLD in Railway environment variables or via the Settings page (super_admin only).
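The gate described in the first step can be written as a single comparison. The sketch below is an assumption about the rule's shape, not the backend's code: the function name is hypothetical, and "approved" as the auto-approved status is inferred from the status values listed for batch_revisions.

```python
def initial_revision_status(estimated_cost: float, cost_threshold: float,
                            auto_approve: bool) -> str:
    """Status a new revision starts in under the cost-control gate."""
    # Auto-approval only applies when the estimate does not exceed the threshold.
    if auto_approve and estimated_cost <= cost_threshold:
        return "approved"
    return "pending_approval"
```

With auto-approval on and a threshold of 5.00, an estimate of 7.50 still lands in pending_approval, which is the situation this incident describes.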
Incident 15: Readiness Check Stuck in "Running"
Severity: Medium Who can fix: Technical (founder)
Steps:
- Check for readiness scores stuck in running status:
    SELECT id, location_id, trigger, status, created_at, updated_at FROM readiness_scores
    WHERE status = 'running' AND updated_at < NOW() - INTERVAL '10 minutes';
- Check Railway logs for "Readiness engine failed for location" or httpx.ReadTimeout. The engine has per-checker HTTP timeouts. If the Google PageSpeed Insights API is slow or rate-limiting, the speed checker can block progress.
- Verify the PAGESPEED_API_KEY environment variable is set and valid in Railway. The speed checker still runs without an API key (using the unauthenticated PageSpeed Insights endpoint), but unauthenticated requests have stricter rate limits and may produce quota errors or timeouts under load.
- Check if the target website is blocking the readiness engine's HTTP requests. Some websites block automated HTTP clients, causing crawlability checker timeouts.
- Manual resolution: Reset the stuck score to failed and re-trigger:
    UPDATE readiness_scores
    SET status = 'failed', error_message = 'Manual reset: was stuck in running'
    WHERE id = 'SCORE-UUID-HERE' AND status = 'running';
  Then re-trigger via the admin API:
    curl -X POST https://api.aireadyplumber.com/api/v1/readiness/LOCATION-UUID/check \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"trigger": "intake"}'
- If readiness checks consistently fail for a specific location, check:
  - Is the website_url on the business record correct and publicly accessible?
  - Is the SSRF validator blocking the URL (e.g., the website resolves to an internal/private IP)?
Incident 16: Readiness Score Shows 0 or Is Unexpectedly Low
Severity: Low Who can fix: Technical (founder)
Steps:
- Check the readiness_checks records for the affected score:
    SELECT rc.category, rc.check_name, rc.status, rc.value, rc.message FROM readiness_checks rc
    WHERE rc.score_id = 'SCORE-UUID-HERE';
  - If status is 'failed' for all checks, no useful scoring data was collected. The score will be low by design.
  - Look at the message field to understand what the checker flagged.
- A score of 0 for a category usually means the checker encountered an exception that was caught and marked the check as failed. Check Railway logs around the time the score was created for errors from crawlability, schema, speed, or structured_data checkers.
- Common causes for low scores:
  - Crawlability 0: Website has a robots.txt blocking all crawlers, or the site returned a non-200 status.
  - Schema 0: No JSON-LD or structured data found on the homepage.
  - Speed 0: PageSpeed Insights returned no data (quota exceeded, API key missing, or the URL returned an error).
  - Structured data 0: JSON-LD present but failed the NAP consistency check (business name/address/phone in schema does not match onboarding data).
- If the score seems incorrect and the website has been recently updated, re-trigger a fresh check:
    curl -X POST https://api.aireadyplumber.com/api/v1/readiness/LOCATION-UUID/check \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"trigger": "post_deployment"}'
Incident 17: Visibility Adapter Error (Google AIO / Gemini)
Severity: Medium Who can fix: Technical (founder)
Steps:
- Identify which adapter(s) are failing. Check visibility_checks for incomplete results:
    SELECT vc.id, vc.engine, vc.created_at, vc.analysis_status, vc.analysis_error
    FROM visibility_checks vc
    LEFT JOIN visibility_check_analyses vca ON vca.check_id = vc.id
    WHERE vc.score_id = 'SCORE-UUID-HERE' ORDER BY vc.engine;
  Look for analysis_status = 'failed' and read the analysis_error field.
- Check Railway logs for adapter-specific errors. Search for "Google AIO query failed" or "Gemini query failed":
  - 401 Unauthorized from DataForSEO -- DATAFORSEO_LOGIN or DATAFORSEO_PASSWORD is wrong or the account is suspended.
  - 403 Forbidden from Google -- GOOGLE_GEMINI_API_KEY is invalid or the Generative Language API is not enabled in the GCP project.
  - 429 Too Many Requests -- rate limit hit. Wait and re-trigger.
  - asyncio.TimeoutError -- adapter timed out. The DataForSEO and Gemini adapters each have a timeout parameter (default 30s). Under high load, this may need increasing.
  - Tasks returned status_code != 20000 -- DataForSEO task-level error. The query may be malformed or the SERP endpoint quota is exhausted.
- Verify the relevant API keys are set and non-empty in Railway:
  - DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD -- required for Google AIO adapter.
  - GOOGLE_GEMINI_API_KEY -- required for Gemini adapter.
  - Note: These adapters are dormant by default. They only activate when credentials are present. If you see them producing errors, the keys were recently added but are invalid.
- Check the engine weights configuration. If an adapter is activated by accident (credentials set but weights left at 0), it will still run and generate errors. Verify ENGINE_WEIGHTS in Railway matches the intended active engines:
    {"chatgpt": 0.5, "perplexity": 0.5, "google_aio": 0.0, "gemini": 0.0}
  Set an engine's weight to 0.0 to effectively disable it from scoring (it will still run but won't affect the composite score).
- To re-trigger a visibility scan after fixing credentials:
    curl -X POST https://api.aireadyplumber.com/api/v1/portal/my-project/measure-visibility \
      -H "Authorization: Bearer CUSTOMER-TOKEN"
  Or use the admin reanalyze endpoint:
    curl -X POST https://api.aireadyplumber.com/api/v1/admin/visibility/SCORE-UUID/reanalyze \
      -H "Authorization: Bearer YOUR-ADMIN-TOKEN"
- If adapters are consistently failing and you need to exclude them temporarily, set their weight to 0.0 in ENGINE_WEIGHTS and deploy. The remaining active adapters will continue scoring normally with renormalized weights.
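To see what "renormalized weights" means in practice, here is a minimal sketch that drops zero-weight engines and rescales the rest to sum to 1. It assumes simple proportional renormalization, which may differ from the backend's exact formula; the function name is hypothetical.

```python
from typing import Dict


def renormalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale the positive weights so they sum to 1.0.

    Engines with weight 0.0 are left out of the result, since they do not
    contribute to the composite score.
    """
    active = {engine: w for engine, w in weights.items() if w > 0}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one engine needs a positive weight")
    return {engine: w / total for engine, w in active.items()}
```

For example, if ENGINE_WEIGHTS were {"chatgpt": 0.5, "perplexity": 0.3, "google_aio": 0.0, "gemini": 0.0}, the two active engines would end up at roughly 0.625 and 0.375 under this scheme.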
Quick Reference: Escalation Paths
| Incident | Operator can handle? | Escalate to founder when? |
|---|---|---|
| 1. Customer can't log in | Yes (first 3 steps) | Supabase config or JWKS issues |
| 2. Missing email | Yes | GHL webhook URL misconfigured |
| 3. Stuck at gate | Yes | Never -- this is operator's job |
| 4. Pipeline failed | No | Always |
| 5. Visibility shows 0 | No | Always |
| 6. Scan stuck measuring | No | Always |
| 7. Wrong content/images | No | Always |
| 8. ZIP download fails | No | Always |
| 9. GHL webhook not firing | Yes (check GHL side) | Backend config issues |
| 10. Dashboard not loading | Yes (first 2 steps) | Backend down or config wrong |
| 11. 502 Bad Gateway | No | Always |
| 12. Stuck at photo gate | Yes (check count + GHL) | Auto-resume failed (>= 100 photos but not resumed) |
| 13. Revision execution failed | Yes (approve/reject) | Retry endpoint fails, code-level error |
| 14. Revision cost over threshold | Yes (approve or reject) | Never -- this is operator's job |
| 15. Readiness check stuck | No | Always |
| 16. Readiness score low/zero | No | Always |
| 17. Visibility adapter error | No | Always |
Quick Reference: Key Database Tables for Debugging
| Table | What it tells you |
|---|---|
| users | User accounts, roles, active status, Supabase ID linkage |
| pipeline_runs | Pipeline status, current step, current batch, error messages, attempt count, photo_gate_passed |
| batches | Per-batch status, failed_at_status for recovery, zip_url |
| articles | Article content, batch assignment, review status, HTML output |
| article_versions | Snapshot history for each article; created on every revision cycle |
| images | Customer-uploaded photos per location, quality scores, upload status |
| deployments | Deployment status machine, hub_page_url, visibility score linkage |
| data_cards | BI brief, client dossier, test prompts |
| pending_leads | Pre-purchase records from GHL webhook |
| notifications | In-app notification history |
| batch_revisions | Per-batch revision requests, status (pending_approval/approved/executing/failed), estimated_cost |
| revision_instructions | Per-article revision tasks within a batch revision; execution_status and error_detail |
| readiness_scores | Per-location readiness score runs (intake or post_deployment), composite score, status |
| readiness_checks | Individual category checks (crawlability, schema, speed, structured_data) within a score run |
| visibility_check_analyses | Deep analysis results for individual visibility checks (coverage_score, quality_score, sentiment, citation_context) |
Quick Reference: Key API Endpoints for Debugging
| Endpoint | Method | Auth | Purpose |
|---|---|---|---|
| /health | GET | None | Backend health check |
| /api/v1/me | GET | Any | Current user info and role routing |
| /api/v1/admin/projects/{id}/gate | POST | Admin | Approve or reject a quality gate (also resumes pipeline) |
| /api/v1/projects/{location_id}/retry | POST | Admin | Retry a failed pipeline |
| /api/v1/admin/projects/{id}/batches | GET | Admin | List all batches and their statuses for a project |
| /api/v1/admin/revisions/{revision_id}/retry | POST | Admin | Retry a stuck revision execution |
| /api/v1/admin/visibility/{score_id}/reanalyze | POST | Admin | Re-run deep analysis on a visibility score |
| /api/v1/portal/my-project/measure-visibility | POST | Customer | Trigger on-demand visibility scan |
| /api/v1/portal/my-project/batches/{batch_number}/confirm-deployment | POST | Customer | Confirm batch deployment with checklist |
| /api/v1/readiness/{location_id}/check | POST | Admin | Trigger a readiness check (intake or post_deployment) |
| /api/v1/readiness/{location_id}/scores/latest | GET | Authenticated (location access) | Latest readiness scores for a location |
| /api/v1/portal/my-project/photos | GET | Customer | Paginated photo list for a location |
| /api/v1/portal/my-project/photos | POST | Customer | Upload photos (up to 300 total, 100 minimum gate) |
| /api/v1/admin/metrics | GET | Admin | System metrics overview |