Troubleshooting

Use this runbook to narrow incidents quickly before deep code-level debugging.

Start with this triage order

Check server health and metrics reachability
Verify WebSocket connectivity and auth outcomes
Confirm room sync progression (pack/request/reconcile)
Inspect blob availability and upload verification outcomes
Validate control-plane command authorization and payload shape

Fast symptom index

Clients cannot connect

Check bind/listener route and proxy upgrade support (/ws/:room_id)
Verify TLS/wss:// config and cert trust chain
Inspect handshake errors (expected hello, auth: ...)
Confirm token is valid for room and not expired
Review close codes (4002, 4008, 4001, 1011)

Room does not converge

Confirm peers exchange pack and request as expected
Check for malformed pack errors (bad base64, bad postcard)
Verify subscription scope is not unintentionally filtering required paths
Compare canonical hash behavior using replay workflows
Look for frequent resync-required disconnect patterns

Blob operations fail

Verify blob-upload hash/data integrity
For direct blob I/O, check capability negotiation and URL grant path
Distinguish upload-denied (fallback path) vs upload-rejected (verify failed)
Confirm blob-available broadcasts and follow-up blob-request behavior
Validate storage backend permissions and object availability

Presence looks stale

Confirm client heartbeat/update cadence is active
Validate disconnect cleanup behavior in UI logic
Ensure stale presence is not treated as durable room state
Check reconnect path restores local presence writes

Control-plane commands rejected

Confirm capability scopes for command family (archive/query/topology/tick/policy/room)
Validate required payload fields and value constraints
Check preconditions (query spec exists, proposal validated, lineage checkpoint valid)
Parse reject reason class for deterministic next action

Close codes and what they usually mean

4001 resync required: peer fell behind or stream budget issue; perform full resync
4002 token expired: refresh token and reconnect
4008 rate limit exceeded: tune client send behavior and/or ingress limits
1011 server overload: investigate capacity/backpressure and retry policy

Debug workflows

1) Connectivity + handshake workflow

Confirm listener and route exposure
Run a known-good client hello
Inspect initial welcome envelope and negotiated caps
Verify token/capability state for locked rooms

2) Convergence workflow

Capture message window around divergence (pack, request, mst-*)
Compare one healthy vs failing peer command stream
Re-run with minimal subscription scope to isolate filter behavior
Use replay evidence to validate deterministic state

3) Blob workflow

Verify hash and payload encoding at producer
Trace request path (blob-upload/request-upload/blob-uploaded)
Confirm availability relay and consumer fetch path
Validate backing store read/write permissions

4) Control-plane workflow

Check capability required for command
Validate command payload schema
Validate domain preconditions (archive ref, lineage, promotion lifecycle, query spec)
Retry with operator token scoped to exact required capability

Evidence bundle to capture

Relevant server logs and timestamps
Metrics snapshots around incident window
Representative command payloads (with secrets redacted)
Replay output when deterministic-state validation is needed
Close codes and reconnect attempts from clients
Capability scope used for failing control-plane requests

Preventive practices

Keep rate-limit and broadcast settings explicit in config
Add canary tests for handshake, sync, blob flow, and one control-plane command per family
Alert on spikes in error responses, close codes, and reject classes
Keep a redaction-safe incident template for command payload capture

Upgrades and migrations