Troubleshooting
Use this runbook to narrow incidents quickly before deep code-level debugging.
Start with this triage order
- Check server health and metrics reachability
- Verify WebSocket connectivity and auth outcomes
- Confirm room sync progression (pack/request/reconcile)
- Inspect blob availability and upload verification outcomes
- Validate control-plane command authorization and payload shape
Fast symptom index
Clients cannot connect
- Check bind/listener route and proxy upgrade support (/ws/:room_id)
- Verify TLS/wss:// config and cert trust chain
- Inspect handshake errors (expected hello, auth: ...)
- Confirm token is valid for room and not expired
- Review close codes (4002, 4008, 4001, 1011)
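As a first pass on the route and TLS checks above, the client URL can be sanity-checked before looking at the server. The sketch below is a minimal illustration: the function name `check_ws_url` and the `require_tls` flag are hypothetical, and the only facts taken from this runbook are the `wss://` scheme and the `/ws/:room_id` route shape.

```python
from typing import List
from urllib.parse import urlparse


def check_ws_url(url: str, require_tls: bool = True) -> List[str]:
    """Return a list of problems found in a client WebSocket URL (empty if none)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("ws", "wss"):
        problems.append("unexpected scheme %r; expected ws or wss" % parsed.scheme)
    elif require_tls and parsed.scheme == "ws":
        problems.append("plain ws:// in use where wss:// is required")
    if not parsed.netloc:
        problems.append("missing host")
    # Route shape from the runbook: /ws/:room_id
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[0] != "ws":
        problems.append("path %r does not match /ws/:room_id" % parsed.path)
    return problems


if __name__ == "__main__":
    for candidate in ("wss://sync.example.com/ws/room-1", "https://sync.example.com/room-1"):
        print(candidate, check_ws_url(candidate))
```

An empty result only means the URL is well-formed; proxy upgrade support and the cert trust chain still need to be checked on the serving side.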
Room does not converge
- Confirm peers exchange pack and request as expected
- Check for malformed pack errors (bad base64, bad postcard)
- Verify subscription scope is not unintentionally filtering required paths
- Compare canonical hash behavior using replay workflows
- Look for frequent resync-required disconnect patterns
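When a bad base64 error shows up, it can help to check a captured pack payload offline before suspecting the server. The sketch below assumes the payload uses the standard base64 alphabet with padding (an assumption; the deployment may use a URL-safe variant), and the function name is hypothetical. It does not attempt postcard decoding, so a bad postcard error still needs to be investigated with the project's own tooling.

```python
import base64
import binascii


def classify_pack_payload(encoded: str) -> str:
    """Classify a captured pack payload by whether it decodes as base64."""
    try:
        # validate=True rejects characters outside the standard alphabet
        raw = base64.b64decode(encoded, validate=True)
    except (binascii.Error, ValueError):
        return "bad base64"
    if not raw:
        return "empty payload"
    return "base64 ok"


if __name__ == "__main__":
    print(classify_pack_payload("aGVsbG8="))
    print(classify_pack_payload("not base64!!"))
```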
Blob operations fail
- Verify blob-upload hash/data integrity
- For direct blob I/O, check capability negotiation and URL grant path
- Distinguish upload-denied (fallback path) vs upload-rejected (verify failed)
- Confirm blob-available broadcasts and follow-up blob-request behavior
- Validate storage backend permissions and object availability
Presence looks stale
- Confirm client heartbeat/update cadence is active
- Validate disconnect cleanup behavior in UI logic
- Ensure stale presence is not treated as durable room state
- Check reconnect path restores local presence writes
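To make the heartbeat check concrete, a client or test harness can flag peers whose last presence update is older than a threshold. This is a sketch under stated assumptions: the `last_seen` mapping, the `stale_peers` name, and the 30-second default are all hypothetical and should be replaced with the cadence your clients actually use.

```python
from typing import Dict, List


def stale_peers(last_seen: Dict[str, float], now: float, max_age_s: float = 30.0) -> List[str]:
    """Return peer ids whose last presence update is older than max_age_s seconds."""
    return sorted(peer for peer, ts in last_seen.items() if now - ts > max_age_s)


if __name__ == "__main__":
    # Timestamps are seconds on any monotonic or wall clock, as long as they share a base.
    print(stale_peers({"a": 100.0, "b": 125.0}, now=140.0))
```

Peers returned here are candidates for UI cleanup only; as noted above, stale presence should not be treated as durable room state.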
Control-plane commands rejected
- Confirm capability scopes for command family (archive/query/topology/tick/policy/room)
- Validate required payload fields and value constraints
- Check preconditions (query spec exists, proposal validated, lineage checkpoint valid)
- Parse reject reason class for deterministic next action
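The scope check can be reproduced offline to tell a missing capability apart from a payload problem. The sketch below uses the six command families listed above; everything else is an assumption made for illustration, including the idea that a command name starts with its family followed by a hyphen, that scopes are plain family names, and the result strings returned.

```python
from typing import Optional, Set

COMMAND_FAMILIES = ("archive", "query", "topology", "tick", "policy", "room")


def required_family(command: str) -> Optional[str]:
    """Map a command name to its family, assuming a hypothetical 'family-action' naming."""
    prefix = command.split("-", 1)[0]
    return prefix if prefix in COMMAND_FAMILIES else None


def authorize(command: str, granted: Set[str]) -> str:
    """Return 'ok' or a coarse reject class for the given command and granted scopes."""
    family = required_family(command)
    if family is None:
        return "unknown-family"
    if family not in granted:
        return "missing-scope:" + family
    return "ok"


if __name__ == "__main__":
    print(authorize("archive-create", {"archive"}))
    print(authorize("policy-set", {"archive"}))
```

Returning a small, fixed set of reject classes keeps the next action deterministic: a missing scope points at the token, while anything else points at the payload or preconditions.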
Close codes and what they usually mean
- 4001 resync required: peer fell behind or stream budget issue; perform full resync
- 4002 token expired: refresh token and reconnect
- 4008 rate limit exceeded: tune client send behavior and/or ingress limits
- 1011 server overload: investigate capacity/backpressure and retry policy
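The close codes above can be kept as a lookup table in client tooling so that logs carry the suggested next step. This is a minimal sketch: the codes, meanings, and actions come from this runbook, while the names `CLOSE_CODE_ACTIONS` and `describe_close` and the fallback wording are illustrative.

```python
CLOSE_CODE_ACTIONS = {
    4001: ("resync required", "perform full resync"),
    4002: ("token expired", "refresh token and reconnect"),
    4008: ("rate limit exceeded", "tune client send behavior and/or ingress limits"),
    1011: ("server overload", "investigate capacity/backpressure and retry policy"),
}


def describe_close(code: int) -> str:
    """Return a one-line description and next action for a WebSocket close code."""
    meaning, action = CLOSE_CODE_ACTIONS.get(
        code, ("not listed in this runbook", "capture it in the evidence bundle")
    )
    return "%d %s: %s" % (code, meaning, action)


if __name__ == "__main__":
    for code in (4002, 1011, 1006):
        print(describe_close(code))
```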
Debug workflows
1) Connectivity + handshake workflow
- Confirm listener and route exposure
- Run a known-good client hello
- Inspect initial welcome envelope and negotiated caps
- Verify token/capability state for locked rooms
2) Convergence workflow
- Capture message window around divergence (pack, request, mst-*)
- Compare one healthy vs failing peer command stream
- Re-run with minimal subscription scope to isolate filter behavior
- Use replay evidence to validate deterministic state
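For the healthy-versus-failing comparison, a coarse first step is to line up the message types each peer saw and find where they first differ. The sketch below assumes each captured stream has already been reduced to an ordered list of message type strings (an assumption about your capture format), and the function name is hypothetical. Streams can legitimately interleave differently, so treat the result as a pointer to where to look, not as proof of a fault.

```python
from itertools import zip_longest
from typing import Optional, Sequence, Tuple


def first_divergence(
    healthy: Sequence[str], failing: Sequence[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, healthy_item, failing_item) at the first difference, or None."""
    for index, (a, b) in enumerate(zip_longest(healthy, failing)):
        if a != b:
            return index, a, b
    return None


if __name__ == "__main__":
    print(first_divergence(["pack", "request", "pack"], ["pack", "request", "request"]))
```

A `None` item in the result means one stream ended before the other.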
3) Blob workflow
- Verify hash and payload encoding at producer
- Trace request path (blob-upload/request-upload/blob-uploaded)
- Confirm availability relay and consumer fetch path
- Validate backing store read/write permissions
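The producer-side hash check in the first step can be reproduced offline against a captured blob. This sketch assumes the blob hash is a hex-encoded SHA-256 digest of the raw bytes, which this runbook does not state; substitute the algorithm and encoding your deployment actually uses. The function name is hypothetical.

```python
import hashlib


def blob_matches_hash(data: bytes, claimed_hex: str) -> bool:
    """Check raw blob bytes against a claimed digest (ASSUMED: hex-encoded SHA-256)."""
    actual = hashlib.sha256(data).hexdigest()
    return actual == claimed_hex.strip().lower()


if __name__ == "__main__":
    payload = b"example blob"
    claimed = hashlib.sha256(payload).hexdigest()
    print(blob_matches_hash(payload, claimed))
    print(blob_matches_hash(payload + b"!", claimed))
```

If the check passes locally but the server still answers with upload-rejected, the payload encoding on the wire is the next thing to inspect.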
4) Control-plane workflow
- Check capability required for command
- Validate command payload schema
- Validate domain preconditions (archive ref, lineage, promotion lifecycle, query spec)
- Retry with operator token scoped to exact required capability
Evidence bundle to capture
- Relevant server logs and timestamps
- Metrics snapshots around incident window
- Representative command payloads (with secrets redacted)
- Replay output when deterministic-state validation is needed
- Close codes and reconnect attempts from clients
- Capability scope used for failing control-plane requests
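Redaction of captured payloads is easy to get wrong by hand, so a small helper is worth keeping next to the incident template. The sketch below is illustrative only: the key names in `SENSITIVE_KEYS` are hypothetical and should be replaced with the fields that actually carry secrets in your payloads.

```python
from typing import Any

# Hypothetical key names; adjust to the fields that carry secrets in your payloads.
SENSITIVE_KEYS = {"token", "authorization", "secret", "password"}


def redact(value: Any) -> Any:
    """Return a copy of a JSON-like structure with sensitive values replaced."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    print(redact({"type": "archive", "token": "abc", "nested": [{"secret": "x", "ok": 1}]}))
```

The helper matches on key names only, so secrets embedded inside free-form string values still need a manual review before sharing.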
Preventive practices
- Keep rate-limit and broadcast settings explicit in config
- Add canary tests for handshake, sync, blob flow, and one control-plane command per family
- Alert on spikes in error responses, close codes, and reject classes
- Keep a redaction-safe incident template for command payload capture
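The spike alert can be prototyped on exported counts before it is encoded as a rule in your metrics system. This is a rough sketch: the function name, the `factor` of 3, and the `min_count` of 5 are hypothetical starting points, and both inputs are assumed to be flat lists of close codes observed in two windows of equal length.

```python
from collections import Counter
from typing import Iterable, List


def spiking_codes(
    baseline: Iterable[int], current: Iterable[int], factor: float = 3.0, min_count: int = 5
) -> List[int]:
    """Return close codes seen at least min_count times and more than factor x baseline."""
    base = Counter(baseline)
    now = Counter(current)
    return sorted(
        code for code, n in now.items() if n >= min_count and n > factor * base.get(code, 0)
    )


if __name__ == "__main__":
    print(spiking_codes([4008, 4008, 1011], [4008] * 10 + [1011] * 2))
```

The `min_count` floor keeps a code that goes from zero to one occurrence from paging anyone.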