Upgrades and migrations
Use this runbook to upgrade server/runtime versions without breaking room correctness or operator safety guarantees.Upgrade goals
- Preserve room history correctness
- Maintain client compatibility during rollout
- Avoid irreversible storage surprises
- Keep rollback options explicit
Upgrade strategy by change type
Patch/minor runtime upgrades
Use for low-risk changes with no known contract break:- Roll canary first
- Validate handshake, sync, blob flow, and one control-plane command family
- Expand rollout gradually while watching close/reject/error behavior
Capability-affecting upgrades
Use when capabilities or protocol features change:- Validate negotiated capability behavior (
welcome.caps) - Test mixed-client behavior (old/new SDK sessions)
- Delay enabling new capability-dependent paths until baseline fleet updated
Storage-impacting upgrades
Use when persistence format or archive/snapshot semantics are affected:- Take backup/export artifacts first
- Validate replay/hash invariants before and after upgrade
- Define rollback window before applying irreversible migrations
Preflight checklist
- Record current server version, SDK versions, and deployed config
- Snapshot active runtime flags and environment variable values
- Export backup artifacts (archives/snapshots as appropriate)
- Confirm compatibility expectations in
api-reference/compatibility-and-support - Identify commands/endpoints used by your clients and operator tooling
- Prepare canary environment with representative traffic profile
Rollout checklist
- Record current versions and config snapshot
- Upgrade one canary instance
- Validate:
- WebSocket handshake and sync (
hello->welcome->pack/request) - blob paths (
blob-uploadand/or direct I/O flow) - one command per control-plane family used in production
- WebSocket handshake and sync (
- Run replay/hash validation on canary room data
- Expand rollout in bounded increments
- Monitor close codes, reject reason classes, latency, and error rate
- Promote rollout only after stable observation window
Rollback planning
Define rollback before rollout:- Binary rollback path (previous image/package)
- Config rollback path (flags/env)
- Data rollback constraints (what can and cannot be reversed)
- Trigger thresholds for rollback (error/reject spikes, convergence failures, blob failures)
- Stop rollout immediately
- Revert runtime binary and config
- Verify reconnect and convergence behavior
- Re-run replay/hash checks on impacted rooms
- Document incident and update future preflight gates
Post-upgrade verification
- Confirm expected route surfaces (
/ws/:room_id,/metrics) - Validate command contract behavior for critical workflows
- Verify no sustained spike in:
errorenvelope rates- control-plane rejections
- close codes (
4001,4002,4008,1011)
- Verify persistence and GC lifecycle behavior remains healthy
Common upgrade pitfalls
- Upgrading server without testing capability negotiation with existing clients
- Enabling new command flows before issuing correctly scoped tokens
- Treating storage-impacting changes as binary-only rollouts
- Skipping replay/hash validation after migration steps
- Rolling back binaries without checking persistence compatibility assumptions