Upgrades and migrations

Use this runbook to upgrade server/runtime versions without breaking room correctness or operator safety guarantees.

Upgrade goals

Preserve room history correctness
Maintain client compatibility during rollout
Avoid irreversible storage surprises
Keep rollback options explicit

Upgrade strategy by change type

Patch/minor runtime upgrades

Use for low-risk changes with no known contract break:

Roll canary first
Validate handshake, sync, blob flow, and one control-plane command family
Expand rollout gradually while watching close/reject/error behavior
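
The gradual expansion above can be planned ahead of time as a list of cumulative instance counts. The following is a minimal sketch, not a prescribed procedure: the function names and the 5/25/50/100 percent steps are illustrative assumptions, and the health signal is whatever your own monitoring reports.

```python
def rollout_stages(total_instances: int, percents=(5, 25, 50, 100)) -> list[int]:
    """Return the cumulative number of upgraded instances targeted at each stage.

    The percent steps are illustrative assumptions, not recommended values.
    """
    if total_instances < 1:
        raise ValueError("total_instances must be at least 1")
    stages: list[int] = []
    for pct in percents:
        # Integer ceiling division avoids floating-point rounding surprises.
        count = -(-total_instances * pct // 100)
        # Always include at least one instance (the canary); never exceed the fleet.
        count = min(total_instances, max(1, count))
        if not stages or count > stages[-1]:
            stages.append(count)
    return stages


def next_stage(stages: list[int], upgraded: int, healthy: bool):
    """Return the next target count, or None to hold the rollout."""
    if not healthy:
        return None  # hold (or roll back) instead of expanding
    for count in stages:
        if count > upgraded:
            return count
    return None  # every stage has been reached
```

For a 40-instance fleet this yields a canary stage of 2 instances, then 10, 20, and 40. Expansion stops as soon as the health signal turns false.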

Capability-affecting upgrades

Use when capabilities or protocol features change:

Validate negotiated capability behavior (welcome.caps)
Test mixed-client behavior (old/new SDK sessions)
Delay enabling new capability-dependent paths until the baseline fleet is updated
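
One way to decide when a capability-dependent path is safe to enable is to look at the capabilities actually negotiated across recent sessions. The sketch below assumes you can collect the contents of welcome.caps per session as a set of strings. That shape, the function name, and the capability names in the usage note are assumptions made for illustration.

```python
def can_enable_capability(
    capability: str,
    negotiated_caps: list[set[str]],
    min_share: float = 1.0,
) -> bool:
    """Decide whether enough observed sessions negotiated a capability.

    negotiated_caps holds one set of capability names per observed session,
    assumed to be taken from welcome.caps. min_share is the fraction of
    sessions that must support the capability (1.0 means all of them).
    """
    if not negotiated_caps:
        return False  # no evidence yet, so keep the new path disabled
    supporting = sum(1 for caps in negotiated_caps if capability in caps)
    return supporting / len(negotiated_caps) >= min_share
```

With sessions [{"sync", "blob"}, {"sync"}], a hypothetical "blob" capability would stay disabled at the default threshold because one session did not negotiate it.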

Storage-impacting upgrades

Use when persistence format or archive/snapshot semantics are affected:

Take backup/export artifacts first
Validate replay/hash invariants before and after upgrade
Define rollback window before applying irreversible migrations
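
The before/after comparison can be illustrated with a generic digest over an ordered room history. This is a sketch under stated assumptions: the server's actual replay/hash invariants are not specified here, so the event representation (a list of byte strings) and both function names are hypothetical.

```python
import hashlib


def replay_digest(events: list[bytes]) -> str:
    """Hash an ordered sequence of events into one hex digest."""
    h = hashlib.sha256()
    for event in events:
        # Length-prefix each event so different splits of the same bytes
        # produce different digests.
        h.update(len(event).to_bytes(8, "big"))
        h.update(event)
    return h.hexdigest()


def changed_rooms(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return room IDs whose digest differs or is missing after the upgrade."""
    return sorted(
        room for room, digest in before.items() if after.get(room) != digest
    )
```

Record a digest per room before the upgrade, recompute afterwards, and treat any non-empty result from changed_rooms as a reason to pause the migration.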

Preflight checklist

Record current server version, SDK versions, and deployed config
Snapshot active runtime flags and environment variable values
Export backup artifacts (archives/snapshots as appropriate)
Confirm compatibility expectations in api-reference/compatibility-and-support
Identify commands/endpoints used by your clients and operator tooling
Prepare canary environment with representative traffic profile
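
The first two checklist items can be captured in a single artifact that is stored alongside the backup. The sketch below is illustrative only: the field names, the APP_ environment prefix, and the version strings in the usage note are assumptions, not values defined by the server.

```python
import json
from datetime import datetime, timezone


def build_preflight_snapshot(
    server_version: str,
    sdk_versions: list[str],
    flags: dict[str, object],
    env: dict[str, str],
    env_prefix: str = "APP_",  # hypothetical prefix for relevant variables
) -> dict:
    """Collect versions, runtime flags, and selected env values in one record."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "server_version": server_version,
        "sdk_versions": sorted(sdk_versions),
        "flags": dict(sorted(flags.items())),
        # Keep only variables with the chosen prefix to limit accidental
        # capture of unrelated values.
        "env": {k: v for k, v in sorted(env.items()) if k.startswith(env_prefix)},
    }


def snapshot_to_json(snapshot: dict) -> str:
    """Serialize the snapshot in a stable, diff-friendly form."""
    return json.dumps(snapshot, indent=2, sort_keys=True)
```

A typical call would pass dict(os.environ) as env and write the JSON next to the exported archives. Because environment values can include secrets, store the file accordingly.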

Rollout checklist

Record current versions and config snapshot
Upgrade one canary instance
Validate:
- WebSocket handshake and sync (hello -> welcome -> pack/request)
- blob paths (blob-upload and/or direct I/O flow)
- one command per control-plane family used in production
Run replay/hash validation on canary room data
Expand rollout in bounded increments
Monitor close codes, reject reason classes, latency, and error rate
Promote the rollout only after a stable observation window
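
The handshake and sync validation step in the checklist above can be partly automated by checking the order of message types observed in a canary session. This sketch assumes the type names hello, welcome, pack, and request appear literally in a captured trace. The wire format and the function name are assumptions, so adapt the check to what your client or proxy actually logs.

```python
def handshake_in_order(message_types: list[str]) -> bool:
    """Check that hello precedes welcome and sync traffic follows welcome.

    Sync traffic here means any "pack" or "request" message.
    """
    try:
        hello_at = message_types.index("hello")
        welcome_at = message_types.index("welcome")
    except ValueError:
        return False  # the handshake never completed
    if hello_at >= welcome_at:
        return False
    sync_positions = [
        i for i, kind in enumerate(message_types) if kind in ("pack", "request")
    ]
    # Require at least one sync message, and none before welcome.
    return bool(sync_positions) and min(sync_positions) > welcome_at
```

A trace of hello, welcome, pack passes. A trace where pack arrives before welcome, or where no sync message ever appears, fails.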

Rollback planning

Define rollback before rollout:

Binary rollback path (previous image/package)
Config rollback path (flags/env)
Data rollback constraints (what can and cannot be reversed)
Trigger thresholds for rollback (error/reject spikes, convergence failures, blob failures)
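
Writing the trigger thresholds down as data makes the rollback decision mechanical during an incident. The following is a minimal sketch: the metric names and the numeric limits are placeholder assumptions and should be replaced with values derived from your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Placeholder limits; derive real values from pre-upgrade baselines.
    max_error_rate: float = 0.02
    max_reject_rate: float = 0.05
    max_convergence_failures: int = 0
    max_blob_failure_rate: float = 0.01


def tripped_triggers(
    metrics: dict[str, float], limits: RollbackThresholds
) -> list[str]:
    """Return the names of metrics that exceed their rollback threshold."""
    checks = {
        "error_rate": limits.max_error_rate,
        "reject_rate": limits.max_reject_rate,
        "convergence_failures": limits.max_convergence_failures,
        "blob_failure_rate": limits.max_blob_failure_rate,
    }
    # A missing metric counts as tripped: no visibility is not a healthy state.
    return [
        name
        for name, limit in checks.items()
        if metrics.get(name, float("inf")) > limit
    ]
```

Any non-empty result means a predefined trigger has fired and the rollback steps apply.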

If rollback is needed:

Stop rollout immediately
Revert runtime binary and config
Verify reconnect and convergence behavior
Re-run replay/hash checks on impacted rooms
Document incident and update future preflight gates

Post-upgrade verification

Confirm expected route surfaces (/ws/:room_id, /metrics)
Validate command contract behavior for critical workflows
Verify no sustained spike in:
- error envelope rates
- control-plane rejections
- close codes (4001, 4002, 4008, 1011)
Verify that persistence and GC lifecycle behavior remain healthy
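
To separate a sustained spike from a brief blip, compare several consecutive observation windows against a pre-upgrade baseline. The sketch below does this for the close codes listed above. The window shape (a mapping of close code to count), the 2x factor, and the three-window requirement are assumptions chosen for illustration.

```python
WATCHED_CLOSE_CODES = (4001, 4002, 4008, 1011)


def sustained_spikes(
    baseline: dict[int, float],
    windows: list[dict[int, int]],
    factor: float = 2.0,
    min_windows: int = 3,
) -> list[int]:
    """Return watched close codes that stay above factor * baseline.

    baseline maps close code to its typical per-window count before the
    upgrade. windows holds per-window counts after the upgrade, oldest first.
    A code is reported only if it exceeds the limit in every one of the most
    recent min_windows windows.
    """
    if len(windows) < min_windows:
        return []  # not enough data to call anything sustained
    recent = windows[-min_windows:]
    spiking = []
    for code in WATCHED_CLOSE_CODES:
        limit = factor * baseline.get(code, 0.0)
        if all(window.get(code, 0) > limit for window in recent):
            spiking.append(code)
    return spiking
```

A code that exceeds the limit in only one window is ignored, which keeps short reconnect storms during instance restarts from blocking promotion.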

Common upgrade pitfalls

Upgrading server without testing capability negotiation with existing clients
Enabling new command flows before issuing correctly scoped tokens
Treating storage-impacting changes as binary-only rollouts
Skipping replay/hash validation after migration steps
Rolling back binaries without checking persistence compatibility assumptions
