Metrics and observability
NodalMerge exposes operational metrics and structured logs designed for production triage. This page focuses on:
- Enabling metrics safely
- Understanding key metric families
- Building a practical triage workflow
Enable metrics endpoint
Metrics are exposed from a dedicated admin listener:
http://127.0.0.1:9090/metrics
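As a quick safety check, the sketch below confirms that a metrics URL points at a loopback address and parses a scrape body into samples. It assumes the endpoint serves the Prometheus text exposition format, which this page does not state explicitly. The helper names is_loopback_url and parse_metrics are illustrative and are not part of NodalMerge.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Matches: name{label="v",...} value  (labels optional, trailing timestamp ignored)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def is_loopback_url(url):
    """Return True if the URL host is 'localhost' or a loopback IP address."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # some other DNS name; this check cannot vouch for it


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything that does not look like a sample
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape body, for illustration only.
body = 'nodalmerge_rooms_total 3\nnodalmerge_peers_total{room="a"} 5\n'
samples = parse_metrics(body)
```

To inspect a live server, fetch the URL with urllib.request.urlopen and pass the decoded body to parse_metrics. A listener address that fails the loopback check deserves a second look before it reaches production.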
Logging baseline
Server logging uses RUST_LOG/EnvFilter.
Typical baseline:
- info for normal operation
- targeted debug/trace for incident investigation
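The baseline above maps onto EnvFilter's comma-separated directive syntax: a bare level sets the default, and target=level entries raise verbosity for specific modules. The sketch below builds such a RUST_LOG value. The target name nodalmerge::sync is hypothetical, chosen only to show the shape; use the module paths your build actually logs under.

```python
VALID_LEVELS = ("error", "warn", "info", "debug", "trace")


def build_rust_log(baseline="info", overrides=None):
    """Build a RUST_LOG value: a default level plus per-target overrides."""
    overrides = overrides or {}
    for level in [baseline, *overrides.values()]:
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown level: {level!r}")
    # Sort targets so the resulting string is stable across runs.
    parts = [baseline]
    parts += [f"{target}={level}" for target, level in sorted(overrides.items())]
    return ",".join(parts)


# Normal operation: info everywhere.
normal = build_rust_log()

# Incident investigation: keep info globally, raise one (hypothetical) target.
incident = build_rust_log("info", {"nodalmerge::sync": "debug"})
```

Keeping the global level at info and raising only the targets under investigation limits log volume during an incident and makes it easy to revert afterwards.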
Core metric families
Capacity and occupancy
- nodalmerge_rooms_total
- nodalmerge_peers_total{room}
- nodalmerge_room_bytes_resident
Sync and ingest health
- nodalmerge_nodes_accepted_total{room}
- nodalmerge_merge_batch_seconds
- nodalmerge_filtered_nodes_total
- nodalmerge_filtered_bytes_total
- nodalmerge_filtered_pack_dropped_total
Persistence performance
- nodalmerge_persistence_write_seconds{kind=node|nodes_batch|blob}
Lifecycle and GC
- nodalmerge_eviction_total
- nodalmerge_blob_gc_deleted_total{room}
- nodalmerge_gc_runs_total
- nodalmerge_gc_marked_total
- nodalmerge_gc_error_total
Safety and guardrail events
- nodalmerge_broadcast_lagged_total
- nodalmerge_ws_send_timeout_total
- nodalmerge_rate_limit_drops_total
- nodalmerge_lamport_rejected_total{reason}
- nodalmerge_token_expired_disconnects_total
Topology/query operational metrics
NodalMerge also emits metrics for topology promotion and query/projection pipelines (inflight, queue depth, queue wait, outcome totals). Track these if you run manager/worker or query-heavy workloads.
Recommended dashboard slices
Start with four dashboards:
- Traffic and occupancy: rooms, peers, accepted nodes
- Latency and persistence: merge/persistence histograms
- Lifecycle and GC: eviction + GC run/error/delete trends
- Guardrail events: lagged peers, rate-limit drops, token-expiry disconnects, lamport rejects
Alerting suggestions
Good first alerts:
- Sustained increase in nodalmerge_merge_batch_seconds p95/p99
- Persistent growth in nodalmerge_persistence_write_seconds p95/p99
- Non-trivial nodalmerge_gc_error_total increase
- Sudden spikes in nodalmerge_rate_limit_drops_total or nodalmerge_broadcast_lagged_total
- Unexpected rise in nodalmerge_lamport_rejected_total
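The p95/p99 alerts depend on estimating a quantile from histogram buckets. The sketch below shows one common approach: linear interpolation inside the bucket that contains the target rank, similar in spirit to what Prometheus-style tooling does. It assumes the latency metrics are exported as cumulative buckets with upper bounds. The bucket boundaries and counts in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    Counts should be the increase over the alert window, not lifetime totals.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations in the window
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical window: 100 merge batches, bucket bounds in seconds.
window = [(0.01, 50), (0.05, 90), (0.1, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, window)
p99 = histogram_quantile(0.99, window)
```

Here the p99 rank lands in the +Inf bucket, so the estimate is capped at the highest finite bound. When that happens often, the bucket layout is too coarse for a meaningful p99 alert.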
Incident triage workflow
When sync health degrades:
- Check occupancy and throughput (rooms_total, peers_total, accepted nodes)
- Check merge and persistence latencies
- Check guardrail counters (lagged/rate-limit/timeout/reject)
- Correlate with logs for affected room/peer prefixes
- Validate whether lifecycle jobs (eviction/GC) coincided with onset
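Part of this workflow can be sketched as a comparison of two metric snapshots, taken before and after the onset, reported in triage order. The code below assumes each snapshot is a plain mapping from metric name to a value already summed across labels; that shape is an illustration, not a NodalMerge API. The latency step needs histogram quantiles and the log correlation step needs log access, so both are left out.

```python
# Steps follow the workflow order; metric names are the ones listed on this page.
TRIAGE_STEPS = [
    ("occupancy and throughput", [
        "nodalmerge_rooms_total",
        "nodalmerge_peers_total",
        "nodalmerge_nodes_accepted_total",
    ]),
    ("guardrail counters", [
        "nodalmerge_broadcast_lagged_total",
        "nodalmerge_rate_limit_drops_total",
        "nodalmerge_ws_send_timeout_total",
        "nodalmerge_lamport_rejected_total",
    ]),
    ("lifecycle jobs", [
        "nodalmerge_eviction_total",
        "nodalmerge_gc_runs_total",
        "nodalmerge_gc_error_total",
    ]),
]


def triage_report(before, after):
    """Return [(step, {metric: delta})] in triage order, keeping nonzero deltas."""
    report = []
    for step, names in TRIAGE_STEPS:
        deltas = {}
        for name in names:
            delta = after.get(name, 0.0) - before.get(name, 0.0)
            if delta != 0:
                deltas[name] = delta
        report.append((step, deltas))
    return report


# Hypothetical snapshots around an incident.
before = {"nodalmerge_peers_total": 10, "nodalmerge_rate_limit_drops_total": 5}
after = {
    "nodalmerge_peers_total": 40,
    "nodalmerge_rate_limit_drops_total": 25,
    "nodalmerge_gc_runs_total": 1,
}
report = triage_report(before, after)
```

In this made-up example peers tripled, rate-limit drops rose, and a GC run fell inside the window. That points at a load spike first, with GC as a possible contributor to rule out.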
Cardinality and labeling hygiene
Labels are intentionally bounded in core metrics. Operator guidance:
- Avoid adding unbounded high-cardinality labels in custom instrumentation
- Prefer peer prefixes or room-level aggregation over full identities
- Build drill-down dashboards separately from overview dashboards
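One way to keep custom instrumentation honest is to count distinct label sets per metric and flag anything over a budget. The sketch below does that for a list of (name, labels) pairs. The budget value and the custom_peer_events_total metric are hypothetical and exist only to show a label that grows with every peer.

```python
from collections import defaultdict


def series_per_metric(samples):
    """Count distinct label sets per metric name."""
    seen = defaultdict(set)
    for name, labels in samples:
        seen[name].add(frozenset(labels.items()))
    return {name: len(label_sets) for name, label_sets in seen.items()}


def over_budget(samples, budget):
    """Return (name, series_count) pairs above the budget, worst first."""
    counts = series_per_metric(samples)
    offenders = [(name, count) for name, count in counts.items() if count > budget]
    return sorted(offenders, key=lambda item: (-item[1], item[0]))


# Hypothetical scrape: a bounded core metric plus a custom metric
# that was labeled with full peer identities.
scrape = [("nodalmerge_peers_total", {"room": "a"}),
          ("nodalmerge_peers_total", {"room": "b"})]
scrape += [("custom_peer_events_total", {"peer": f"peer-{i}"}) for i in range(50)]

offenders = over_budget(scrape, budget=10)
```

Running a check like this against a staging scrape before shipping new instrumentation catches runaway labels while they are still cheap to fix.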
Common observability mistakes
- Exposing metrics endpoint publicly
- Running permanently at debug/trace verbosity
- Alerting on raw counters without rate windows
- Ignoring guardrail counters because they “self-heal”
- Treating room-level outliers as global regressions without segmentation
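The point about raw counters is worth making concrete: a counter's absolute value mostly reflects uptime, and it drops to zero on restart. The sketch below computes the increase and per-second rate over a window of (timestamp, value) samples, treating any decrease as a reset. The sample values are invented.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The value dropped, so assume a restart; count from zero again.
            increase += cur
    return increase


def rate_per_second(samples):
    """Average per-second rate over the window covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed


# Hypothetical nodalmerge_rate_limit_drops_total samples, one per minute,
# with a process restart between the second and third scrape.
window = [(0, 100), (60, 130), (120, 10), (180, 40)]
total = counter_increase(window)
rate = rate_per_second(window)
```

An alert on the raw value would have fired at 100 and then gone quiet after the restart, while the windowed rate stays steady across it.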