Running a large-scale distributed database in production is hard. Upgrading it — across more than 1,000 nodes, spanning multiple datacenters, while serving live production traffic — is an entirely different challenge. That is exactly what Yelp's Database Reliability Engineering (DRE) team accomplished with their recent zero-downtime Apache Cassandra upgrade. Their story is a compelling lesson in disciplined engineering, deep automation, and the kind of rigorous observability that turns a potentially catastrophic event into a routine operation.
Why This Is Hard
Unlike stateless services, Cassandra is deeply stateful. Data is distributed and replicated across nodes using a consistent hashing ring. Nodes form a gossip-based cluster and must remain interoperable throughout any upgrade process. A misstep — taking down too many nodes at once, introducing a protocol incompatibility, or failing to detect a silent data inconsistency — can cascade into degraded availability or, in the worst case, data loss.
Blue-green deployments, the go-to strategy for stateless services, are impractical for a database of this scale. Running a second, full-scale Cassandra cluster in parallel while keeping data in sync would be prohibitively expensive and operationally complex. Yelp instead chose a rolling in-place upgrade, leaning entirely on Cassandra's native backward compatibility guarantees and their own engineering discipline.
Preparation: Getting the Cluster Ready
Before a single node was touched in production, the DRE team invested heavily in preparation.
Schema Consistency
All keyspaces and tables were audited to ensure they were running a consistent schema version across every replica. Schema disagreements between nodes can cause subtle issues during rolling upgrades, so this was a prerequisite, not an afterthought.
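The post does not show the tooling behind this audit, but schema agreement is visible directly in Cassandra's system tables. Below is a minimal sketch of such a check, assuming the DataStax Python driver and a hypothetical seed host; `nodetool describecluster` exposes the same information from the command line.

```python
# Minimal schema-agreement check. The contact point is a hypothetical
# hostname; in practice it would come from service discovery.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-seed-1.example.com"])
session = cluster.connect()

# Each node reports its schema version in system.local / system.peers;
# a healthy cluster shows exactly one distinct version across all nodes.
local = session.execute("SELECT schema_version FROM system.local").one()
peers = session.execute("SELECT peer, schema_version FROM system.peers")

versions = {str(local.schema_version)}
versions.update(str(row.schema_version) for row in peers)

if len(versions) == 1:
    print(f"Schema is consistent: {versions.pop()}")
else:
    raise SystemExit(f"Schema disagreement across nodes: {sorted(versions)}")

cluster.shutdown()
```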
Full Repair
Cassandra relies on anti-entropy repair to bring replicas back in sync after periods of inconsistency (such as when a node was temporarily unavailable). Before the upgrade began, the team ran a full repair across all clusters. This ensured every replica held an up-to-date copy of all data, reducing the risk that a node restarting mid-upgrade would serve stale reads.
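The repair commands themselves are not given in the post; assuming standard nodetool, a full repair rolled across the fleet one node at a time might be driven like this (the node inventory and SSH invocation are illustrative placeholders):

```python
# Drive a full (non-incremental) anti-entropy repair across every node,
# strictly one node at a time. Node names are hypothetical.
import subprocess

NODES = ["cassandra-node-01", "cassandra-node-02"]  # full node inventory here

for node in NODES:
    # 'nodetool repair -full' repairs all keyspaces on this node,
    # streaming any missing or stale data from its replica peers.
    subprocess.run(["ssh", node, "nodetool", "repair", "-full"], check=True)
    print(f"Full repair completed on {node}")
```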
Snapshots and Backups
As a final safety net, full snapshots were taken of all keyspaces. In a disaster scenario, these snapshots provide a rollback path to the pre-upgrade state.
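The snapshot step maps to a standard nodetool operation; a small sketch, with a hypothetical node list and tag naming scheme:

```python
# Take a named snapshot on every node before the upgrade begins.
# Snapshots are hard links to the current SSTables, so they are cheap
# to create and live on each node's local disk.
import subprocess
from datetime import date

NODES = ["cassandra-node-01", "cassandra-node-02"]  # full node inventory here
TAG = f"pre-upgrade-{date.today().isoformat()}"     # hypothetical naming scheme

for node in NODES:
    subprocess.run(["ssh", node, "nodetool", "snapshot", "-t", TAG], check=True)
    print(f"Snapshot {TAG} taken on {node}")
```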
Staging Validation
The entire upgrade was rehearsed in a non-production staging environment that mirrored the production cluster topology. This allowed the team to surface tooling bugs, unexpected compatibility issues, and edge cases before they could affect real users.
The Rolling Upgrade Strategy
Cassandra explicitly supports rolling upgrades — the protocol is designed so that upgraded and non-upgraded nodes can coexist in the same cluster for the duration of the upgrade. Yelp exploited this guarantee to upgrade nodes incrementally, validating each step before proceeding.
Rack-Aware Sequencing
Cassandra's replication is topology-aware. Data is replicated across racks (and datacenters) to ensure that a full rack failure does not cause data unavailability. To maintain this guarantee during the upgrade, Yelp's automation ensured that at most one node per rack was ever offline at the same time. This meant that for any given partition key, at least a quorum of replicas remained available at all times.
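The post does not show how the ordering was derived. Since each node is upgraded and baked in before the next one starts (see the orchestrator section below), one simple rack-aware ordering is to walk the topology rack by rack, which trivially keeps at most one node per rack offline. A sketch with a made-up topology and a placeholder upgrade call:

```python
# Rack-aware, strictly sequential upgrade order: nodes are processed one
# at a time, walking rack by rack, so no more than one node (and hence
# no more than one node per rack) is ever offline. The topology data and
# upgrade_node() call are hypothetical placeholders.
from collections import defaultdict

TOPOLOGY = [  # (node, rack) pairs, e.g. derived from 'nodetool status'
    ("node-01", "rack-a"), ("node-02", "rack-a"),
    ("node-03", "rack-b"), ("node-04", "rack-b"),
    ("node-05", "rack-c"), ("node-06", "rack-c"),
]

by_rack = defaultdict(list)
for node, rack in TOPOLOGY:
    by_rack[rack].append(node)

for rack, nodes in sorted(by_rack.items()):
    for node in nodes:
        print(f"Upgrading {node} in {rack}")
        # upgrade_node(node)  # hypothetical call; blocks until bake-in passes
```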
Per-Datacenter Staggering
Upgrades were staggered across datacenters. One datacenter was upgraded fully and validated before the next was touched. This limited the blast radius of any regression to a single datacenter and provided an opportunity to observe the impact of the upgrade on real traffic patterns before rolling it out more broadly.
Consistency Level Tolerance
Applications reading from and writing to Cassandra use quorum-based consistency levels. A quorum read or write succeeds as long as a majority of replicas respond. By ensuring no more than one node per rack was offline during the upgrade, the cluster always maintained quorum, so client applications experienced no failures.
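As a concrete illustration (the replication factor here is an assumption for the example, not a figure from the post), the arithmetic looks like this:

```python
# Quorum arithmetic for an assumed replication factor of 3 per datacenter.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

rf = 3
print(quorum(rf))       # 2 -> a quorum read/write needs 2 of 3 replicas
print(rf - quorum(rf))  # 1 -> at most one replica per partition may be down
```

Losing a single replica therefore never breaks quorum for reads or writes.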
Automation: The Upgrade Orchestrator
Manually upgrading more than 1,000 nodes is not feasible. Yelp built a dedicated upgrade orchestration system that automated the per-node upgrade lifecycle (a simplified sketch follows the list):
- Drain the node — Flush all in-memory data (memtables) to disk and stop accepting new write coordination requests. This is a graceful handoff; the node finishes in-flight requests before yielding.
- Stop the Cassandra service — The node is taken offline cleanly.
- Upgrade the packages — Install the new Cassandra version via the internal package repository, along with any dependent changes (JVM version, sidecar agents, OS patches).
- Restart Cassandra — Bring the node back up on the new version.
- Health checks — Confirm gossip is converged, the native transport is accepting connections, and key metrics (heap utilization, GC pause times, pending compactions) are within normal bounds.
- Re-enable traffic — Allow the node to resume serving coordinator and replica traffic.
- Bake-in period — Monitor the newly upgraded node for a configurable window before proceeding to the next node. This detects regressions that only manifest under real production load.
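Yelp's orchestrator itself is internal and not shown in the post; a heavily simplified sketch of the per-node loop it describes, with hypothetical helper names, package commands, and thresholds, might look like:

```python
# Heavily simplified per-node upgrade loop. Helper names, package
# commands, and thresholds are hypothetical; Yelp's orchestrator is
# internal and not published.
import subprocess
import time

BAKE_IN_SECONDS = 30 * 60  # configurable bake-in window

def run_on_node(node: str, *cmd: str) -> None:
    subprocess.run(["ssh", node, *cmd], check=True)

def node_healthy(node: str) -> bool:
    # Placeholder: check gossip convergence, native transport, heap
    # utilization, GC pauses, and pending compactions against baselines.
    return True

def upgrade_node(node: str) -> None:
    run_on_node(node, "nodetool", "drain")                        # flush memtables, stop taking traffic
    run_on_node(node, "sudo", "systemctl", "stop", "cassandra")   # clean shutdown
    run_on_node(node, "sudo", "apt-get", "install", "-y", "cassandra")  # hypothetical package step
    run_on_node(node, "sudo", "systemctl", "start", "cassandra")  # restart on the new version

    deadline = time.monotonic() + BAKE_IN_SECONDS
    while time.monotonic() < deadline:                            # bake-in under real load
        if not node_healthy(node):
            # A real orchestrator would halt the rollout, page on-call,
            # and optionally roll the node back to the previous version.
            raise RuntimeError(f"{node} unhealthy after upgrade")
        time.sleep(60)
    print(f"{node} upgraded and healthy; safe to proceed to the next node")
```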
If any health check fails or a metric breaches a threshold during the bake-in period, the automation halts and pages the on-call DRE engineer. Critically, the orchestrator also supported automated rollback — if a newly upgraded node exhibited immediate problems, it could be rolled back to the previous version without requiring manual intervention.
Observability: Seeing Everything
No zero-downtime upgrade story is complete without a discussion of observability. Yelp's approach to cluster visibility was a foundational part of what made this upgrade possible.
Node-Level Metrics
For each node, the team tracked the following signals (a sampling sketch follows the list):
- Heap utilization and GC pause frequency — A sudden spike in garbage collection activity often signals memory pressure from the new version.
- Disk space and pending compactions — An upgrade can change compaction behavior, and a growing backlog of pending compactions is an early warning sign of disk pressure.
- Error rates — Read, write, and coordinator error counts are sensitive indicators of any behavioral regression introduced by the upgrade.
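The collection pipeline is Yelp-internal; a small sketch of sampling a few of these signals with standard nodetool commands (the hostname and transport are illustrative, and in practice such metrics are scraped continuously, e.g. via JMX):

```python
# Sample a handful of per-node health signals via standard nodetool
# commands over SSH. The node name is a hypothetical placeholder.
import subprocess

def nodetool(node: str, *args: str) -> str:
    result = subprocess.run(
        ["ssh", node, "nodetool", *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

node = "cassandra-node-01"  # hypothetical node
print(nodetool(node, "info"))             # heap usage, uptime, load
print(nodetool(node, "gcstats"))          # GC pause statistics
print(nodetool(node, "compactionstats"))  # pending and active compactions
print(nodetool(node, "tpstats"))          # thread pools and dropped messages
```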
Cluster-Level Metrics
Beyond individual nodes, the team monitored the health of the cluster as a whole (a guardrail sketch follows the list):
- Read and write latencies (p50, p99, p999) — Latency regressions are often the first visible sign that something has gone wrong.
- Dropped mutations — Mutations are dropped when a node is overloaded and cannot process incoming writes fast enough. Any increase here during the upgrade warranted immediate investigation.
- Coordinator statistics — The coordinator for a request is responsible for routing reads and writes to the correct replicas. Coordinator-level errors indicate client-visible failures.
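None of Yelp's actual thresholds are published; the sketch below shows the general shape of a guardrail that compares these cluster-level signals against pre-upgrade baselines and halts when any of them drifts too far (the baseline values, tolerance factor, and metrics source are all assumptions):

```python
# Guardrail comparing cluster-level metrics against pre-upgrade baselines.
# Baselines, the tolerance factor, and fetch_metric() are hypothetical.
BASELINES = {"read_p99_ms": 12.0, "write_p99_ms": 6.0, "dropped_mutations_per_min": 0.0}
TOLERANCE = 1.5   # halt if a metric exceeds 150% of its baseline
MIN_FLOOR = 1.0   # absolute slack so zero-valued baselines do not page on noise

def fetch_metric(name: str) -> float:
    # Placeholder: in practice this would query the metrics backend.
    return BASELINES[name]

def cluster_healthy() -> bool:
    for name, baseline in BASELINES.items():
        current = fetch_metric(name)
        if current > baseline * TOLERANCE + MIN_FLOOR:
            print(f"Guardrail breached: {name}={current:.2f} (baseline {baseline:.2f})")
            return False
    return True

if not cluster_healthy():
    raise SystemExit("Halting upgrade: cluster metrics regressed")
```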
Upgrade Progress Dashboards
Custom dashboards gave the DRE team a real-time view of which nodes had been upgraded, which were currently in progress, and which were pending. This allowed multiple team members to observe the upgrade state simultaneously and catch anomalies early.
Alerting with Tight Thresholds
During the upgrade window, alerting thresholds were deliberately tightened. Alerts that would normally be informational became pages. The philosophy was simple: it is far better to be woken up for a false alarm than to miss an early signal that the upgrade was silently degrading the cluster.
Lessons Learned
Yelp's experience surfaced a number of broadly applicable lessons for teams considering similar upgrades.
Automation is not optional at this scale. A 1,000-node rolling upgrade cannot be executed safely by humans following a runbook. The orchestration system was not just a convenience — it was the only way to ensure the upgrade was consistent, repeatable, and recoverable across every node in every cluster.
Observability must be built before the upgrade begins. Instrumenting the cluster during the upgrade, when you are already operating under stress, is too late. Yelp's ability to catch regressions early depended on monitoring infrastructure that had been in place for months, with well-understood baselines for every metric.
In-place upgrades beat blue-green for stateful systems. For databases at Yelp's scale, the operational cost of running a parallel cluster and synchronizing data in real time would have been enormous. Rolling upgrades, while slower, are more resource-efficient and, with the right tooling, just as safe.
Test in staging — exhaustively. The investment in a production-mirroring staging environment caught issues that would have been disruptive at production scale. Running the full upgrade end-to-end in staging before touching production paid dividends in engineer confidence and reduced incident risk.
Rack-aware sequencing is non-negotiable. Respecting Cassandra's topology during the upgrade is what preserved availability. Any deviation from the one-node-per-rack rule would have put replication quorums at risk.
Conclusion
Yelp's zero-downtime Cassandra upgrade across 1,000+ nodes is a testament to what is achievable when an engineering team combines deep system knowledge with rigorous automation and comprehensive observability. The upgrade required no parallel cluster, no prolonged maintenance window, and no user-facing downtime — just meticulous preparation, disciplined sequencing, and tooling built to catch problems before they become incidents.
For any organization operating Cassandra at scale, or considering a major database version upgrade, Yelp's story provides a concrete and detailed playbook: repair before you start, automate every step, respect your topology, watch everything, and halt at the first sign of trouble.
Reference: Zero Downtime Upgrade: Yelp's Cassandra Upgrade Story