When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World

2026-05-10 · 5 min read

For most of the cloud era, "high availability" has meant designing around hardware and software faults: a rack dies, an availability zone goes dark, a deploy goes wrong. Multi-AZ architectures, regional load balancing, and runbooks for failover have been our defenses. In this InfoQ article, Rohan Vardhan, a senior software engineer at Meta, argues that this model is no longer sufficient. Cloud regions aren't just technical constructs — they sit inside political and legal jurisdictions. And in a world of sanctions, internet shutdowns, and aggressive data-localization laws, entire regions can disappear for reasons that no amount of multi-AZ redundancy can mitigate.

The Threat Model Has Changed

Vardhan's central observation is that the failure modes we design around have quietly multiplied. Your standard list might be:

  • Hardware failure — disk, NIC, server, rack.
  • Network partition — a switch, a fiber cut, a misrouted BGP advertisement.
  • Software bug — a bad deploy, a config rollout gone wrong.

But to that list we now have to add a category that doesn't fit on a status page:

  • Sanctions — a vendor is legally required to terminate service in your region.
  • Internet shutdowns — a state-mandated network blackout cuts off entire regions.
  • Data-localization laws — replication that used to "just work" becomes illegal overnight.
  • Physical conflict — kinetic events damage infrastructure and disrupt access.

Critically, these failures cut across availability zones. A region in a sanctioned country isn't partially unavailable — every AZ in it goes dark at once. Multi-AZ design, the workhorse of cloud resilience, offers zero protection.

Sovereign Fault Domains: A New Primitive

The article introduces a useful concept: the sovereign fault domain. Where an availability zone is a physical failure boundary, a sovereign fault domain is a political/legal one — defined by the jurisdiction whose laws can render a region unavailable.

The mapping is direct:

  • Sanctions → forced removal of dependencies.
  • Internet shutdowns → durable, large-scale network partitions.
  • Data localization → replication constraints that prevent normal failover.
  • Physical conflict → infrastructure damage and access loss.

Treating sovereign fault domains as first-class failure boundaries forces a different conversation: not "can we tolerate an AZ outage?" but "can we tolerate losing every AZ in a jurisdiction simultaneously, possibly without warning, and possibly without ever getting them back?"
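The article stays at the conceptual level, but the idea translates directly into code. Here is a minimal sketch that assumes nothing more than a name and a governing jurisdiction per region; the `Region` shape and function names are illustrative, not taken from the article:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str          # e.g. "eu-west-1"
    jurisdiction: str  # the legal domain whose laws can take the region offline

def sovereign_fault_domains(regions: list[Region]) -> dict[str, list[Region]]:
    """Group regions by jurisdiction: each group can fail as one unit."""
    domains: dict[str, list[Region]] = {}
    for region in regions:
        domains.setdefault(region.jurisdiction, []).append(region)
    return domains

def survives_jurisdiction_loss(deployment: list[Region]) -> bool:
    """A deployment can lose an entire jurisdiction only if it spans
    at least two sovereign fault domains."""
    return len(sovereign_fault_domains(deployment)) >= 2
```

The point of modeling it this way is that a three-region deployment inside a single country still has exactly one sovereign fault domain, no matter how many AZs it spans.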

A Concrete Case Study: Russia, 2022

The article's most vivid example is the 2022 withdrawal of major cloud providers (AWS, Microsoft, Google, IBM) from Russia following sanctions. From a technical standpoint, nothing was broken. From a legal standpoint, the regions and services those providers operated in Russia were forced offline, abruptly, even for fully compliant customers.

Architectures that assumed graceful, voluntary failover had no good answer. There was no slow degradation to autoscale around. There was no bug to roll back. The region was politically gone, and any system whose disaster recovery story relied on "we'll bring it back up in a few hours" was simply wrong.

The lesson: regional boundaries are not always independent — they can be correlated through jurisdiction. An outage in one cloud region in one country can correlate with outages in every region of that country, all at once.

Architectural Recommendations

Vardhan's recommendations are pragmatic and aimed at making sovereign resilience the default for critical systems.

1. Multi-Region as a Baseline

Multi-AZ is no longer enough for systems whose continuity matters. Critical workloads should be designed to fail over across regions, ideally across sovereign jurisdictions. Treat regional boundaries as potentially correlated when they share a jurisdiction; treat truly independent jurisdictions as the meaningful failure boundary.
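One sketch of what that could look like in a failover policy, again with an illustrative `Region` shape; the ranking rule below is an assumption about how one might encode the principle, not the article's prescription:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str
    jurisdiction: str

def rank_failover_targets(primary: Region, candidates: list[Region]) -> list[Region]:
    """Order failover candidates so regions in a different jurisdiction come
    first: regions sharing the primary's jurisdiction may fail with it."""
    return sorted(
        candidates,
        key=lambda r: (r.jurisdiction == primary.jurisdiction, r.name),
    )

# Example: failing over from a German region should prefer non-German targets.
primary = Region("de-central-1", "DE")
targets = [Region("de-north-1", "DE"), Region("ch-zurich-1", "CH")]
print([r.name for r in rank_failover_targets(primary, targets)])
# -> ['ch-zurich-1', 'de-north-1']
```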

2. Geopolitical RTO/RPO

Define Recovery Time Objectives and Recovery Point Objectives for sovereign disruption scenarios — and make them explicit, separate from your hardware-driven RTO/RPO. "How long can we tolerate losing this region if it never comes back?" is a very different question from "How long can we tolerate an AZ outage?"
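One lightweight way to make that separation explicit is to keep the two sets of objectives side by side, keyed by scenario class. The tiers and numbers below are purely illustrative:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # max tolerable time to restore service
    rpo: timedelta  # max tolerable window of lost data

# Hypothetical tiers: hardware faults assume the region comes back;
# sovereign disruption assumes it may never come back.
OBJECTIVES = {
    "az_outage":            RecoveryObjectives(rto=timedelta(minutes=5), rpo=timedelta(seconds=30)),
    "region_outage":        RecoveryObjectives(rto=timedelta(hours=1),   rpo=timedelta(minutes=5)),
    "sovereign_disruption": RecoveryObjectives(rto=timedelta(hours=24),  rpo=timedelta(minutes=15)),
}
```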

3. Pre-Built Evacuation Playbooks

Decide before a crisis how you'd:

  • Re-home traffic to another jurisdiction.
  • Migrate or re-replicate data within legal constraints.
  • Switch identity, DNS, and credential systems away from a sovereign domain.
  • Communicate with users and regulators in both the affected and receiving jurisdictions.

If your evacuation plan is invented during the incident, it's already too late.
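A playbook that lives only in heads or scattered wiki pages is hard to review. One option is to encode it as data that legal and SRE stakeholders can both sign off on. A minimal sketch, with hypothetical step descriptions and team names:

```python
from dataclasses import dataclass, field

@dataclass
class EvacuationStep:
    action: str
    owner: str          # team accountable for executing the step
    legal_review: bool  # whether counsel must sign off before execution

@dataclass
class EvacuationPlaybook:
    source_jurisdiction: str
    target_jurisdiction: str
    steps: list[EvacuationStep] = field(default_factory=list)

playbook = EvacuationPlaybook(
    source_jurisdiction="XX",
    target_jurisdiction="YY",
    steps=[
        EvacuationStep("Re-home traffic via DNS and load balancers", owner="traffic-eng", legal_review=False),
        EvacuationStep("Re-replicate data within residency limits", owner="data-platform", legal_review=True),
        EvacuationStep("Rotate credentials off the sovereign identity provider", owner="security", legal_review=False),
        EvacuationStep("Notify users and regulators in both jurisdictions", owner="legal-comms", legal_review=True),
    ],
)
```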

4. Chaos Engineering for Sovereign Failures

Extend chaos engineering beyond "kill an instance" or "isolate an AZ." Simulate:

  • Loss of an entire jurisdiction's control plane.
  • Blocked cross-region traffic to a specific country.
  • Sudden inability to replicate data outside a jurisdiction.

The goal is the same as classic chaos engineering: surface assumptions before reality does.
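A full game-day would inject real network rules; as a starting point, even a tabletop-style what-if over the region model catches the worst gaps. A sketch under that assumption, with an illustrative capacity field:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str
    jurisdiction: str
    capacity: int  # illustrative unit of serving capacity

def simulate_jurisdiction_loss(regions: list[Region], lost: str, required_capacity: int) -> list[Region]:
    """What-if: every region in `lost` goes dark simultaneously.
    A real experiment would inject firewall/DNS rules; this only checks
    whether the remaining footprint can carry the load."""
    survivors = [r for r in regions if r.jurisdiction != lost]
    remaining = sum(r.capacity for r in survivors)
    assert remaining >= required_capacity, (
        f"Losing jurisdiction {lost} leaves {remaining} capacity, "
        f"below the required {required_capacity}"
    )
    return survivors
```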

Risk assessments should explicitly include legal and geopolitical change. A change in data residency law is a production incident waiting to happen if your replication topology depends on the old rules.
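One way to catch that early is to encode residency rules as data and check every replication change against them, so a change in law becomes a failing check rather than a surprise. The rule table and function below are hypothetical:

```python
# Hypothetical residency policy: where each dataset class may replicate from a jurisdiction.
RESIDENCY_RULES = {
    ("payments", "DE"): {"DE"},        # payments data originating in DE must stay in DE
    ("profiles", "DE"): {"DE", "FR"},  # profile data may also replicate to FR
}

def check_replication(dataset: str, source: str, targets: list[str]) -> list[str]:
    """Return replication targets that the (hypothetical) residency rules forbid.
    Run this whenever the law or the topology changes, not just at design time."""
    allowed = RESIDENCY_RULES.get((dataset, source))
    if allowed is None:
        return []  # no rule on file; treat as unrestricted, or fail closed per policy
    return [t for t in targets if t not in allowed]

print(check_replication("payments", "DE", ["FR", "DE"]))  # -> ['FR']
```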

The Broader Implication

The most important shift in Vardhan's framing is conceptual: resilience is no longer a purely technical discipline. Reliability engineering now intersects with international law, sanctions policy, and geopolitical risk in ways that were marginal a decade ago.

Practically, that means:

  • Architects need a working understanding of jurisdictional risk, not just availability metrics.
  • DR playbooks should be reviewed by legal and policy stakeholders, not just SREs.
  • Vendor selection should consider sovereign exposure, not just SLAs and pricing.
  • "Regions" should be modeled in the architecture as members of jurisdictions, not as independent units.

Final Takeaway

The world has changed faster than our HA models. Hardware is no longer the most likely cause of region-scale unavailability — jurisdiction is. The honest response is to treat sovereign fault domains as a first-class failure boundary, design multi-region (not just multi-AZ) systems for the workloads that matter, build evacuation playbooks before they're needed, and chaos-test against the kinds of failures that don't show up on a vendor status page.

If your continuity story still ends at "we run in three AZs," it's time to extend it. The next region you lose might not come back.


Reference: When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World