How One AWS DNS Failure Cascaded Across Half the Internet

Ruohang Feng (Pigsty Founder, @Vonng)

On Oct 20, 2025, AWS’s crown-jewel region us-east-1 spent more than fifteen hours flailing. More than a thousand companies went dark worldwide. The root cause? An internal DNS entry that stopped resolving.

From the moment DNS broke, DynamoDB, EC2, Lambda, and 139 other services quickly degraded. Snapchat, Roblox, Coinbase, Signal, Reddit, Robinhood—gone. Billions evaporated in half a day. This was a cyber earthquake.

The most alarming part: us-east-1 hosts the control plane for every commercial AWS region. Customers running in Europe or Asia still got wrecked, because their control traffic routes through Virginia. A single DNS hiccup turned into a multi-billion dollar blast radius. This isn’t a “skills” problem; it’s architectural hubris. us-east-1 became the nervous system of the internet, and nervous systems seize up.

affected-service.jpg

Cyberquake math: billions torched in hours

Catchpoint’s CEO estimated the damage somewhere between “billions” and “hundreds of billions.”

Finance bled first. Robinhood was offline through the entire NYSE session; millions of traders were locked out. Coinbase’s outage froze crypto markets in the middle of volatility. Venmo logged 8,000 outage reports—imagine a whole society losing its wallet mid-day.

Gaming giants cratered. Roblox’s hundred-million DAUs were suddenly ejected. Fortnite, Pokémon GO, Rainbow Six—all silent. For engagement-driven platforms, each downtime hour is permanent churn.

UK government portals, tax systems, customs, banks, and several airlines reported disruptions. Even Amazon’s own empire stopped: amazon.com, Alexa, Ring doorbells, Prime Video, and AWS’s own ticketing tools failed. Turns out even the company that built us-east-1 can’t escape its single point of failure.

down-dectector.jpg

Root cause: DNS butterflies

11:49 PM PDT, Oct 19: error rates in us-east-1 spiked. AWS didn’t confirm until 22 minutes later, and the mess dragged on until 3:53 PM Oct 20—a 16-hour saga.

health-status.jpg

AWS’s status post reads like slapstick: the DNS entry for DynamoDB’s regional API endpoint stopped resolving. That single failure cut DynamoDB off from everything that talks to it, and DynamoDB underpins IAM, EC2, Lambda, CloudWatch—the entire control plane.

DNS got patched in about 3.5 hours, but the backlog of queued requests triggered a retry storm that clobbered DynamoDB again. EC2, load balancers, and Lambda all depend on DynamoDB, and DynamoDB depends right back on them. The ouroboros locked up. AWS had to manually throttle EC2 launches and Lambda/SQS polling to stop the cascade, then inch the fleet back online.
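
The retry-storm dynamic is easy to reproduce in miniature. Below is a minimal sketch (not AWS’s internal code; the flaky dependency and its failure rate are invented) of the standard defense: capped exponential backoff with jitter, so a recovering service isn’t flattened by its own clients retrying in lockstep.

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base=0.1, cap=5.0):
    """Retry a flaky dependency with capped exponential backoff and full jitter.

    Clients that retry immediately multiply the load on a struggling service;
    spreading retries out in time gives it room to recover.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential ceiling.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

def call_dependency():
    """Hypothetical stand-in for a DynamoDB request during the incident."""
    if random.random() < 0.7:  # pretend 70% of calls fail while degraded
        raise ConnectionError("endpoint not resolving")
    return "ok"

if __name__ == "__main__":
    try:
        print(call_with_backoff(call_dependency))
    except ConnectionError:
        print("gave up after repeated failures")
```

The jitter matters as much as the backoff: without it, every failed client wakes up and retries at the same instant, which is exactly the synchronized hammering AWS ended up suppressing by hand.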

Cascading amplification: the Achilles heel

us-east-1 isn’t just another datacenter—it’s the central nervous system of AWS. Excluding China, GovCloud, and the EU’s sovereign region, every control-plane call funnels through Virginia.

deps.jpg

Translation: even if you run workloads in Tokyo or Frankfurt, IAM auth, S3 configuration, DynamoDB global tables, Route 53 updates—all still go to us-east-1. That’s why the UK government, Lloyds Bank, and Canada’s Wealthsimple all went down: invisible dependencies bite just the same.
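
You can see the pull toward Virginia from the client side. A small sketch, assuming boto3 is installed (the credentials are throwaway placeholders and nothing goes over the network, since no API call is made): the “global” services resolve to partition-wide endpoints anchored in us-east-1 regardless of the region you configure, while the data plane stays regional.

```python
import boto3

# Even with a European region configured, IAM and Route 53 resolve to
# partition-global endpoints, which AWS operates out of us-east-1.
for service in ("iam", "route53", "dynamodb"):
    client = boto3.client(
        service,
        region_name="eu-central-1",
        aws_access_key_id="placeholder",       # never sent; no request is made
        aws_secret_access_key="placeholder",
    )
    print(f"{service:10s} -> {client.meta.endpoint_url}")

# Illustrative output:
#   iam        -> https://iam.amazonaws.com
#   route53    -> https://route53.amazonaws.com
#   dynamodb   -> https://dynamodb.eu-central-1.amazonaws.com
```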

us-east-1 holds this power because it’s the oldest region. Nineteen years of layers, debt, and special cases piled up. Refactoring it would touch millions of lines, thousands of services, and untold customer assumptions. AWS chose to live with the risk. Incidents like this remind us of that price.

Technical autopsy: how a papercut becomes an ICU stay

From 2017 to 2025, every us-east-1 catastrophe exposed the same anti-patterns. Nobody learned.

| Aspect | 2017 S3 Outage | 2020 Kinesis Outage | 2025 DNS Outage |
|---|---|---|---|
| Trigger | Human error (fat finger) | Scaling limit (thread caps) | DNS resolution failure |
| Core service | S3 | Kinesis | DynamoDB |
| Duration | ~4 hours | 17 hours | 16 hours |
| Cascade path | S3 → EC2 → EBS → Lambda | Kinesis → EventBridge → ECS/EKS → CloudWatch → Cognito | DNS → DynamoDB → IAM → EC2 → NLB → Lambda / CloudWatch |
| Recovery pain | Massive subsystem restarts | Gradual reboots & routing rebuild | Retry storms, backlog drain, NLB health rebuild |
| Monitoring blind spots | Service Health Dashboard down | CloudWatch degraded | CloudWatch & SHD impaired |
| Blast radius | us-east-1 (plus dependents) | us-east-1 | Global (IAM/global tables) |
| Economic impact | $150M for S&P 500 firms | N/A | Billions to hundreds of billions |

Looped dependencies, death-spiral edition. AWS microservices are a hairball. IAM, EC2 management, ELB—all lean on DynamoDB; DynamoDB leans on them. Complexity hides inside layers of abstraction, making diagnosis painfully slow. We’ve seen the exact same story at Alicloud, OpenAI, and Didi: circular dependencies kill you the day something hiccups.
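
Cycles like these are cheap to find on a whiteboard and brutally expensive to discover mid-incident. A toy sketch (the service graph is invented for illustration, not AWS’s real topology): walk the dependency edges with a depth-first search and report the first loop you hit.

```python
# Hypothetical control-plane dependency graph: "A": ["B"] means A calls B.
DEPS = {
    "IAM": ["DynamoDB"],
    "EC2-control": ["DynamoDB", "IAM"],
    "ELB": ["EC2-control", "DynamoDB"],
    "DynamoDB": ["IAM", "ELB"],  # the edge that closes the loop
}

def find_cycle(graph):
    """Return one dependency cycle as a list of services, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:      # back edge: we found a loop
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

print(find_cycle(DEPS))  # e.g. ['IAM', 'DynamoDB', 'IAM']
```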

Centralized single points of failure. After the 2020 Kinesis collapse, AWS evangelized its “cell-based” design and said it was migrating services to it. Yet us-east-1 still anchors the global control plane. Six availability zones mean nothing when DNS—the ultimate shared service—goes sideways. Multi-region fantasies crumble in the face of one supernode.
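
For contrast, the core of cell-based isolation is almost embarrassingly small: pin each tenant to one self-contained cell so a bad deploy or a poisoned cache burns a slice of customers instead of all of them. A minimal sketch with invented cell names:

```python
import hashlib

# Each cell is a full, independent copy of the stack with no shared state.
CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def cell_for(customer_id: str) -> str:
    """Deterministically pin a customer to one cell to bound the blast radius."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

print(cell_for("acct-1234"))  # always the same cell for the same customer
```

The catch, as this outage showed, is that cells only help if the routing layer and the shared dependencies underneath them (DNS, IAM, DynamoDB) are themselves partitioned.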

Monitoring eating its tail. AWS’s monitoring stacks run on AWS. Datadog does too. When us-east-1 went dark, everything that would’ve sounded the alarm went dark with it. Seventy-five minutes in, the AWS status page still showed “all green.” Not malice—just blindness.
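
The boring fix is a probe that shares nothing with the platform it watches: independent network, independent DNS resolver, independent alerting path. A minimal sketch using only the Python standard library, meant to run from a box outside AWS; the endpoint list is illustrative.

```python
import socket
import time

# Illustrative endpoints to probe from infrastructure that does NOT run on AWS.
ENDPOINTS = [
    ("dynamodb.us-east-1.amazonaws.com", 443),
    ("sts.amazonaws.com", 443),
]

def probe(host, port, timeout=3.0):
    """Resolve the name, then open a TCP connection; report which step failed."""
    try:
        addr = socket.getaddrinfo(host, port)[0][4][0]
    except socket.gaierror as exc:
        return f"DNS FAIL  {host}: {exc}"
    try:
        start = time.monotonic()
        with socket.create_connection((addr, port), timeout=timeout):
            return f"OK        {host} ({addr}) in {time.monotonic() - start:.2f}s"
    except OSError as exc:
        return f"TCP FAIL  {host} ({addr}): {exc}"

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        print(probe(host, port))
```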

Circuit breakers MIA. AWS preaches breakers everywhere, but internal meshes apparently ignore that advice. Once DynamoDB glitched, every upstream service hammered it harder. The retry wave did more damage than the initial failure. Eventually AWS engineers had to rate-limit systems by hand. “Automate all the things” devolved into babysitting.
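
For reference, the textbook breaker the paragraph alludes to fits in a couple dozen lines: after enough consecutive failures it opens and fails fast instead of piling onto the sick dependency, then lets a single half-open probe through to decide when to close again. A minimal sketch, not any AWS-internal implementation:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a sick dependency."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let this one probe call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # probe succeeded, close the breaker
            return result

# Hypothetical usage: breaker.call(table.get_item, Key={"id": "42"})
```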

Organizational amnesia

Ops folks have a meme: It’s always DNS. Any veteran SRE would start there. AWS wandered in the dark for hours, then flailed with manual throttles for five more hours. When you hollow out your expert teams, this is what you get.

Amazon laid off 27k people between 2022 and 2025. Internal docs show “regretted attrition” at 69–81%—meaning most departures were folks the company wanted to keep. The forced return-to-office push drove even more seniors out. Justin Garrison predicted in 2023 that major outages would follow. He was optimistic.

“Regretted attrition” = employees the company didn’t want to lose but lost anyway.

You can’t replace institutional memory. The engineers who knew which microservice relied on which shadow API are gone. New hires don’t have the scar tissue to debug cascading chaos. You can’t document that intuition; it only comes from years of firefights. So the next edge case hits, and the on-call team spends a hundred times longer fumbling toward the fix.

Cloud economist Corey Quinn put it plainly in The Register: “Lay off your best engineers and don’t be shocked when the cloud forgets how DNS works. The next catastrophe is already queued; the only question is which understaffed team trips over which edge case first.”

A colder future: designing for fragility

A few months ago a Google IAM outage took down half the internet. Less than half a year later, AWS repeated the feat with DNS. When a single DNS record inside one hyperscaler can disrupt tens of millions of lives, we need to admit the obvious: cloud convenience bought us systemic fragility.

Three U.S. companies control 63% of global cloud infrastructure. That’s not just a tech risk; it’s geopolitical exposure. Convenience and concentration come as a package deal: every gain in one deepens the other.

featured.jpg

Marketing promises “four nines,” “global active-active,” and “enterprise-grade reliability.” Stack AWS/Azure/GCP’s actual outage logs and the myth disintegrates. Cherry Servers’ 2025 report lays out the numbers:

| Cloud Provider | Incidents (2024.08–2025.08) | Avg Duration |
|---|---|---|
| AWS | 38 | 1.5 hours |
| Google Cloud | 78 | 5.8 hours |
| Microsoft Azure | 91 | 4.6 hours |

Headline numbers from the study

“Leaving the cloud” used to sound heretical. Now it’s just risk management. Elon Musk’s X (formerly Twitter) ran fine through this AWS outage because it operates its own datacenters. Musk spent the downtime roasting AWS on X. 37signals decided in 2022 to yank Basecamp and HEY off public clouds, projecting eight figures of savings over five years. Dropbox started rolling its own hardware back in 2016. That’s not regression; it’s diversification.

musk.jpg

For teams with resources, hybrid deployment makes sense: keep the crown jewels under your control, burst to cloud for elastic needs. Ask whether every workload truly belongs on a hyperscaler. Can your critical systems keep the lights on if the cloud disappears for a day?

Build resilience inside fragility. Maintain autonomy inside dependence. us-east-1 will fail again—not if, but when. The real question is whether you’ll be ready next time.

References

AWS: Update – services operating normally

AWS Health: Operational issue – Multiple services (N. Virginia)

HN: AWS multiple services outage in us-east-1

CNN: Amazon says systems are back online after global internet outage

The Register: Brain drain finally sends AWS down the spout

Converge: DNS failure triggers multi-service AWS disruption


Incident log

12:11 AM PDT – Investigating elevated error rates and latency across multiple services in us-east-1 (N. Virginia). Next update in 30–45 minutes.

12:51 AM PDT – Multiple services confirmed impacted; Support Center/API also flaky. Mitigations underway.

1:26 AM PDT – Significant errors on DynamoDB endpoints; other services affected. Support ticket creation remains impaired. Engineering engaged; next update by 2:00.

2:01 AM PDT – Potential root cause identified: DNS resolution failures for DynamoDB APIs in us-east-1. Other regional/global services (IAM updates, DynamoDB global tables) also affected. Keep retrying. Next update by 2:45.

2:22 AM PDT – Initial mitigations deployed; early recovery signs. Requests may still fail; expect higher latency and backlogs needing extra time.

2:27 AM PDT – Noticeable recovery; most requests should now succeed. Still draining queues.

3:03 AM PDT – Most impacted services recovering. Global features depending on us-east-1 also coming back.

3:35 AM PDT – DNS issue fully mitigated; most operations normal. Some throttling remains while CloudTrail/Lambda drain events. EC2 launches (and ECS) still see elevated errors; refresh DNS caches if DynamoDB endpoints still misbehave. Next update by 4:15.

4:08 AM PDT – Working through EC2 launch errors (including “insufficient capacity”). Mitigating elevated Lambda polling latency for SQS event-source mappings. Next update by 5:00.

4:48 AM PDT – Still focused on EC2 launches; advise launching without pinning an AZ so EC2 can pick healthy zones. Impact extends to RDS, ECS, Glue. Auto Scaling groups should span AZs. Increasing Lambda polling throughput for SQS; AWS Organizations policy updates also delayed. Next update by 5:30.

5:10 AM PDT – Lambda event-source polling for SQS restored; draining queued messages.

5:48 AM PDT – Progress on EC2 launches; some AZs can start new instances. Rolling mitigations to remaining AZs. EventBridge and CloudTrail backlogs continue to drain without new delays. Next update by 6:30.

6:42 AM PDT – More mitigations applied, but EC2 launch errors remain high. Throttling new launches to aid recovery. Next update by 7:30.

7:14 AM PDT – Significant API and network issues confirmed across multiple services. Investigating; update within 30 minutes.

7:29 AM PDT – Connectivity problems impacting multiple services; early recovery signals observed while root cause analysis continues.

8:04 AM PDT – Still tracing connectivity issues (DynamoDB, SQS, Amazon Connect, etc.). Narrowed to EC2’s internal network. Mitigation planning underway.

8:43 AM PDT – Further narrowed: internal subsystem monitoring Network Load Balancer (NLB) health is misbehaving. Throttling EC2 launches to help recovery.

9:13 AM PDT – Additional mitigation deployed; NLB health subsystem shows recovery. Connectivity and API performance improving. Planning next steps to relax EC2 launch throttles. Next update by 10:00.

10:03 AM PDT – Continuing NLB-related mitigations; network connectivity for most services improving. Lambda invocations still erroring when creating new execution environments (including Lambda@Edge). Validating an EC2 launch fix to roll out zone by zone. Next update by 10:45.

10:38 AM PDT – EC2 launch fix progressing; some AZs show early recovery. Rolling out to remaining zones should resolve launch and connectivity errors. Next update by 11:30.

11:22 AM PDT – Recovery continues; more EC2 launches succeed, connectivity issues shrink. Lambda errors dropping, especially for cold starts. Next update by noon.

12:15 PM PDT – Broad recovery observed. Multiple AZs launching instances successfully. Lambda functions calling other services may still see intermittent errors while network issues clear. Lambda-SQS polling was reduced earlier; rates now ramping back up. Next update by 1:00.

1:03 PM PDT – Continued improvement. Further reducing throttles on new EC2 launches. Lambda invocation errors fully resolved; event-source polling for SQS restored to pre-incident levels. Next update by 1:45.

1:52 PM PDT – EC2 throttles continue to ease across all AZs; ECS/Glue etc. recover as launches succeed. Lambda is healthy; queued events should clear within ~two hours. Next update by 2:30.

2:48 PM PDT – EC2 launch throttling back to normal; residual services wrapping up recovery. Incident closed by 3:53 PM.

