
OpenAI Global Outage Postmortem: K8S Circular Dependencies

·1943 words·10 mins·
Ruohang Feng (@Vonng), Pigsty Founder

On December 11th, OpenAI experienced a global service outage affecting ChatGPT, API, Sora, Playground, and Labs. The outage lasted from 3:16 PM to 7:38 PM PT, spanning over four hours with significant impact.

According to OpenAI’s incident report published afterward, the root cause was a newly deployed monitoring service that overwhelmed the Kubernetes control plane. The control plane failure then prevented direct rollback, amplifying the impact and causing extended unavailability.

This incident bears a striking similarity to last year's epic Alibaba Cloud global outage. Both were global control-plane failures caused by circular dependencies, compounded by insufficient testing and gradual rollout. The difference: Alibaba's loop was between OSS and IAM, while OpenAI's was between DNS and the Kubernetes control plane.


Circular dependencies are architectural poison: like dynamite buried in your infrastructure's foundation, they can be set off by any transient or sporadic failure. This incident is another wake-up call. The root cause goes deeper than testing and gradual-rollout practices; it is architectural acrobatics:

Kubernetes officially recommends a maximum cluster size of 5,000 nodes, yet I clearly remember OpenAI's boastful article: "How We Scaled Kubernetes to 7,500 Nodes by Removing One Component." Not only did they strip out redundancy, they pushed 50% past the recommended limit. In the end, they did indeed fall over on cluster-scale issues.


OpenAI is the AI industry's darling, with undeniable product strength and popularity. But that cannot mask its weakness in infrastructure: reliable infrastructure is genuinely hard, which is why companies like AWS and Datadog print money.

OpenAI also had a major outage last year involving its PostgreSQL database and the PgBouncer connection pooler. Its infrastructure reliability track record over these two years has not been impressive. This incident proves once again that even a hundred-billion-dollar unicorn can be a house of cards when operating outside its core expertise.


Further Reading

What We Can Learn from Alibaba Cloud’s Epic Failure

Tencent Cloud: Face-losing Amateur Hour

Dark Forest: Bankrupting AWS Bills with Just an S3 Bucket Name

Unparalleled Database Deletion: Google Cloud Nuked an Entire Fund Account

Global Windows Blue Screen: Both Sides Are Amateur Operations

Alibaba Cloud: Death of High Availability Disaster Recovery Myth

Amateur Hour Show: Alibaba Cloud RDS Failure Chronicle

The Amateur Operations Behind Internet Outages

Should Databases Run in Kubernetes?


Original Incident Report

Issues with API, ChatGPT, and Sora

https://status.openai.com/incidents/ctrsv3lwd797

OpenAI Incident Report

This document provides a detailed account of an incident that occurred on December 11, 2024, during which all OpenAI services experienced significant outages. The root cause was the deployment of a new telemetry service that unexpectedly overwhelmed the Kubernetes control plane, triggering cascading failures across critical systems. We’ll dive into the fundamental causes, outline our response steps, and share the improvements we’re implementing to prevent similar incidents.


Impact

Between 3:16 PM and 7:38 PM PST on December 11, 2024, all OpenAI services experienced significant degradation or complete unavailability. This incident originated from a new telemetry service configuration rollout across all clusters and was not caused by security vulnerabilities or recent product releases. Starting at 3:16 PM, all products experienced significant performance degradation.

  • ChatGPT: Began substantial recovery around 5:45 PM and fully recovered at 7:01 PM.
  • API: Began substantial recovery around 5:36 PM, with all models fully recovered by 7:38 PM.
  • Sora: Fully recovered at 7:01 PM.

Root Cause

OpenAI operates hundreds of Kubernetes clusters globally. The Kubernetes control plane primarily handles cluster management, while the data plane runs actual workloads (like model inference services).

To improve organizational reliability, we’ve been enhancing cluster-level observability tools to increase visibility into system operations. At 3:12 PM PST, we deployed a new telemetry service across all clusters to collect detailed metrics from Kubernetes control planes.

Due to the telemetry service’s broad operational scope, the new service configuration inadvertently caused all nodes in every cluster to execute expensive Kubernetes API operations that scaled exponentially with cluster size. Thousands of nodes simultaneously making these high-load requests overwhelmed the Kubernetes API servers, crippling the control planes of large clusters. The issue was most severe in our largest clusters, preventing detection in test environments; additionally, DNS caching reduced problem visibility in production until the issue spread throughout clusters.
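The report does not publish the offending configuration, but the pattern it describes (every node issuing expensive, cluster-wide API calls) is easy to picture. Below is a minimal client-go sketch of that anti-pattern, purely illustrative and not OpenAI's actual agent: a per-node collector that does a full, unfiltered List on every scrape, so the apiserver's work grows with nodes times pods.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs on every node (e.g., as a DaemonSet pod), using in-cluster credentials.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Full, unfiltered, cluster-wide List on every node, every scrape:
		// this is the expensive call. With N nodes each pulling all P pods,
		// the apiserver does roughly O(N*P) work per interval. A field selector
		// (spec.nodeName=<this node>) or a shared watch/informer would cut
		// the load by orders of magnitude.
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(
			context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("list failed: %v", err)
		} else {
			log.Printf("scraped %d pods", len(pods.Items))
		}
		time.Sleep(15 * time.Second)
	}
}
```

In a 50-node staging cluster this loop is invisible; in a multi-thousand-node cluster the same code saturates the apiserver, which matches the "most severe in our largest clusters" observation.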

Although Kubernetes data planes can largely operate independently of control planes, data plane DNS resolution depends on the control plane—if the control plane fails, services cannot communicate via DNS.

In summary, the new telemetry service configuration unexpectedly generated massive Kubernetes API load in large clusters, crashing control planes and disrupting DNS service discovery.


Testing and Deployment

We tested the change in a staging cluster without detecting any issues. The failure primarily affected clusters above a certain size; combined with per-node DNS caching delaying failure visibility, the change didn’t reveal obvious anomalies before widespread deployment in production.

Before deployment, our primary concern was the new telemetry service’s resource consumption (CPU/memory). We assessed resource usage across all clusters pre-deployment to ensure new deployments wouldn’t interfere with running services. While we tuned resource requests for different clusters, we didn’t consider Kubernetes API server load. Meanwhile, change monitoring focused on the service’s health status without comprehensive cluster health monitoring (especially control plane health).
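As a hedged illustration of what "monitoring control plane health during a change" can mean in practice (this is not OpenAI's monitoring stack; only the metric names are standard Kubernetes ones, everything else is an assumption): the apiserver exports its own load metrics, so a rollout can watch them directly rather than only watching the new service's pods.

```go
package main

import (
	"bufio"
	"bytes"
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The apiserver exposes its own load metrics at /metrics (requires RBAC
	// access to that nonResourceURL). Comparing in-flight request gauges
	// before and after a rollout is a cheap control-plane health signal.
	raw, err := client.Discovery().RESTClient().Get().AbsPath("/metrics").DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	sc := bufio.NewScanner(bytes.NewReader(raw))
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // metrics lines can be long
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "apiserver_current_inflight_requests") {
			fmt.Println(line)
		}
	}
}
```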

Kubernetes data planes (handling user requests) are designed to continue operating when control planes are offline. However, Kubernetes API servers are crucial for DNS resolution, which is a core dependency for many services.

DNS caching provided a temporary buffer in the early stage of the failure: stale but usable DNS records kept resolving addresses for services. Over the following 20 minutes those caches gradually expired, and services that depend on live DNS resolution began to fail. The lag meant the problem surfaced only gradually while the rollout continued, and when the caches finally ran out the failure landed broadly and all at once. Expired caches also meant every service in the cluster issued fresh DNS queries, piling further load onto the control plane and making short-term recovery harder.
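To make that cache-expiry dynamic concrete, here is a tiny "serve-stale" resolver cache in Go. It is a sketch of the general technique only (the report does not say which caching layers OpenAI uses): a plain TTL cache buys roughly one TTL of grace before lookups start failing, while a serve-stale cache keeps answering with the last known addresses for as long as the upstream resolver is down.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// entry is a cached DNS answer with an expiry time.
type entry struct {
	addrs   []string
	expires time.Time
}

// StaleCache caches lookups for a TTL and, unlike a plain cache, keeps
// serving the last known answer when the upstream resolver stops responding.
type StaleCache struct {
	mu    sync.Mutex
	items map[string]entry
	ttl   time.Duration
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{items: map[string]entry{}, ttl: ttl}
}

func (c *StaleCache) Lookup(host string) ([]string, error) {
	c.mu.Lock()
	cached, ok := c.items[host]
	c.mu.Unlock()

	if ok && time.Now().Before(cached.expires) {
		return cached.addrs, nil // fresh hit: no query reaches cluster DNS
	}

	addrs, err := net.LookupHost(host)
	if err != nil {
		if ok {
			// Upstream (cluster DNS backed by the control plane) is down:
			// serve the stale answer instead of failing outright.
			return cached.addrs, nil
		}
		return nil, err
	}

	c.mu.Lock()
	c.items[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	cache := NewStaleCache(30 * time.Second)
	addrs, err := cache.Lookup("example.com")
	fmt.Println(addrs, err)
}
```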


Resolution

In most cases, monitoring deployments and rolling back problematic changes is relatively straightforward, and we have automated tools to detect and roll back faulty deployments. During this incident, our detection tools worked correctly—alerting engineers minutes before customer impact. However, actually fixing the problem required deleting the problematic telemetry service, which required accessing the Kubernetes control plane. With API servers unable to handle management operations under massive load, we couldn’t immediately remove the faulty service.

We confirmed the problem within minutes and immediately initiated multiple workflows attempting different approaches to quickly restore clusters:

  1. Scale down cluster size: Reduce total Kubernetes API load by decreasing node count.
  2. Block network access to Kubernetes management API: Prevent new high-load requests, giving API servers time to recover.
  3. Scale up Kubernetes API servers: Increase available resources to handle request backlogs, creating an operational window to remove the faulty service.

We simultaneously employed all three methods, eventually restoring access to some control planes, enabling us to delete the problematic telemetry service.

Once we restored access to some control planes, the system began improving rapidly. Where possible, we shifted traffic to healthy clusters while continuing to repair the others. Some clusters still hit resource contention during recovery: many services tried to re-download required components at the same time, saturating resources and requiring manual intervention.

This incident resulted from multiple systems and processes interacting and failing simultaneously:

  • Test environments failed to capture the new configuration’s impact on Kubernetes control planes.
  • DNS caching created delayed service failures, allowing widespread change deployment before full failure exposure.
  • Inability to access control planes during failure made recovery extremely slow.

Timeline

  • December 10, 2024: New telemetry service deployed to staging cluster, tested without issues.
  • December 11, 2024 2:23 PM: Code introducing the service merged to main branch, triggering deployment pipeline.
  • 2:51 PM to 3:20 PM: Change gradually applied to all clusters.
  • 3:13 PM: Alerts triggered, notifying engineers.
  • 3:16 PM: Small number of customers began experiencing impact.
  • 3:16 PM: Root cause confirmed.
  • 3:27 PM: Engineers began migrating traffic from affected clusters.
  • 3:40 PM: Customer impact peaked.
  • 4:36 PM: First cluster recovered.
  • 7:38 PM: All clusters recovered.

Prevention Measures

To prevent similar incidents, we’re implementing the following measures:

1. More Robust Staged Deployment

We will continue strengthening staged deployment and monitoring mechanisms for infrastructure changes, ensuring any failures are quickly detected and contained to smaller scopes. All future infrastructure-related configuration changes will use more comprehensive staged deployment processes with continuous monitoring of service workloads and Kubernetes control plane health during deployment.
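A hedged sketch of what such a rollout gate might look like (cluster context names, the soak time, and the deploy step are all placeholders; this is not OpenAI's pipeline): apply the change cluster by cluster, smallest blast radius first, and refuse to continue unless the previous cluster's apiserver still reports ready via its /readyz endpoint.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// controlPlaneReady asks a cluster's apiserver for its own readiness verdict.
func controlPlaneReady(kubeContext string) error {
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{CurrentContext: kubeContext},
	).ClientConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	_, err = client.Discovery().RESTClient().Get().AbsPath("/readyz").DoRaw(context.TODO())
	return err
}

func deployToCluster(name string) {
	// Placeholder for whatever actually ships the change (helm, kubectl apply, ...).
	fmt.Println("deploying telemetry change to", name)
}

func main() {
	// Hypothetical kubeconfig context names, smallest blast radius first.
	clusters := []string{"staging", "prod-small-1", "prod-large-1"}
	for _, c := range clusters {
		deployToCluster(c)
		time.Sleep(10 * time.Minute) // soak period before widening the rollout
		if err := controlPlaneReady(c); err != nil {
			fmt.Printf("halting rollout: control plane unhealthy in %s: %v\n", c, err)
			return
		}
	}
}
```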

2. Fault Injection Testing

The Kubernetes data plane needs to get better at surviving without the control plane. We will introduce testing for this scenario, including intentionally injecting "misconfigurations" into test environments to verify that our systems detect them and roll them back.
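As an illustration of what such a test could look like (entirely hypothetical: the goroutine count, latency budget, and kubeconfig handling are assumptions, and this is not OpenAI's test suite), a Go test can inject the same kind of load pattern seen in the incident against a disposable staging cluster and assert the control plane stays responsive.

```go
package faultinjection

import (
	"context"
	"sync"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestControlPlaneSurvivesNoisyAgent deliberately reproduces a "noisy agent"
// load pattern and asserts the apiserver still answers a trivial request.
func TestControlPlaneSurvivesNoisyAgent(t *testing.T) {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Skipf("no kubeconfig available: %v", err)
	}
	cfg.QPS = 500 // lift client-side throttling so the injected load reaches the apiserver
	cfg.Burst = 1000
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // the injected "misconfiguration": 50 noisy clients
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
			}
		}()
	}

	// Under that load, a basic control-plane request must still succeed quickly.
	probeCtx, probeCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer probeCancel()
	if _, err := client.CoreV1().Namespaces().List(probeCtx, metav1.ListOptions{Limit: 1}); err != nil {
		t.Errorf("control plane unresponsive under injected load: %v", err)
	}

	cancel()
	wg.Wait()
}
```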

3. Emergency Kubernetes Control Plane Access

We currently lack emergency mechanisms for accessing API servers when data planes put excessive pressure on control planes. We plan to establish “break-glass” mechanisms ensuring engineering teams can access Kubernetes API servers under any circumstances.

4. Further Decouple Kubernetes Data and Control Planes

Our current dependency on Kubernetes DNS services creates coupling between data and control planes. We will invest more effort in making control planes non-critical for essential services and product workloads, reducing single-point dependency on DNS.
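One common way to reduce the DNS dependency is client-side discovery through an informer cache, which keeps serving the last-known endpoints even if the apiserver becomes unreachable after the initial sync. The sketch below shows the general pattern only (the namespace and service names are hypothetical, and the report does not say this is the approach OpenAI will take).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer keeps an in-memory copy of Endpoints, updated by a watch.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	epInformer := factory.Core().V1().Endpoints()
	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, epInformer.Informer().HasSynced)

	// Lookups read the local cache: no DNS query, no apiserver round trip.
	// If the control plane disappears after the initial sync, the cache keeps
	// serving the last known endpoints for the lifetime of the process.
	ep, err := epInformer.Lister().Endpoints("prod").Get("inference") // hypothetical service
	if err != nil {
		fmt.Println("no cached endpoints:", err)
		return
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			fmt.Println("backend:", addr.IP)
		}
	}
}
```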

5. Faster Recovery Speed

We will introduce more comprehensive caching and dynamic rate limiting for critical resources required for cluster startup, regularly conducting “rapid entire cluster replacement” drills to ensure correct, complete startup and recovery in minimum time.
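On the client side, part of this is already available in client-go: each client can throttle its own request rate and paginate expensive Lists, which keeps a mass restart from turning into a thundering herd against the apiserver. A minimal sketch follows (the numbers are placeholders and this is not OpenAI's implementation; server-side API Priority and Fairness is the complementary control).

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Cap this client's own load on the apiserver: at most 2 requests per
	// second sustained, with short bursts of up to 5.
	cfg.QPS = 2
	cfg.Burst = 5

	client := kubernetes.NewForConfigOrDie(cfg)

	// Paginated List (Limit) instead of pulling everything in one request.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{Limit: 50})
	if err != nil {
		panic(err)
	}
	fmt.Printf("saw %d nodes in the first page\n", len(nodes.Items))
}
```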


Conclusion

We sincerely apologize to all customers affected by this incident—whether ChatGPT users, API developers, or enterprises relying on OpenAI products. This incident fell short of our own expectations for system reliability. We recognize the critical importance of providing highly reliable services to all users and will prioritize implementing the above prevention measures while continuously improving service reliability. Thank you for your patience during this outage.

December 12, 2024 - 17:19 PST


Resolved

Between 3:16 PM and 7:38 PM on December 11, 2024, OpenAI services were unavailable. Starting around 5:40 PM, we observed gradual API traffic recovery; ChatGPT and Sora recovered around 6:50 PM. We resolved the issue at 7:38 PM and restored all services to normal operation.

OpenAI will conduct a complete root cause analysis of this incident and share follow-up details on this page.

December 11, 2024 - 22:23 PST


Monitoring

API, ChatGPT, and Sora traffic has largely recovered. We will continue monitoring to ensure the issue is completely resolved.

December 11, 2024 - 19:53 PST


Update

We are continuing recovery efforts. API traffic is recovering, and we’re restoring ChatGPT traffic region by region. Sora has begun partial recovery.

December 11, 2024 - 18:54 PST


Update

We are working to fix the issue. API and ChatGPT have partially recovered; Sora remains offline.

December 11, 2024 - 17:50 PST


Update

We are continuing to develop recovery solutions.

December 11, 2024 - 17:03 PST


Update

We are continuing to develop recovery solutions.

December 11, 2024 - 16:59 PST


Update

We have found a viable recovery solution and are beginning to see some traffic successfully returning. We will continue working to restore services as quickly as possible.

December 11, 2024 - 16:55 PST


Update

ChatGPT, Sora, and API remain unavailable. We have identified the issue and are deploying a fix. We are working to restore services as quickly as possible and sincerely apologize for the outage impact.

December 11, 2024 - 16:24 PST


Identified Issue

We’ve received reports of API call errors and login issues with platform.openai.com and ChatGPT. We have confirmed the issue and are working on a fix.

December 11, 2024 - 15:53 PST


Update

We are continuing to investigate this issue.

December 11, 2024 - 15:45 PST


Update

We are continuing to investigate this issue.

December 11, 2024 - 15:42 PST


Investigating

We are currently investigating this issue and will provide more updates soon.

December 11, 2024 - 15:17 PST

This incident affected API, ChatGPT, Sora, Playground, and Labs.
