
Alibaba Cloud: High Availability Disaster Recovery Myth Shattered

·2422 words·12 mins·
Ruohang Feng (Pigsty Founder, @Vonng)

On September 10, 2024, Alibaba Cloud's Singapore Availability Zone C data center caught fire after a lithium battery explosion. A week later, services have still not been fully restored. By the monthly SLA availability calculation (7+ days of downtime out of 30 days, roughly 75% availability), the service isn't even reaching a single 8, let alone multiple 9s, and the figure is still falling. Of course, whether availability lands at 88% or 99% is now a minor issue; the real question is whether the data stored in a single availability zone can be recovered at all.
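A rough sanity check on that number, assuming at least seven full days of downtime in a 30-day month:

$$
\text{availability} = 1 - \frac{\text{downtime}}{\text{period}} \le 1 - \frac{7}{30} = \frac{23}{30} \approx 76.7\%
$$

and every additional day of outage pushes it lower.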

[Figure: status.png. As of 09-17, key services such as ECS, OSS, EBS, NAS, and RDS remain in an abnormal state]

Generally speaking, a small-scale fire inside a data center would not be particularly severe, since power and UPS equipment are usually placed in separate rooms, isolated from the server halls. But once fire-suppression sprinklers are triggered, things get serious: after a full power outage, server recovery time is basically measured in days; and if the equipment is flooded, the question is no longer availability but whether the data can be recovered at all, a matter of data integrity.

[Figure: event.png]

According to the announcements so far, a batch of servers was removed on the evening of the 14th for drying, and the work was still not finished by the 16th. The word "drying" strongly suggests water damage. We cannot state the facts before any official announcement, but common sense says damage to data consistency is highly probable; the only question is how much is lost. The impact of this Singapore fire is therefore likely to be on the same scale as, or even larger than, the major Hong Kong PCCW data center failure at the end of 2022 and the Double 11 global outage in 2023.

Natural disasters and human misfortunes are hard to predict, and the probability of failure will never drop to zero. What matters is what lessons we can draw from these failures.


Disaster Recovery Performance

Having an entire data center catch fire is very unfortunate. Most users can only rely on off-site cold backups to survive, or simply accept their bad luck. We could discuss whether lithium batteries or lead-acid batteries are better, or how UPS power should be laid out, but blaming Alibaba Cloud for these issues is meaningless.

What is meaningful is this: in this availability-zone-level failure, how did products that claim "cross-availability-zone high availability and disaster recovery", such as RDS cloud databases, actually perform? Failures give us an opportunity to verify the disaster recovery capabilities of these products against real-world results.

The core metrics of disaster recovery are RTO (Recovery Time Objective) and RPO (Recovery Point Objective), not some multiple 9s availability — the logic is simple: you can achieve 100% availability just by luck without failures. But what truly tests disaster recovery capability is the recovery speed and effectiveness after disasters occur.

| Configuration Strategy | RTO | RPO |
|---|---|---|
| Single instance + do nothing | Permanently lost, unrecoverable | All data lost |
| Single instance + basic backup | Depends on backup size & bandwidth (hours) | Data since last backup lost (hours to days) |
| Single instance + basic backup + WAL archiving | Depends on backup size & bandwidth (hours) | Unarchived data lost (tens of MB) |
| Primary-Secondary + manual failover | On the order of ten minutes | Data within replication lag lost (~100 KB) |
| Primary-Secondary + automatic failover | Within one minute | Data within replication lag lost (~100 KB) |
| Primary-Secondary + automatic failover + synchronous commit | Within one minute | No data loss |
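The last two rows are where synchronous replication makes the difference. As a minimal sketch (assuming a self-managed PostgreSQL primary-replica setup, not anything specific to RDS internals; the standby names are hypothetical), this is roughly how a little commit latency is traded for RPO = 0, and how to check the replication lag that bounds RPO in an asynchronous setup:

```sql
-- Minimal sketch for a self-managed PostgreSQL primary (not RDS internals).
-- Require at least one of the named standbys (hypothetical names) to confirm
-- each commit before it is acknowledged, so a failover loses no committed
-- transactions (RPO = 0), at the cost of some write latency.
ALTER SYSTEM SET synchronous_commit = 'on';
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
SELECT pg_reload_conf();

-- In an asynchronous setup, the current replication lag bounds the RPO:
SELECT application_name,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```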

After all, the multiple-9s availability figures written into an SLA are not actual historical performance, but a promise to compensate you with vouchers worth XX yuan if that level isn't met. To examine a product's real disaster recovery capability, we have to rely on drills, or on its actual performance under real disasters.

So what about actual performance? In this Singapore fire, how long did it take RDS services in the affected availability zone to switch over? Multi-AZ high-availability RDS instances completed their switchover around 11:30, 70 minutes after the failure began at 10:20, which means RTO < 70 minutes.

That is an improvement over the 133 minutes RDS needed to switch over in the 2022 Hong Kong Zone C failure. But compared with Alibaba Cloud's own stated metric (RTO < 30 seconds), it is still off by two orders of magnitude.
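In concrete terms:

$$
\frac{70\ \text{min}}{30\ \text{s}} = \frac{4200\ \text{s}}{30\ \text{s}} = 140 \approx 10^{2}
$$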

As for basic single-availability-zone RDS instances, the official documentation says RTO < 15 minutes, but the reality is that these instances are approaching their seventh-day memorial: RTO > 7 days. Whether RTO and RPO will end up at infinity ∞ (data completely lost and unrecoverable) is something we will have to wait for official news to learn.


Accurately Labeling Disaster Recovery Metrics

Alibaba Cloud's official documentation claims that the RDS service provides multi-availability-zone disaster recovery: "The high-availability series and cluster series provide a self-developed high-availability system, achieving failure recovery within 30 seconds. The basic series can complete failover in about 15 minutes." In other words, high-availability RDS has RTO < 30 s and the basic single-instance version has RTO < 15 min, which are reasonable figures in themselves.

[Figure: rds.jpg]

I believe when a single cluster’s primary instance experiences single-machine hardware failure, Alibaba Cloud RDS can achieve the above disaster recovery metrics — but since this claims “multi-availability zone disaster recovery”, users’ reasonable expectation is that RDS failover can also achieve this when an entire availability zone fails.

Availability-zone disaster recovery is a reasonable requirement, especially considering that Alibaba Cloud has had quite a few availability-zone-wide failures in just the past year or two (including one global, all-services failure):

2024-09-10 Singapore Zone C Data Center Fire

2024-07-02 Shanghai Zone N Network Access Abnormal

2024-04-27 Zhejiang Region Access to Other Regions or Other Regions Access to Hangzhou OSS, SLS Service Abnormal

2024-04-25 Singapore Region Zone C Partial Cloud Product Service Abnormal

2023-11-27 Alibaba Cloud Partial Region Cloud Database Console Access Abnormal

2023-11-12 Alibaba Cloud Product Console Service Abnormal (Global Major Failure)

2023-11-09 Mainland China Access to Hong Kong, Singapore Region Partial EIP Inaccessible

2023-10-12 Alibaba Cloud Hangzhou Region Zone J, Hangzhou Financial Cloud Zone J Network Access Abnormal

2023-07-31 Heavy Rain Affects Beijing Fangshan Region NO190 Data Center

2023-06-21 Alibaba Cloud Beijing Region Zone I Network Access Abnormal

2023-06-20 Partial Region Telecom Network Access Abnormal

2023-06-16 Mobile Network Access Abnormal

2023-06-13 Alibaba Cloud Guangzhou Region Public Network Access Abnormal

2023-06-05 Hong Kong Zone D Certain Data Center Cabinet Abnormal

2023-05-31 Alibaba Cloud Access to Jiangsu Mobile Region Network Abnormal

2023-05-18 Alibaba Cloud Hangzhou Region ECS Console Service Abnormal

2023-04-27 Partial Beijing Mobile (formerly China Tietong) Users Network Access Packet Loss

2023-04-26 Hangzhou Region Container Registry ACR Service Abnormal

2023-03-01 Shenzhen Zone A Partial ECS Access to Local DNS Abnormal

2023-02-18 Alibaba Cloud Guangzhou Region Network Abnormal

2022-12-25 Alibaba Cloud Hong Kong Region PCCW Data Center Cooling Equipment Failure

So why can metrics that are achievable for a single RDS cluster failure not be achieved in an availability-zone-level failure? From historical failures we can infer that the infrastructure the database high-availability mechanism depends on is probably itself deployed in a single AZ. The earlier Hong Kong PCCW data center failure showed the same pattern: a single-AZ failure quickly spread to the whole Region, because the control plane itself has no multi-availability-zone disaster recovery.

Starting at 10:17 on December 18, some RDS instances in Alibaba Cloud's Hong Kong Region Zone C began raising unavailability alarms. As the range of affected hosts in the zone expanded, the number of instances with service anomalies grew accordingly, and engineers initiated the emergency database switchover procedure. By 12:30, most cross-availability-zone instances of RDS MySQL, Redis, MongoDB, and DTS had completed cross-availability-zone switchover. For some single-availability-zone instances and single-availability-zone high-availability instances that depended on data backups within that zone, only a few could be effectively migrated. A small number of RDS instances that did support cross-availability-zone switchover also failed to switch in time; investigation showed that these instances depended on a proxy service deployed in Hong Kong Zone C, and because the proxy was unavailable, the instances could not be reached through their proxy addresses. We helped the affected customers recover by temporarily switching to the RDS primary instance addresses. As the data center's cooling equipment recovered, most database instances returned to normal around 21:30. For single-instance databases affected by the failure, and for high-availability instances with both primary and secondary in Hong Kong Zone C, we provided temporary recovery options such as instance cloning and instance migration; however, due to limits on underlying service resources, some instance migrations encountered anomalies and took longer to resolve.

The ECS control system has dual-data-center disaster recovery across Zones B and C. After Zone C failed, Zone B continued to provide service. But because many Zone C customers purchased new instances in other Hong Kong zones, combined with traffic from the recovery actions of Zone C ECS instances, Zone B's control-service resources became insufficient. The newly expanded ECS control system depended, at startup, on middleware services deployed in the Zone C data center, so it could not be scaled out for an extended period. And the custom-image data service that ECS control depends on used the single-AZ-redundancy version of OSS in Zone C, so customers' newly purchased instances failed to start.

I suggest that cloud products, RDS included, be honest and accurately label the RTO and RPO actually achieved in historical failures and in real availability-zone disasters, rather than vaguely claiming "30-second / 15-minute recovery, no data loss, multi-availability-zone disaster recovery" and promising things they cannot deliver.


What Exactly is Alibaba Cloud’s Availability Zone?

In this Singapore Zone C failure, as well as the previous Hong Kong Zone C failure, one problem Alibaba Cloud demonstrated is that single data center failures spread to the entire availability zone, and single availability zone failures affect the entire region.

In cloud computing, Region and Availability Zone (AZ) are basic concepts that anyone familiar with AWS will recognize. By AWS's definition, a Region contains multiple availability zones, and each availability zone consists of one or more independent data centers.

[Figure: aws-az-region.png, AWS Regions and Availability Zones]

AWS does not have many Regions: the US, for example, has only four. But each region has two or three (or more) availability zones, and one availability zone (AZ) typically corresponds to multiple data centers (DC). AWS's practice is to cap a DC at around 80,000 hosts, with distances between DCs of 70-100 kilometers. This forms a three-tier relationship: Region, AZ, DC.

Alibaba Cloud, however, looks different: it lacks the key concept of a Data Center (DC). What it calls an Availability Zone (AZ) looks like a single data center, and what it calls a Region corresponds to the AZ concept one level up at AWS.

They’ve elevated the original AZ to Region, and elevated the original DC (or part of a DC, one floor?) to availability zone.

We cannot know what motivated Alibaba Cloud's design. One speculation: Alibaba Cloud might originally have set up just a handful of domestic regions (North China, East China, South China, and a Western region), and then, to make the numbers look better in reports (look, AWS has only 4 Regions in the US, while we have 14 in China alone!), things ended up the way they are now.


Cloud Vendors Have a Responsibility to Promote Best Practices

TBD

Can a cloud that offers object storage with only single-AZ redundancy be judged as anything other than stupid or malicious?

Cloud vendors have a responsibility to promote good practices; otherwise, when problems occur, it will still be "all the cloud's fault".

If the default you give users is local three-replica redundancy, most users will simply take the default.

It depends on disclosure. Without disclosure, that is malicious. With disclosure, can long-tail users actually understand the implications? In practice, many do not.

You are already keeping three replicas, so why not put one of them in another DC or another AZ? You are already charging a markup of hundreds of times on storage, so why not spend a little more on same-city redundancy? Is it just to save on cross-AZ replication traffic fees?


Please Be Serious About Health Status Pages

TBD

I have heard of weather forecasts, but never of failure forecasts. Yet Alibaba Cloud's health dashboard gives us this magical ability: you can select future dates, and those future dates still show service health data. You can, for example, check the service health status 20 years from now:

[Figure: 2044.png, the health dashboard showing service status for a date in 2044]

The future “failure forecast” data appears to be populated with current status. So services currently in failure state remain abnormal in the future. If you select 20 years later, you can still see the current Singapore major failure’s health status as “abnormal”.

Perhaps Alibaba Cloud is using this subtle way of telling users that the data in the Singapore single availability zone is gone, gone forever, and not to count on getting it back. A more reasonable inference, of course, is that this is no failure forecast: the health dashboard was built like an intern project, with no design review, no QA testing, no thought for edge cases, just slapped together and shipped.


Cloud is the New Single Point of Failure, What to Do?

TBD

The three elements of information security, CIA: Confidentiality, Integrity, Availability. Alibaba has recently suffered major failures on all of them.

First there was the catastrophic Alibaba Cloud Drive bug that leaked private photos, breaking confidentiality;

and now this availability-zone failure has shattered the multi-AZ / single-AZ availability myth, and even threatens the lifeline: data integrity.


Further Reading

Is It Time to Abandon Cloud Computing?

Cloud Exit Odyssey

The End of FinOps is Cloud Exit

Why Cloud Computing Isn’t as Profitable as Digging Sand?

Is Cloud SLA Just a Placebo?

Is Cloud Storage a Pig Butchering Scam?

Is Cloud Database an IQ Tax?

Paradigm Shift: From Cloud to Local First

Tencent Cloud CDN: From Getting Started to Giving Up


【Alibaba】Epic Cloud Computing Disaster is Here

Grab Alibaba Cloud’s Benefits While You Can, 5000 Yuan Cloud Server for 300

How Cloud Vendors See Customers: Poor, Idle, and Love-Starved

Alibaba Cloud’s Failures Could Happen on Other Clouds Too, and Data Could Be Lost

Chinese Cloud Services Going Global? Fix the Status Page First

Can We Trust Alibaba Cloud’s Failure Handling?

An Open Letter to Alibaba Cloud

Platform Software Should Be as Rigorous as Mathematics — Discussion with Alibaba Cloud RAM Team

Chinese Software Practitioners Beaten by the Pharmaceutical Industry

Tencent’s Typo Culture

Why Cloud Can’t Retain Customers — Using Tencent Cloud CAM as an Example

Why Does Tencent Cloud Team Use Alibaba Cloud’s Service Names?

Are Customers Terrible, or is Tencent Cloud Terrible?

Are Baidu, Tencent, and Alibaba Really High-Tech Companies?

Cloud Computing Vendors, You’ve Failed Chinese Users

Besides Discounted VMs, What Advanced Cloud Services Are Users Actually Using?

Are Tencent Cloud and Alibaba Cloud Really Doing Cloud Computing? – From Customer Success Case Perspective

Who Are Local Cloud Vendors Actually Serving?
