What Can We Learn from NetEase Cloud Music's Outage?

This afternoon at around 14:44, NetEase Cloud Music went down and did not recover until 17:11. The rumored cause is an infrastructure problem related to cloud disk storage.

Incident Timeline

During the outage, NetEase Cloud Music clients could still play music that had been downloaded for offline use, but any attempt to access online resources returned an error. The web version showed 502 server errors and was completely inaccessible.

During this period, NetEase’s 163 portal also experienced 502 server errors and later redirected to the mobile version. Some users also reported that NetEase News and other services were affected.

Many users thought their internet was down when they couldn’t connect to NetEase Cloud Music. Some uninstalled and reinstalled the app, while others assumed their company IT had blocked music streaming sites. The flood of comments quickly pushed the outage onto Weibo's trending topics.

The outage lasted until 17:11 when NetEase Cloud Music recovered, and the 163 main portal switched back from mobile to desktop version. The total outage duration was approximately two and a half hours — a P0 incident.
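To put the duration in perspective, here is a rough back-of-the-envelope sketch that converts the reported window (14:44 to 17:11) into minutes and compares it with yearly downtime budgets. The 99.9% and 99.99% targets are illustrative assumptions only, not commitments NetEase is known to have made.

```python
from datetime import datetime

# Start and recovery times as reported in this article (same day).
start = datetime.strptime("14:44", "%H:%M")
end = datetime.strptime("17:11", "%H:%M")

downtime_min = (end - start).total_seconds() / 60

MINUTES_PER_YEAR = 365 * 24 * 60


def annual_budget_min(availability_pct):
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Best annual availability still achievable if this were the only outage.
availability_pct = 100 * (1 - downtime_min / MINUTES_PER_YEAR)

print(f"downtime: {downtime_min:.0f} min")
print(f"best possible annual availability: {availability_pct:.3f}%")
for target in (99.9, 99.99):  # hypothetical targets, for comparison only
    print(f"{target}% budget: {annual_budget_min(target):.1f} min/year")
```

Under these assumptions, this single incident uses up nearly three years' worth of a four-nines downtime budget, while still fitting inside a three-nines budget.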

At 17:16, NetEase Cloud Music’s Zhihu account posted an apology notice, stating that users who search for “enjoy music” tomorrow will receive a free 7-day Black Vinyl VIP membership as compensation.

Root Cause Analysis

During this period, all sorts of rumors and hearsay circulated. Claims of a fire at headquarters 🔥 (based on old photos), a TiDB crash (netizen speculation), Black Myth: Wukong downloads overloading the network, and programmers deleting the database and running away were all obviously fake news.

However, two sources can serve as references: an article previously published on NetEase Cloud Music’s official account, “Cloud Music Guizhou Data Center Migration Overall Plan Review”, and two detailed leaked chat records.

The rumored cause relates to cloud storage issues. While I won’t post the leaked chat records, you can refer to screenshots in articles like “NetEase Cloud Music Down, Cause Exposed! Data Center Migration Completed in July, Rumored Related to Cost Reduction” or authoritative media coverage “Exclusive | NetEase Cloud Music Outage Truth: Technical Cost Reduction, Insufficient Staff Took Half a Day to Troubleshoot”.

Some public information about NetEase’s cloud storage team is available. For example, Curve, NetEase’s self-developed cloud storage project, was terminated.

Checking the Curve project's GitHub homepage, we find that the project has been stagnant since early 2024.

The last release remains at RC without an official version, and the project has essentially become unmaintained, entering silent mode.

The Curve team leader also published an article “Curve: Regretful Farewell, Unfinished Journey” on their official account, which was subsequently deleted. I had some impression of this because Curve was one of two open-source shared storage solutions recommended by PolarDB, so I specifically researched this project. Now it seems…

Lessons Learned

We’ve discussed layoffs and cost reduction many times before. What additional lessons can we learn from this incident? Here are my thoughts:

The first lesson: don’t run serious databases on cloud disks! On this matter, I can indeed say “Told you so”. Underlying block storage is primarily used for databases. If failures occur here, the blast radius and debugging difficulty far exceed the intellectual bandwidth of typical engineers. An outage this long (two and a half hours) clearly wasn’t a stateless service problem.

The second lesson: self-developed solutions are fine, but keep people around to maintain them. Cost reduction eliminated the entire storage team, leaving no one to help when problems arose.

The third lesson: beware of big company open source. An underlying storage project, once deployed, is not something you can simply replace. When NetEase killed the Curve project, all infrastructure using Curve became unmaintained ruins. Stonebraker mentioned this in his famous paper “What Goes Around Comes Around”.

Reference Reading

NetEase Cloud Music Down

GitHub Global Outage, Another Database Rollover?

Alibaba Cloud Down Again, This Time Cable Cut?

Global Windows Blue Screen: Both Parties Are Amateur Hour

Database Deletion: Google Cloud Wiped Out Fund’s Entire Cloud Account

Cloud Dark Forest: Bankrupting AWS Bills with Just S3 Bucket Names

Internet Tech Master Crash Course

How State Enterprises Inside View Cloud Vendors Outside

Alibaba Cloud Stuck at Government Enterprise Customer Gates

Amateur Hour Behind Internet Outages

Cloud Vendors’ View of Customers: Poor, Idle, and Needy

taobao.com Certificate Expired

Are Cloud SLAs Placebo or Toilet Paper Contracts?

Luo Yonghao Can’t Save Toothpaste Cloud

Outages Aren’t Why Tencent Cloud Is Amateur Hour, Arrogance Is

Tencent Cloud Computing Epic Second Rollover

Redis Going Closed Source Is a Disgrace to “Open Source” and Public Cloud

Analyzing Cloud Computing Costs: Did Alibaba Cloud Really Lower Prices?

What Can We Learn from Tencent Cloud’s Post-Mortem?

Tencent Cloud: Face-Lost Amateur Hour

From “Cost Reduction LOL” to Real Cost Reduction and Efficiency

Alibaba Cloud Weekly Explosion: Cloud Database Management Down Again

What We Can Learn from Alibaba Cloud’s Epic Outage

Alibaba Cloud Computing Epic Rollover
