Cloud-Exit High Availability Secret: Rejecting Complexity Masturbation

We don’t need Kubernetes masters or fancy new databases — programmers are drawn to complexity like moths to flame. The more complex the system architecture diagram, the greater the intellectual masturbation high. Our steadfast resistance to this behavior is a key reason for our success in cloud-free availability.

Author: David Heinemeier Hansson, known as DHH, Co-founder & CTO of 37signals, Creator of Ruby on Rails, cloud exit advocate, practitioner, and pioneer. Frontrunner in fighting tech giant monopolies. Hey Blog
Translator: Vonng (Feng Ruohang), Founder & CEO of PIGSTY. Author of Pigsty, PostgreSQL expert/evangelist. Host of WeChat public account “Illegal Plus Feng”, cloud computing mudslide, database veteran.
This article is translated from DHH’s blog post

Keeping the Lights On While Leaving the Cloud

For the ops team at 37signals, 2023 was undoubtedly a challenging year. We migrated seven core applications out of the cloud, including HEY, our email service that was born in the cloud and whose extremely stringent availability requirements the cloud exit could not be allowed to compromise. Fortunately, we succeeded. In 2023, HEY achieved a remarkable 99.99% uptime!

This is critically important because if people can’t access their email, they might miss flight check-ins, fail to complete time-sensitive transactions, or miss critical medical test results. We take this responsibility very seriously, so achieving this near-perfect four nines during a year that required completely transforming how HEY operates became a source of tremendous pride.

But HEY wasn’t the only application to receive this meticulous operational care. In 2023, all our major applications achieved at least 99.99% availability, including Highrise, Backpack, Campfire, and every version of Basecamp. The year wasn’t free of incidents, but our team resolved each one quickly, keeping total downtime for the entire year under 0.01%.
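To put four nines in concrete terms, here is a minimal sketch (not from the original post) that converts an availability target into a yearly downtime budget. The function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year such as 2023


def downtime_budget_minutes(availability_percent: float) -> float:
    """Maximum downtime per year (in minutes) allowed by an availability target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


# Compare three nines, four nines, and a perfect year.
for target in (99.9, 99.99, 100.0):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes/year")
```

Under these assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime for the whole year, while 100% leaves none.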

No application better illustrates our ability to ensure application reliability and stability outside the cloud than Basecamp 2. This is the version of Basecamp we sold from 2012 to 2015, still serving thousands of users and generating millions in revenue. It has been running on our own hardware for years, and this is now the second consecutive year achieving an almost unbelievable 100% availability — 365 days of zero downtime in 2023, continuing the glory of 2022.

I won’t pretend that such excellent availability is effortless, because it’s not. We have a skilled and dedicated ops team that deserves high praise for their tremendous contributions to this goal. But it’s also not rocket science!

A considerable portion of Basecamp 2’s magic in achieving 100% availability for two consecutive years, and of all our other applications reaching 99.99% availability, comes from our choice of simple, boring, fundamentally solid technology. We use F5, Linux, KVM, Docker, MySQL, Redis, Elasticsearch, and of course Ruby on Rails. Our tech stack is unassuming and straightforward, with very little complexity. We don’t need Kubernetes masters or fancy databases and storage. Most of the time, you won’t need them either.

But programmers are drawn to complexity like moths to flame. The more complex the system architecture diagram, the greater the intellectual masturbation high. Our steadfast resistance to this behavior is the fundamental reason for our victory in availability.

I’m not talking about the technology needed to operate Netflix, Google, or Amazon. At that scale, you indeed encounter truly pioneering problems with no ready-made solutions to borrow from. But for the other 99.99% of us, modeling our own infrastructure on what we imagine theirs to be is an alluring but deadly siren song.

To have good availability, you need not the cloud, but mature technology running on redundant hardware with proper backups configured, as always.

Note: DHH saved nearly $10 million in high cloud costs. This article translates DHH’s latest cloud exit progress. For the cloud exit backstory and complete process, refer to: “Cloud-Exit Odyssey”, “Is It Time to Give Up on Cloud Computing?”, and “DHH Cloud-Exit FAQ”.

Translator’s Commentary

DHH points out the best practice for maintaining good availability — running humble, mature, foundational technology on redundant hardware. Most of software’s cost overhead isn’t in the initial development phase, but in the ongoing maintenance phase. And simplicity is crucial for system maintainability.
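The value of redundant hardware can be illustrated with a rough back-of-the-envelope calculation. The sketch below is an illustration rather than anything from DHH's post. It assumes node failures are independent (real failures are often correlated) and that any single surviving node can carry the load. The function name and the 99% per-node figure are hypothetical.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant nodes when any one node can serve traffic.

    Assumes failures are independent, which real hardware only approximates.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


# Hypothetical example: each node is available 99% of the time.
for n in (1, 2, 3):
    print(f"{n} node(s) at 99% each -> {combined_availability(0.99, n):.6f}")
```

Under the independence assumption, two modest 99% nodes already reach four nines in theory, which is why plain redundancy on boring technology goes a long way.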

Some programmers, whether for intellectual masturbation or job security, pile unnecessary complexity into architectural designs, such as throwing Kubernetes at everything regardless of whether the scale and load call for it, or using glue code to wire together a bunch of flashy databases. They chase things that are “cool enough” to satisfy their own sense of worth, rather than asking whether the problem at hand actually needs these dragon-slaying techniques.

Rube Goldberg machine: “Accomplishing through extremely complex and circuitous methods what could actually or seemingly be done easily” — a form of intellectual masturbation through complexity.

Complexity slows everyone down and significantly increases maintenance costs. Making changes in complex systems carries greater risk of introducing bugs (such as the major failures described in “From Cost-Reduction Jokes to Real Cost Reduction”). When complexity leads to maintenance difficulties, budgets and timelines typically overrun. When developers struggle to understand the system, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Reducing complexity can dramatically improve software maintainability, so simplicity should be a key goal in building systems.

Not every company has Google’s scale and scenarios, requiring starships to solve their unique problems. PostgreSQL + Go/Ruby/Python on bare metal/VMs or classic LAMP has taken countless companies all the way to IPO. Never forget that designing for scale you don’t need is wasted effort: it is a form of premature optimization, which is the root of all evil.

Take my personal experience as an example. During Tantan’s early-to-mid stages, with millions of daily active users, the tech stack remained very humble: applications written purely in Go, and PostgreSQL as the only database. At a scale of 2.5M TPS and 200TB of data, PostgreSQL alone supported the business stably and reliably. Beyond its primary OLTP role, it also served for quite a long time as cache, OLAP engine, batch processor, and even message queue. Eventually some of these side roles were gradually moved to dedicated components, but by then we were at nearly 10 million daily active users, and in hindsight the necessity of some of those new components is questionable.

Therefore, when we design and review architectures, it is worth using complexity as an additional lens of scrutiny. For more discussion on complexity, please refer to the following articles:

Should Databases Go into K8S?

From Cost-Reduction Jokes to Real Cost Reduction

Is Putting Databases in Docker a Good Idea?

Are Microservices a Stupid Idea?

Are Distributed Databases False Needs?
