Author: Alex Miller (Snowflake, Apple, Google), 2024-11-19
Translator: Feng Ruohang & GPT o1, PG Grand Master, Database Veteran, Cloud Computing Maverick
Translator’s Recommendation: This article is a comprehensive review of how hardware developments affect database design, introducing key hardware advances in networking, storage, and computing. I’ve always believed that fully utilizing new hardware (rather than tinkering with so-called distributed systems) is the right path for database kernel development; see “Reclaiming the Dividends of Computer Hardware” and “Are Distributed Databases a False Requirement”. This article gives an excellent introduction to cutting-edge hardware-software co-design practices in the database field and is well worth reading.
Original: Modern Hardware for Future Databases
We’re in an exciting era for databases: every major resource domain is seeing continuous progress, and each advance has the potential to change what the optimal database architecture looks like. Overall, I hope to see some interesting shifts in database architectures over the next decade, but I’m uncertain whether we’ll get the necessary hardware support.
Networking#
According to Stonebraker’s presentation at HPTS 2024, benchmarks using VoltDB found that approximately 60% of server-side CPU time was spent in the TCP/IP stack. VoltDB is a database architecture designed to eliminate as much non-query-processing work as possible from serving requests, so this is an extreme example. Still, it effectively makes the point that TCP’s computational overhead is not trivial, and the problem becomes more pronounced as network bandwidth increases. This isn’t a new observation, and a series of incremental solutions have been proposed.
One proposed solution is replacing TCP with another UDP-based protocol, with QUIC being a commonly suggested example. However, this idea is misguided: “While this is a seriously inaccurate simplification, at the simplest level, QUIC is just TCP encapsulated and encrypted in a UDP payload.” The CPU overhead of TCP and QUIC is correspondingly similar. To achieve significant improvements, one needs to deviate further from TCP and specialize for a specific environment, as papers like Homa show for datacenters. But even with better protocols, the greater optimization potential lies in reducing kernel network stack overhead.
Note: If you’re wondering while reading why QUIC is mentioned here, it’s because I’ve participated multiple times in discussions where TCP or TLS was blamed for certain problems, and migrating to QUIC was suggested as a solution. QUIC can indeed help solve some problems, but there are also some problems it can’t improve, or might even make worse. It’s important to understand that steady-state latency and bandwidth belong to the latter category.
One way to reduce kernel workload is to move computationally intensive but simple parts into hardware. This has been happening gradually over time, for example offloading segmentation and checksumming to the network card. A more recent improvement is kTLS, which allows TLS packet encryption to be offloaded to the network card as well. Attempts to offload the entire TCP stack to hardware, in the form of TCP Offload Engines (TOE), have been systematically rejected by Linux maintainers. So despite these nice improvements, the main parts of the TCP stack remain the kernel’s responsibility.
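To make the kTLS mechanism concrete, here is a minimal sketch of enabling kernel TLS transmit offload on an already-connected TCP socket. It assumes the TLS handshake was done in userspace (e.g. with OpenSSL) and that the negotiated key material is available to copy into the kernel’s crypto_info structure:

```c
#include <netinet/tcp.h>     /* SOL_TCP, TCP_ULP */
#include <linux/tls.h>       /* TLS_TX, struct tls12_crypto_info_aes_gcm_128 */
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282          /* defined by newer kernel/libc headers */
#endif

/* Sketch: hand the keys of an already-negotiated TLS 1.2 session to the
 * kernel so that record encryption happens below the application (and on
 * the NIC, where the driver supports TLS offload). Key material is left
 * zeroed here purely for illustration. */
static int enable_ktls_tx(int sock)
{
    if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;

    struct tls12_crypto_info_aes_gcm_128 ci;
    memset(&ci, 0, sizeof(ci));
    ci.info.version     = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    /* ci.key, ci.iv, ci.salt, ci.rec_seq would be copied out of the
     * userspace TLS library's session state. */

    return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```

Once TLS_TX is installed, ordinary write() calls on the socket produce encrypted TLS records without the application touching the record layer again.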
Hence, another solution is removing the kernel as an intermediary layer between network card and application. Frameworks like the Data Plane Development Kit (DPDK) let user space poll the network card for packets, eliminating interrupt overhead, and keeping all processing in user space means never entering and exiting the kernel. DPDK has faced difficulties in adoption because it requires exclusive control of a network card, so each host needs two: one for DPDK and another for the operating system and every other process. Marc Richards put together a nice Linux kernel vs. DPDK benchmark showing that DPDK provides a 50% throughput improvement, and then lists the series of drawbacks one must accept to gain that 50%. This seems to be a tradeoff most databases aren’t interested in; even ScyllaDB has largely abandoned its investment here.
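For a sense of the DPDK programming model, the heart of an application is a busy-poll receive loop that pulls bursts of packets directly from the NIC into userspace, with no interrupts or system calls on the hot path. A schematic sketch (EAL initialization, port configuration, and mempool setup omitted):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Schematic DPDK receive loop: poll the NIC directly from userspace. */
static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Pull up to BURST_SIZE packets from the hardware RX ring.
         * Returns immediately (possibly with 0 packets): no blocking,
         * no interrupts, no kernel involvement. */
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                                bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Application-level protocol handling would go here, e.g.
             * parsing a request out of rte_pktmbuf_mtod(bufs[i], ...). */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```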
Newer hardware provides an interesting new option: removing the CPU from the network path. RDMA (Remote Direct Memory Access) provides verbs, a limited set of operations (mainly read, write, and 8-byte CAS) that are executed entirely by the network card without CPU involvement. With the CPU cut out, remote read latency approaches 1 microsecond, whereas TCP latency is upwards of 100 microseconds. As part of RDMA, responsibility for packet loss and flow control is also delegated entirely to the network card. Cutting out the CPU also means large amounts of data can be transferred without the CPU becoming the bottleneck.
Note: Why is delegating packet-loss handling and flow control to hardware acceptable for RDMA, when Linux maintainers have consistently refused to do the same for TCP? Because it is a different and much more restricted API, which reduces the complexity of the interface between the network card and the host. “TCP Offload is a Dumb Idea Whose Time Has Come” is interesting reading in this area. (From 2003!)
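To make “verbs” concrete, here is a rough sketch of posting a one-sided RDMA read with the ibverbs API. It assumes a connected queue pair `qp`, a registered local buffer `mr`, and that the remote address and rkey were exchanged out of band (libfabric, recommended below, wraps the same idea behind a more portable interface):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: read `len` bytes of remote memory into a local registered buffer.
 * The remote host's CPU is never involved; its NIC services the read. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *local_buf, size_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The completion is later reaped from the completion queue (ibv_poll_cq). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```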
Using RDMA as a low-latency, high-throughput network primitive changes how people design databases. “The End of a Myth: Distributed Transactions Can Scale” shows RDMA’s low latency enables classic 2PL+2PC to scale to large clusters. “Is Scalable OLTP in the Cloud a Solved Problem?” proposes sharing writable page caches between nodes, because low latency makes tighter coupling of components feasible. RDMA isn’t just for OLTP databases; BigQuery uses RDMA for shuffle-based joins because of its high throughput. Changing the latency and CPU utilization at a given throughput changes the optimal design choices, or unlocks designs previously considered infeasible[^3].
Note: If you’re going to use RDMA, I strongly recommend using libfabric, as it abstracts over all the different RDMA vendors and libraries. The RDMAmojo blog has years of specialized content about RDMA and is one of the best resources for learning its every aspect.
Finally, there’s a newer class of hardware continuing the trend of putting more computational power into the network card itself: SmartNICs, or Data Processing Units (DPUs). They allow arbitrary computation to be delegated to the network card and potentially invoked in response to requests from other network cards. These are quite new. I’d suggest “DPDPU: Data Processing with DPUs” for an overview, “DDS: DPU-optimized Disaggregated Storage” for how to integrate them into databases, and “Azure Accelerated Networking: SmartNICs in the Public Cloud” for deployment details. Overall, I expect SmartNICs to extend RDMA from simple reads and writes to general-purpose, CPU-bypassing RPC (for computationally inexpensive request-reply handlers).
Storage#
In storage devices, there are advances aimed at reducing total cost of ownership for specific use cases. Manufacturers cleverly observed that the read head can read a much narrower strip than the track width the write head magnetizes onto the platter, so tracks can be overlapped like roof shingles down to that minimum readable width. Thus we have Shingled Magnetic Recording (SMR) hard drives, which introduce the concept of dividing storage into zones that only support appending or erasing. SMR HDDs target use cases like object storage, where data is accessed infrequently but large amounts of it must be stored cheaply.
Similar ideas have been applied to SSDs, giving us Zoned Namespace (ZNS) SSDs. Exposing zones means the drive doesn’t need to provide a Flash Translation Layer (FTL) or a complex garbage collection process. Like SMR, this reduces the cost of ZNS SSDs relative to “regular” SSDs, but the specific focus is that application-driven garbage collection is more efficient, reducing total write amplification and extending drive lifespan. Consider LSMs (Log-Structured Merge trees) on SSDs: they already operate via incremental appends and large erases, so removing the FTL between the LSM and the SSD opens up optimization opportunities. More recently, Google and Meta collaborated on the Flexible Data Placement (FDP) proposal, which is more of a hint for grouping writes with related lifetimes than the strictly enforced zones of ZNS. The goal is an easier upgrade path: an SSD can ignore the FDP part of a write request and still be semantically correct, just with worse performance or write amplification.
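On Linux, both SMR HDDs and ZNS SSDs are exposed through the zoned block device interface, so a storage engine can enumerate zones and reset them explicitly instead of relying on a drive-internal FTL. A minimal sketch using the raw ioctls (libzbd wraps these more conveniently); the device path is hypothetical:

```c
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Sketch: report the first few zones of a zoned block device and reset one.
 * Sectors are 512-byte units, as elsewhere in the Linux block layer. */
int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);   /* hypothetical ZNS namespace */
    if (fd < 0) { perror("open"); return 1; }

    /* BLKREPORTZONE fills a blk_zone_report header followed by zone entries. */
    const unsigned int nr = 4;
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;
    rep->nr_zones = nr;
    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned int i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu write_pointer=%llu\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp);

    /* Resetting a zone rewinds its write pointer: the application-driven
     * analogue of garbage-collecting an erase block. */
    struct blk_zone_range range = {
        .sector = rep->zones[0].start,
        .nr_sectors = rep->zones[0].len,
    };
    if (ioctl(fd, BLKRESETZONE, &range) < 0) perror("BLKRESETZONE");

    free(rep);
    return 0;
}
```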
Note: If you were expecting discussion about persistent memory, unfortunately Intel has discontinued Optane, so this is currently a dead end. There seem to be some companies like Kioxia or Everspin continuing efforts in this area, but I haven’t heard of actual applications.
Other improvements aim not at cost efficiency but at expanding the feature set storage devices support. Focusing on NVMe in particular: NVMe added a Copy command to eliminate the waste of reading and rewriting the same data. Fused compare-and-write commands allow CAS operations to be delegated to the drive itself, enabling innovative designs like delegating optimistic lock coupling to the drive. NVMe inherited support for the Data Integrity Field (DIF) and Data Integrity Extension (DIX) from SCSI, enabling page checksums to be delegated to the drive (Oracle notably uses this). There are also projects like KV-SSD that change the entire data model from storing blocks by index to storing objects by key, going as far as replacing the software storage engine entirely. SSD manufacturers continue to make SSDs capable of more operations.
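Whether a particular drive implements these optional commands can be read out of the Identify Controller data (the ONCS field). The sketch below issues the Identify admin command through the Linux NVMe passthrough ioctl; the byte offsets and bit positions follow my reading of the NVMe specification, so verify them against the spec before relying on this:

```c
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDONLY);     /* controller character device */
    if (fd < 0) { perror("open"); return 1; }

    uint8_t id[4096];                           /* Identify Controller data */
    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x06;                        /* Identify */
    cmd.addr     = (uintptr_t)id;
    cmd.data_len = sizeof(id);
    cmd.cdw10    = 1;                           /* CNS=1: Identify Controller */

    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) { perror("identify"); return 1; }

    /* ONCS (Optional NVM Command Support) lives at bytes 520-521 of the
     * Identify Controller structure, per my reading of the spec. */
    uint16_t oncs = (uint16_t)id[520] | ((uint16_t)id[521] << 8);
    printf("Compare:      %s\n", (oncs & (1 << 0)) ? "yes" : "no");
    printf("Write Zeroes: %s\n", (oncs & (1 << 3)) ? "yes" : "no");
    printf("Copy:         %s\n", (oncs & (1 << 8)) ? "yes" : "no");
    return 0;
}
```

This is essentially what `nvme id-ctrl` reports, as discussed in the cloud availability section below.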
Note: As of July 25, 2024, AWS no longer offers S3 Select to new customers, possibly in favor of S3 Object Lambda.
As the furthest step along this path, SmartSSDs are emerging, which allow arbitrary computation to be pushed into the SSD itself. “Query Processing on SmartSSDs: Opportunities and Challenges” reviews their application to query processing tasks. Pushing filters down to storage is always beneficial; I often cite earlier work like PushdownDB leveraging S3 Select as an excellent example in the analytics domain. Using SmartSSDs, we have papers like “POLARDB Meets Computational Storage”. Even without specialized integration, some argue that transparent drive-internal compression alone can narrow the write-amplification gap between B+ trees and LSMs (reference). Leveraging SmartSSDs remains an emerging research area, but its potential impact is large.
Computing#
Transaction Processing#
At the recent VLDB conference, two authorities in database research published a position paper, “Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware”, arguing that Unikernels allow a database to customize the operating system to its exact needs. Earlier work on vmcache particularly emphasized the challenge of efficient database buffer management, where one must either accept the complexity of pointer swizzling or frequently cross into the kernel via mmap()-related system calls.

Neither choice is ideal, whereas Unikernels provide direct access to virtual memory primitives. As this area receives more attention, the effort required to adopt Unikernels is decreasing: Akira Kurogane got MongoDB running as a Unikernel via Unikraft with minimal effort, and follow-up posts show performance improvements without any changes to MongoDB internals. There’s a long-running joke that databases really want to become operating systems, because the hunger for performance means wanting ever more control over networking, file systems, disk I/O, memory, and so on; Unikernel databases make exactly that possible.
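For context, the vmcache idea is roughly: reserve virtual address space for every page of the database with an anonymous mapping, explicitly read pages into it, and evict them with madvise(), so the engine controls residency without pointer swizzling. A compressed sketch (file name and sizes are made up; error handling, dirty-page writeback, and concurrency are omitted):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Sketch of vmcache-style buffer management: virtual memory is reserved for
 * every page of the database, but physical memory is only consumed for pages
 * the engine has explicitly faulted in. */
int main(void)
{
    int fd = open("db.file", O_RDWR);                 /* hypothetical data file */
    uint64_t num_pages = 1ull << 20;                  /* 4 GiB of address space */

    /* Reserve virtual address space only; MAP_NORESERVE avoids committing swap. */
    char *pool = mmap(NULL, num_pages * PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    uint64_t pid = 42;                                /* some page id */

    /* "Fix" a page: read its bytes from disk into its fixed virtual address. */
    pread(fd, pool + pid * PAGE_SIZE, PAGE_SIZE, (off_t)(pid * PAGE_SIZE));

    /* ... use the page; pointers into it stay valid, no swizzling needed ... */

    /* "Evict": write back if dirty (omitted), then release the physical page
     * while keeping the virtual address reserved. */
    madvise(pool + pid * PAGE_SIZE, PAGE_SIZE, MADV_DONTNEED);

    close(fd);
    return 0;
}
```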
To achieve data confidentiality beyond TLS and disk encryption, secure enclaves allow execution of verifiably untampered code, keeping the data it operates on protected even from a compromised operating system. Trusted Platform Modules (TPMs) allow keys to be stored securely in a machine; secure enclaves extend that protection to arbitrary code and data. This makes it possible to build databases with extremely high resilience to malicious attacks, though with several design constraints. Microsoft has published research on integrating secure enclaves into Hekaton and has shipped this work as part of SQL Server Always Encrypted. Alibaba has likewise published their efforts on building an enclave-native storage engine for enterprise customers concerned about data confidentiality. Security improvements in databases have always been pushed along by regulatory compliance, and secure enclaves are a meaningful advance in data confidentiality.
Since Spanner introduced TrueTime, clock synchronization has become a major focus for transaction ordering in geographically distributed databases. Every major cloud provider now has an NTP service backed by atomic clocks or GPS satellites (AWS, Azure, GCP). This is very useful for any similar design, such as CockroachDB or YugabyteDB, whose correctness depends critically on clock synchronization and whose performance degrades under conservatively wide error bounds. AWS’s recent Aurora Limitless also uses a TrueTime-like design. This is the only cloud-specific, not-strictly-hardware item mentioned here, because it amounts to major cloud providers giving users access to expensive hardware (atomic clocks) that they wouldn’t otherwise consider buying themselves.
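To see why the error bounds matter for performance: a TrueTime-style commit protocol picks a timestamp and then waits out the clock uncertainty before acknowledging the commit, so the wait is proportional to the bound. A hypothetical sketch, where `clock_now_interval()` is a toy stand-in for whatever bounded-uncertainty clock API the platform offers (e.g. AWS’s ClockBound):

```c
#include <stdint.h>
#include <time.h>

/* A clock reading with explicit uncertainty: true time is in [earliest, latest]. */
struct bounded_time { uint64_t earliest_ns, latest_ns; };

/* Toy stand-in for a bounded-uncertainty clock: take the system clock and
 * assume a fixed +/- 500 us error bound. A real implementation would query
 * the platform's clock-bound service for the actual bound. */
static struct bounded_time clock_now_interval(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t now = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    return (struct bounded_time){ now - 500000, now + 500000 };
}

/* TrueTime-style commit wait: choose the commit timestamp at the pessimistic
 * edge of the interval, then wait until every node's clock is guaranteed to
 * have passed it. Tighter error bounds mean shorter waits. */
static uint64_t commit_wait(void)
{
    uint64_t commit_ts = clock_now_interval().latest_ns;

    while (clock_now_interval().earliest_ns <= commit_ts) {
        struct timespec pause = { .tv_sec = 0, .tv_nsec = 100000 }; /* 100 us */
        nanosleep(&pause, NULL);
    }
    return commit_ts;   /* now safe to expose to other transactions */
}
```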
Hardware transactional memory has had quite an unfortunate history. Sun’s Rock processor had hardware transactional memory support, but Sun was acquired and the Rock project terminated. Intel attempted to ship it twice and had to disable it both times. There’s some interesting work on applying hardware transactional memory to in-memory databases, but short of finding some old CPUs to experiment on, we’re all waiting for a CPU manufacturer to announce that it plans to try again.
Note: The first time was due to a bug, the second due to a side-channel attack that breaks KASLR. There’s also a speculative-execution timing attack that was discovered through a misunderstanding of a CTF challenge’s intent.
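For reference, when it was available, hardware transactional memory was exposed on Intel CPUs through the RTM intrinsics: wrap the critical section in `_xbegin()`/`_xend()` and fall back to a lock on abort. A sketch (needs a TSX-capable CPU and compilation with `-mrtm`):

```c
#include <immintrin.h>
#include <stdatomic.h>

/* Simple flag lock used as the fallback path for aborted transactions. */
static atomic_int fallback_lock = 0;

static void lock(void)   { while (atomic_exchange(&fallback_lock, 1)) ; }
static void unlock(void) { atomic_store(&fallback_lock, 0); }

/* Sketch: transfer between two accounts inside an RTM hardware transaction,
 * falling back to the lock if the transaction aborts. Reading the lock inside
 * the transaction puts it in the read set, so a concurrent fallback holder
 * forces the transaction to abort rather than observe a torn update. */
static void transfer(long *from, long *to, long amount)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (atomic_load(&fallback_lock))
            _xabort(0xff);          /* someone is in the fallback path */
        *from -= amount;
        *to   += amount;
        _xend();                    /* all of the above commits atomically */
        return;
    }

    /* Aborted (lock held, conflict, capacity, interrupt, ...): use the lock. */
    lock();
    *from -= amount;
    *to   += amount;
    unlock();
}
```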
Query Processing#
There have always been companies founded on leveraging specialized hardware to accelerate query processing, aiming for better performance and cost efficiency than CPU-only competitors. GPU-driven databases like Voltron Data, HEAVY.ai, and Brytlyt are the first steps in this direction. I wouldn’t be surprised if Intel or AMD integrated graphics gain OpenCL support at some point in the future, opening the door for all databases to assume some degree of GPU capability across a broader range of hardware configurations.
Note: OpenGL compute shaders are the most general and portable form of using GPUs for arbitrary computation, and integrated graphics chipsets already support these. However, I can’t find any database-related papers using them.
There are also opportunities to use more energy-efficient hardware. The latest Neural Processing Units (NPUs) and Tensor Processing Units (TPUs) have been shown to be usable for query processing in work like “TCUDB: Accelerating Database with Tensor Processors”. Some companies have tried leveraging FPGAs: Swarm64 attempted (and possibly failed) to enter this market, and AWS itself tried with Redshift AQUA. Even the largest companies seem to find going all the way to ASICs not worthwhile, as even Oracle stopped SPARC development in 2017. I’m not very optimistic about the prospects for FPGAs or ASICs, because memory bandwidth becomes the main bottleneck at some point anyway, but ADMS is the conference to watch for papers in this area.
Note: Strictly speaking, ADMS is a workshop affiliated with VLDB, but I don’t know the word for collectively referring to conferences, journals, and workshops.
Cloud Availability#
Finally, let’s face the frustrating fact: none of these hardware advances matter if you can’t get the hardware. For today’s systems that means the cloud, and the clouds haven’t been giving customers the cutting edge of hardware.
In networking, the situation isn’t great. DPDK is the most accessible of the advanced networking technologies, because most clouds allow certain instance types to have multiple network cards. AWS provides pseudo-RDMA in the form of the Scalable Reliable Datagram (SRD), which, according to benchmarks, performs somewhere between TCP and RDMA. True RDMA is only available on high-performance-computing instances on Azure, GCP, and OCI. Only Alibaba provides RDMA on general-purpose compute instances.
Note: Though it might have latency drawbacks similar to SRD’s. Alibaba’s RDMA is deployed via iWARP, which might be slightly slower, but I haven’t seen any benchmarks.
SmartNICs aren’t publicly available anywhere. There are good reasons for this: Microsoft’s published papers note that deploying RDMA is hard. Really hard. Even their paper on using RDMA successfully stresses how hard it is. It’s been nearly a decade since Microsoft started using RDMA internally, and it’s still not available in their cloud; I can’t guess whether or when it might appear.
In storage, the situation isn’t much better. SMR HDDs, during their brief appearance in the consumer market, still presented themselves as drives supporting the block storage API, and consumers reacted very negatively. ZNS SSDs seem similarly locked behind enterprise-only procurement agreements. One might think that Intel discontinuing Optane persistent memory and SSDs means they’re unavailable in the cloud, but Alibaba still offers persistent-memory-optimized instances. The excellent team at Spare Cores kindly provided me with `nvme id-ctrl` output from every cloud provider, and none of the NVMe devices they could get their hands on support any of the optional features: copy, fused compare-and-write, data integrity extensions, or multi-block atomic writes.
Note: Though AWS supports torn write prevention, and GCP previously had similar documentation.
Alibaba is also the only cloud supplier investing in SmartSSDs, collaborating with ScaleFlux on research for PolarDB. This still means SmartSSDs aren’t publicly available, but even the paper acknowledges this is “the first actual deployment of a cloud-native database using computational storage drives reported in public literature.”
In computing, the situation finally improves. Clouds fully allow Unikernels, TPMs are widely available, but as far as I know only AWS and Azure support secure enclaves. Time synchronization is offered, but without promised error bounds it’s impossible to critically depend on. (Hardware transactional memory isn’t available, but that’s hard to blame on cloud providers.) The explosive growth of AI means there’s plenty of funding for more efficient compute. GPUs are available in every cloud. AWS[^5], Azure, IBM, and Alibaba offer FPGA instances (GCP and OCI don’t). The unfortunate reality is that faster compute only matters once compute is the bottleneck. Both GPUs and FPGAs have limited memory, so they can’t keep the database in local memory; instead they rely on streaming data in and out, which means they’re limited by PCIe speeds. All of this rewards thoughtful motherboard layout and bus design in on-premise hardware, but that isn’t something one can do in the cloud.
Note: Ideally, one would want peer-to-peer DMA support, so data could be read directly from disk into the FPGA, but at least AWS’s F1 instances don’t support this.
Therefore, my view of next-generation databases is pessimistic: no one can build a database that critically depends on these new hardware advances before the hardware is available, and no cloud provider wants to deploy hardware that nothing can immediately use. Next-generation databases are trapped in this circular dependency on their own non-existence.
Note: Except cloud suppliers themselves. Most notably, Microsoft and Google already have RDMA internally and extensively use it in their database products while not allowing public use. I’ve had a draft article outline titled “Cloud Suppliers’ RDMA Competitive Advantage.”
Alibaba, however, performs surprisingly well here: they consistently lead in making all of these hardware advances available. I’m surprised I don’t see Alibaba Cloud used more often for benchmarking in academia and industry.