
How Discord supercharges network disks for extreme low latency

It’s no secret that Discord has become your place to talk; the 4 billion messages people send through the platform each day have us convinced. But text chat makes up only a fraction of the features that Discord supports. Server roles, custom emojis, video calls, and much more all contribute to the hundreds of terabytes of data we serve to our users.

To serve this enormous amount of data, we run a set of NoSQL database clusters (powered by ScyllaDB), each one the source of truth for its respective data set. As a real-time chat platform, we want our databases to respond to the high frequency of queries as quickly as possible.

A line graph of our databases’ incoming requests. At the time of this screenshot, they were serving around 2 million requests per second.

Scaling Beyond Our Hardware

The biggest influence on our database performance is the latency of individual disk operations: how long it takes to read or write data from the physical hardware. Below a certain database query rate, disk latency isn’t noticeable, as our databases do a great job of handling requests in parallel (not blocking on a single disk operation). But this parallelism is limited; at a certain threshold, the database has to wait for an outstanding disk operation to complete before it will issue another. If you combine this with disks that take a millisecond or two to complete an operation, the database eventually reaches a point where it can no longer immediately fetch data for incoming queries. This causes disk operations and queries to “back up”, slowing the response to the client that issued the query, which in turn causes poor application performance. In the worst case, this can cascade into an ever-expanding queue of disk operations whose queries time out by the time the disk is available. This is exactly what we were seeing on our own servers: the database would report an ever-growing queue of disk reads, and queries would start timing out.
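To put rough numbers on this (illustrative figures, not measurements from our fleet): if each disk operation takes about 2 ms and a database can keep roughly 32 operations in flight per disk, that disk tops out near 32 / 0.002 s = 16,000 operations per second. Push the query rate past that and every new read has to wait behind an outstanding one, so the queue, and the query latency, keeps growing.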

But wait: a millisecond or two to complete a disk operation? Why are we seeing this behavior when disk latency is usually measured in microseconds?

Discord runs most of its hardware in Google Cloud, which provides ready access to Local SSDs (NVMe-based instance storage), which do have incredibly fast latency profiles. Unfortunately, in our testing we ran into enough reliability issues that we didn’t feel comfortable depending on this solution for our critical data storage. This took us back to the drawing board: how do we get incredibly low latency when we can’t rely on the super-fast on-device storage?

The other main type of instance storage in GCP is called Persistent Disks. These are disks that can be attached to and detached from servers on the fly, can be resized without downtime, can generate point-in-time snapshots at any time, and are replicated by design (to prevent data loss when a single piece of hardware dies). The downside is that these disks aren’t attached directly to a server, but are connected from a somewhat-nearby location (probably the same building as the server) via the network.

While latency over a local network connection is low, it’s nowhere near as low as over a PCI or SATA connection that spans less than a meter. This means that the average latency of disk operations (from the perspective of the operating system) can be on the order of a couple of milliseconds, compared to half a millisecond for directly-attached disks.


Local SSDs have other concerns as well. As with traditional hard drives, a hardware issue with these disks (or with a disk controller) means we immediately lose everything on that disk. But worse than with traditional hard drives is what happens when the host has problems: if the host to which the Local SSDs are attached has critical issues, the disks and their data are gone forever. We also lose the ability to create point-in-time snapshots of a whole disk, which is crucial for certain workflows at Discord (like some data backups). These missing features are why almost all Discord servers run on Persistent Disks rather than Local SSDs.
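Those point-in-time snapshots are also about as low-friction as backups get: a single gcloud command snapshots a Persistent Disk while it stays attached and in use. The disk and snapshot names below are hypothetical, purely to illustrate the workflow:

    gcloud compute disks snapshot db-data-disk \
        --zone=us-east1-b \
        --snapshot-names=db-data-backup-2022-06-01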



Evaluating the Issue

In an ideal world, we’d power our databases with a disk that combined the best properties of Persistent Disks and Local SSDs. Unfortunately, no such disk exists, at least not within the ecosystem of common cloud providers. Requesting low-latency, directly-attached disks removes the abstraction that gives Persistent Disks their amazing flexibility.

But what if we didn’t need all of that flexibility? For example, write latency isn’t critical for our workloads: it’s read latency that has the biggest effect on application performance (due to our read-heavy workloads). And resizing disks without downtime isn’t a major feature for us; we can estimate our storage growth and provision larger disks in advance.


After thinking through what mattered most for the operation of our databases, we narrowed down the requirements for solving our database woes:

  • Stay within Google Cloud (i.e. leverage GCP’s disk offerings)
  • Continue using point-in-time snapshotting for data backups
  • Prioritize low-latency disk reads over all other disk metrics
  • Do not sacrifice existing database uptime guarantees

The various GCP disk types each meet these requirements in different ways. It would be all too convenient if we could combine the two disk types into one super-disk. Since our primary focus for disk performance was low-latency reads, we’d want to read from GCP’s Local SSDs (low latency) while still writing to Persistent Disks (snapshotting, redundancy via replication). But is there a way to create such a super-disk at the software level?



Creating the Super-Disk

What we’d described with this requirement was essentially a write-through cache, with GCP’s Local SSDs as the cache and Persistent Disks as the storage layer. We run Ubuntu on our database servers, so we were fortunate to find that the Linux kernel can cache data at the disk level in several ways, providing modules such as dm-cache, lvm-cache, and bcache.
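For illustration, the lvm-cache flavor of that idea can be sketched roughly as follows. Device names, sizes, and volume names are hypothetical, and the exact lvconvert invocation differs between lvm2 versions, so treat this as a sketch of the approach rather than a recipe:

    # The Persistent Disk (/dev/sdb) is the storage layer, the Local SSD (/dev/nvme0n1) is the cache
    pvcreate /dev/sdb /dev/nvme0n1
    vgcreate vg_db /dev/sdb /dev/nvme0n1

    # Backing volume on the Persistent Disk, cache volume on the Local SSD
    lvcreate -n data -L 900G vg_db /dev/sdb
    lvcreate -n fastcache -L 300G vg_db /dev/nvme0n1

    # Attach the cache in write-through mode so every write also lands on the Persistent Disk
    lvconvert --type cache --cachevol fastcache --cachemode writethrough vg_db/data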

Unfortunately, our experimentation with caching led us to a couple of pitfalls. The biggest one was how failures in the cache disk were handled: reading a bad sector from the cache caused the whole read operation to fail. Local SSDs, a thin layer on top of NVMe SSD hardware, suffer from bad sectors like any physical disk. These bad sectors could be fixed by overwriting the sector on the cache with data from the storage layer, but the disk caching solutions we evaluated either didn’t have this capability or required more complex configuration than we wanted to take on in this phase of research. Without the cache fixing bad sectors, they get exposed to the calling application, and our databases will shut down for data safety reasons when encountering bad sector reads:

storage_service – Shutting down communications due to I/O errors until operator intervention

storage_service – Disk error: std::system_error (error system: 61, No data available)

With our requirements updated to include “survive bad sectors on the Local SSD”, we investigated a completely different kind of Linux kernel system: md.

md allows Linux to create software RAID arrays, turning multiple disks into one “array” (virtual disk). A simple mirrored (RAID1) array between Local SSDs and Persistent Disks wouldn’t solve our problem; reads would still hit the Persistent Disks for roughly half of all operations. However, md offers additional features not found in a typical RAID controller, one of which is “write-mostly”. The kernel man pages give the best summary of the feature:

Individual devices in a RAID1 can be marked as “write-mostly”. These drives are excluded from the normal read balancing and will only be read from when there is no other option. This can be useful for devices connected over a slow link.

Since “devices connected over a slow link” happens to be a perfect description of Persistent Disks, this looked like a viable strategy for building the super-disk. A RAID1 array containing a Local SSD and a Persistent Disk set to write-mostly would meet all of our requirements.
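A minimal sketch of that idea (device names here are hypothetical, and this reflects our reading of the mdadm documentation rather than a production recipe): the write-mostly flag applies to the devices listed after it when the mirror is created, so the Persistent Disk gets flagged and the Local SSD serves the reads.

    # /dev/nvme0n1 is a Local SSD, /dev/sdb is a Persistent Disk.
    # The Persistent Disk is marked write-mostly, so reads are served
    # from the Local SSD whenever it is available.
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/nvme0n1 --write-mostly /dev/sdb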

One last problem remained: Local SSDs in GCP are exactly 375GB in size. Discord requires a terabyte or more of storage per database instance for some applications, so this is nowhere near enough room. We could attach multiple Local SSDs to a server, but we needed a way to turn a bunch of smaller disks into one larger disk.


md offers a number of RAID configurations that stripe data across multiple disks. The simplest method, RAID0, splits raw data across all disks, and if one disk is lost, the whole array fails and all data is lost. More complex methods (RAID5, RAID6) maintain parity and allow the loss of at least one disk at the cost of performance penalties. That’s great for maintaining uptime: just remove the failed disk and replace it with a fresh one. But in the GCP world, there is no concept of replacing a Local SSD; these are devices located deep inside Google data centers. Furthermore, GCP has an interesting “guarantee” around the failure of Local SSDs: if any Local SSD fails, the whole server is migrated to a different set of hardware, essentially erasing all Local SSD data for that server. Since we don’t (can’t) worry about replacing Local SSDs, and to reduce the performance impact of striped RAID arrays, we settled on RAID0 as our strategy for turning multiple Local SSDs into one low-latency virtual disk.
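A sketch of that striping step, again with hypothetical device names, combining four 375GB Local SSDs into one virtual disk of roughly 1.5TB:

    # Four Local SSDs, 375GB each, striped into a single ~1.5TB RAID0 array
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4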

With a RAID0 on top of the Local SSDs, and a RAID1 between the Persistent Disk and the RAID0 array, we could configure the database with a disk drive that would offer low-latency reads, while still allowing us to benefit from the best properties of Persistent Disks.
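Putting the two layers together looks roughly like the sketch below (same hypothetical device names as above, and the filesystem and mount point are only examples): the RAID0 array of Local SSDs becomes one member of the mirror, and the Persistent Disk joins as the write-mostly member, exactly as in the earlier single-disk sketch.

    # md0 is the RAID0 array of Local SSDs from the previous step
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/md0 --write-mostly /dev/sdb

    # Put a filesystem on the mirror and hand it to the database
    mkfs.xfs /dev/md1
    mount /dev/md1 /var/lib/scylla

    # /proc/mdstat should list the Persistent Disk member with a (W) marker,
    # confirming it is excluded from normal read balancing
    cat /proc/mdstat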

A chart displaying how the hardware and md interact. On the hardware side, four NVMe Local SSD drives feed into “md0” (RAID0), which then flows into “md1” (RAID1). As an alternate pathway, the “Persistent Disk” hardware is attached “write-mostly” to “md1” (RAID1) directly, bypassing md0.



Database Performance

This new disk configuration looked good in testing, but how would it behave with an actual database on top of it?

We saw exactly what we expected: at peak load, our databases no longer started queueing up disk operations, and we saw no change in query latency. In practice, this means our metrics show fewer outstanding database disk reads on super-disks than on Persistent Disks, thanks to less time spent on I/O operations.

A line chart titled “System iowait”, showing the share of time the system spent waiting on disk I/O. Persistent Disk hovers around 8e-3, while Super-Disk sits between 2e-3 and 4e-3.
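If you want to eyeball the same signal on a single host, standard tools will show it; as a sketch (not our actual monitoring stack), iostat from the sysstat package reports both per-device read latency and the system-wide iowait share:

    # Watch the md1 row: r_await is the average time (ms) a read spends queued and serviced.
    # The %iowait column in the CPU summary is the share of time spent waiting on I/O.
    iostat -x 1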

These performance increases let us squeeze more queries onto the same servers, which is great news for those of us maintaining the database servers (and for the finance department).

Conclusion

In retrospect, disk latency should have been an obvious concern early on in our database deployments. The world of cloud computing causes so many systems to behave in ways that are nothing like their physical data center counterparts. The research and testing that went into developing our super-disk solution gave us many useful performance metrics to monitor, taught the team about the inner workings of disk devices (in both Linux and GCP), and improved our culture of testing and validating architectural changes. With super-disks introduced to production, our databases have continued to scale with the growth of Discord’s user base.

Anyone who has ever worked with RAID before might be suspicious that such a setup would just work; there are a lot of systems at play in a cloud environment that can fail in fascinating new ways. There’s more going on to support this disk setup than just a single md configuration. Expect a part two to this blog post that will go into greater detail about the specific edge cases we’ve run into in the cloud environment and how we’ve solved them.

Lastly, if you like what you see here, come join us! We’re hiring!
