free counter

Plan B for UUIDs: double AES-128

It appears like internauts are experiencing another go at the UUID as primary key debate,where in fact the fundamental problem may be the tension between nicely structured primary keys that have a tendency to improve spatial locality in the storage engine,and unique but otherwise opaque identifiers that avoid running into Hyrums law when communicating with external entities and generally prevent unintentional information leakage.1

I assume Im lucky that the systems Ive done mostly fall in two classes: 2

  1. people that have trivial write load (often trivial load generally), where in fact the performance implications of UUIDs for primary keys are irrelevant.

  2. those where performance concerns lead us to heavily partition the info, by tenant or even more finely making information leaks from sequentially allocation a concern.

Needless to say, theres always the chance that something in the initial class eventually handles a higher load.Until roughly 2016, I figured we’re able to always sacrifice some opacity and switch to one of the numerous k-sorted alternatives developed by webscale companies.

By 2016-17, I felt comfortable assuming AES-NI was on any x86 server,3 and that opens up an alternative:use structured leaky keys internally, and encrypt/decrypt them at the edge (e.g., by printing a user-defined type in the database server).Assuming we obtain the cryptography right, this approach lets ushave our cake (present structured keys to the databases storage engine),and eat it too (present opaque unique identifiers to external parties),provided that the computation overhead of repeated encryption and decryption at the edge remains reasonable.

I cant know why this process has so little mindshare, but I believe area of the reason should be that developers generally have an outdated mental cost model for strong encryption like AES-128.4This quantitative concern may be the easiest to handle, so thats what Ill do in this article.That leaves the most common hard design questions around complexity, debuggability, and failure modes and new ones linked to symmetric key management.

Brandur compares sequential keys and UUIDs.Im thinking more generally about structured keys, which might be sequential in single-node deployments, or add a short sharding prefix in smaller (range-sharded) distributed systems.Eventually, a brief prefix will go out of bits, and fully random UUIDs are better quality for range-sharded systems that may scale out to a huge selection of nodesespecially ones focused more on horizontal scalability than single-node performance.

That said, design decisions that unlock scalability to hundreds or a large number of nodes tend to also force one to distribute work over twelve machines whenever a laptop may have sufficed.

Mentioning cryptography makes people require a crisp threat model.There isnt one here (and the question is practical outside cryptography and auth!).

Based on the domain, leaky or guessable external ids can enable scraping,let competitors estimate the creation rate and amount of accounts (or, similarly, activity) in the application,or, more benignly, expose an accidentally powerful API endpoint which will be difficult to displace.

Instead of make an effort to pinpoint the precise degree of dedication were attempting to foil, from curious power user to nation state actor, lets shoot for something thats hopefully as hard to break as our transport (e.g., HTTPS).AES ought to be helpful.

Hardware-assisted AES: not not fast

Intel shipped their first chip with AES-NI in 2010, and AMD in 2013.Ten years later, its not exotic, and can be acquired even yet in low-power Goldmont Atoms.For consumer hardware, with an extended tail of old machines than servers, the May 2022 Steam hardware survey shows 96.28% of the responses originated from machines that support AES-NI (under Other Settings), an availability rate somewhere within those of AVX (2011) and SSE4.2 (2008).

The core of the AES-NI extension to the x86-64 instruction set is really a couple of instructions to execute one round of AES encryption (AESENC) or one round of decryption (AESDEC) on a 16-byte block.Andreas Abels implies that the initial implementation, in Westmere, had a 6-cycle latency for every round, and that Intel and AMD have already been optimising the instructions to create their latencies right down to 3 (Intel) or 4 (AMD) cycles per round.

Thats very good (on the order of a multiplication), but each instruction only handles one round. The schedule for AES-128, the fastest option, includes 10 rounds:a short whitening xor, 9 aesenc / aesdec and 1 aesenclast / aesdeclast.Multiply 3 cycles per round by 10 real rounds, and we look for a latency of 30 cycles ((+ 1) for the whitening xor) on recent Intels and (40 + 1) cycles on recent AMDs, assuming the main element material has already been obtainable in registers or L1 cache.

This may be disappointing considering that AES128-CTR could already achieve a lot more than 1 byte/cycle in 2013.Theres a gap between throughput and latency because pipelining lets contemporary x86 chips start two rounds per cycle, while prior rounds remain in flight (i.e., 6 concurrent rounds when each includes a 3 cycle latency).

Still, 35-50 cycles latency to encrypt or decrypt an individual 16-byte block with AES-128 is comparable to a L3 cache hitreally not that bad when compared with executing a durable DML statement, or perhaps a single lookup in a big hash table stored in RAM.

A trivial encryption scheme for structured keys

AES works on 16 byte blocks, and 16-byte randomish external ids are usually accepted practice.The easiest method of turn structured keys into something thats provably difficult to tell apart from random bits probably goes the following:

  1. Fix a worldwide AES-128 key.
  2. Let primary keys contain a sequential 64-bit id and a randomly generated 64-bit integer.5
  3. Convert a primary key to an external id by encrypting the principal keys 128 bits with AES-128, utilizing the global key (each global key defines a distinctive permutation from 128 bits input to 128 bit output).
  4. Convert an external id to a potential primary key by decrypting the external id with AES-128, utilizing the same global key.

source: aes128.c

The computational core is based on the encode and decode functions, two identical functions from the performance perspective.We are able to estimate just how long it requires to encode (or decode) an identifier by executing encode in a good loop, with a data dependency linking each iteration to another;the info dependency is essential to avoid superscalar chips from overlapping multiple loop iterations.6

uiCA predicts 36 cycles per iteration on Ice Lake.On my unloaded 2 GHz EPYC 7713, I observe 50 cycles/encode (without frequency boost), and 13.5 ns/encode when boosting an individual active core.Thats orders of magnitude significantly less than a syscall, and in exactly the same range as a slow L3 hit.

source: aes128-latency.c

This simple solution works if our external interface may expose arbitrary 16-byte ids.AES-128 defines permutation, so we’re able to also run it backwards to create sequence/nonce pairs for preexisting rows that avoid changing their external id an excessive amount of (e.g., pad integer ids with zero bytes).

However, its sometimes vital that you generate valid UUIDs,or even to at the very least save one bit in the encoding being an escape hatch for a versioning scheme.We are able to do this, with format-preserving encryption.

Controlling one bit in the external encrypted id

We view our primary keys as pairs of 64-bit integers, where in fact the first integer is really a sequentially allocated identifier.Realistically, the very best little bit of that sequential id will be zero (i.e., the initial integers value will undoubtedly be significantly less than (2^63)).Lets ask exactly the same of our external ids.

The code in this article assumes a little-endian encoding, for simplicity (and as the world runs on little endian), however the same logic works for big endian.

Black and Rogaways cycle-walking method can efficiently fix one input/output bit: we just keep encrypting the info until bit 63 is zero.

When decrypting, we realize the original (fully decrypted) value had a zero in bit 63, and we also understand that we only re-encrypted once the output did not have a zero in bit 63.This implies we are able to keep iterating the decryption function (at least one time) until we look for a value with a zero in bit 63.

source: aes128-cycle-walk.c

This process terminates after two rounds of encryption (encode) or decryption (decode), in expectation.

Thats pretty good, however, many might prefer a deterministic algorithm.Moreover, the expected runtime scales exponentially with the amount of bits you want to control, no one really wants to turn their database server right into a glorified shitcoin miner.This exponential scaling is definately not perfect for UUIDv4, where only 122 of the 128 bits become payload:we are able to be prepared to loop 64 times to be able to fix the rest of the 6 bits.

Controlling more bits with a Feistel network

A Feistel network derives a permutation over tuples of values from hash functions on the individual values.You can find NIST tips for general format-preserving encryption (FFX) with Feistel networks, however they demand 8+ AES invocations to encrypt one value.

FFX solves a much harder problem than ours:we only have 64 bits (not) of actual information, the others is merely random bits.Full format-preserving encryption must assume everything in the input is meaningful information that has to not be leaked, and supports arbitrary domains (e.g., decimal charge card numbers).

Our situation is nearer to a 64-bit payload (the sequential id) and a 64-bit random nonce.Its tempting to simply xor the payload with the reduced items of (truncated) AES-128, or any PRF like SipHash7 or BLAKE3 put on the nonce:

BrokenPermutation(id, nonce):    id ^= PRF_k(nonce)[0:len(id)]  # e.g., truncated AES_k    return (id, nonce)

The nonce continues to be available, so we are able to apply exactly the same PRF_k to the nonce, and undo the xor (xor is really a self-inverse) to recuperate the initial id.Unfortunately, random 64-bit values could repeat on realistic database sizes (a couple of billion rows).When an attacker observes two external ids with exactly the same nonce, they are able to xor the encrypted payloads and discover the xor of both plaintext sequential ids.This may seem like a information leak, but clever folks have been recognized to amplify similar leaks and fully break encryption systems.

Intuitively, wed desire to also mix the 64 random bits with before returning an external id.That sounds nearly the same as a Feistel network, that Luby and Rackoff show that 3 rounds are very good:

PseudoRandomPermutation(A, B):    B ^= PRF_k1(A)[0:len(b)]  # e.g., truncated AES_k1    A ^= PRF_k2(B)[0:len(a)]    B ^= PRF_k3(A)[0:len(b)]        return (A, B)

This function is reversible (a constructive proof that its a permutation):apply the ^= PRF_k steps backwards order (at each step, the worthiness fed to the PRF passes unscathed), like peeling an onion.

If we let A function as sequentially allocated id, and B the 64 random bits, we are able to discover that xoring the uniformly generated B with a pseudorandom functions output is equivalent to generating bits uniformly.Inside our case, we are able to miss the first round of the Feistel network;we deterministically need exactly two PRF evaluations, rather than the two expected AES (PRP) evaluations for the prior cycle-walking algorithm.

ReducedPseudoRandomPermutation(id, nonce):    id ^= AES_k1(nonce)[0:len(id)]    nonce ^= AES_k2(id)[0:len(nonce)]    return (id, nonce)

It is a minimal tweak to repair BrokenPermutation: we hide the worthiness of nonce before returning it, to make it harder to utilize collisions.That Feistel network construction works for arbitrary splits between id and nonce, but closer (balanced) bitwidths are safer.For instance, we are able to work within the layout proposed for UUIDv8 and assign (48 + 12 = 60) bits for the sequential id (row id or timestamp), and 62 bits for the uniformly generated value.8

source: aes128-feistel.c

Again, we are able to measure the time it requires to encode (or symmetrically, decode) an interior identifier into an opaque UUID by encoding in a loop, with a data dependency between each iteration and another(source: aes128-feistel-latency.c).

The format-preserving Feistel network essentially does double the task of an ordinary AES-128 encryption, with a serial dependency between your two AES-128 evaluations.We expect roughly twice the latency, and uiCA agrees:78 cycles/format-preserving encoding on Ice Lake (in comparison to 36 cycles for AES-128 of 16 bytes).

On my unloaded 2 GHz EPYC 7713, I observe 98 cycles/format-preserving encoding (in comparison to 50 cycles for AES-128 of 16 bytes), and 26.5 ns/format-presering encoding when boosting an individual active core (13.5 ns for AES-128).

Still considerably faster when compared to a syscall, and, although doubly slow as AES-128 of 1 16 byte block, not that slow: somewhere within a L3 hit and lots from RAM.

Sortable internal ids, pseudo-random external ids: not not fast

With hardware-accelerated AES-128 (SipHash or BLAKE3 specialised for 8-byte inputs may possibly be slower, however, not unreasonably so), converting between structured 128-bit ids and opaque UUIDs takes significantly less than 100 cycles on contemporary x86-64 servers faster when compared to a load from main memory!

This post only addressed the question of runtime performance.I believe the true challenges with encrypting external ids arent strictly technical in nature, and also have more related to rendering it hard for programmers to accidentally leak internal ids.I dont understand how that could go because Ive never really had to utilize this trick in a production system, nonetheless it appears like it cant be harder than doing exactly the same in a schemas which have explicit internal primary keys and external ids on each table.Im also hopeful you can take action smart with views and user-defined types.

In any event, I really believe the runtime overhead of encrypting and decrypting 128-bit identifiers is really a non-issue for almost all database workloads.Arguments against encrypting structured identifiers should probably concentrate on system complexity, key management9 (e.g., between production and testing environments), and graceful failure when confronted with faulty hardware or code accidentally leaking internal identifiers.

Many thanks Andrew, Barkley, Chris, Jacob, Justin, Marius, and Ruchir, for helping me clarify this post, and for reminding me about things such as range-sharded distributed databases.

Read More

Related Articles

Leave a Reply

Your email address will not be published.

Back to top button

Adblock Detected

Please consider supporting us by disabling your ad blocker