A programming and hobby blog.
gRPC comes up occasionally on the Orange Site, often with a redress of grievances in the comment section. One of the major complaints people have with gRPC is that it requires HTTP trailers. This one misstep has caused so much heartache and trouble that I think it is probably the reason gRPC failed to achieve its goal. Since I was closely involved with the project, I wanted to rebut some misconceptions I see posted a lot, and warn future protocol designers against the mistakes we made.
Mini History of gRPC's Origin
gRPC was reared by two parents trying to solve similar problems:
- The Stubby team. They had just begun the next iteration of their RPC system, used almost exclusively throughout Google. It handled 10^10 queries per second in 2015. Performance was a key concern.
- The API team. This team owned the common infrastructure serving (all) public APIs at Google. The primary value-add was converting REST+JSON calls to Stubby+Protobuf. Performance was a key concern.
The push to Cloud was coming on strong from the top, and the two teams joined forces to ease the communication from the outside world to the inside. Rather than boil the ocean, they decided to reuse the newly minted HTTP/2 protocol. Additionally, they chose to keep Protobuf as the default wire format, but allow other encodings too. Stubby had tightly coupled the message, the protocol format, and custom extensions, making it impossible to open source just the protocol.
Thus, gRPC would allow intercommunication between browsers, phones, servers, and proxies, all using HTTP semantics, and without forcing the entirety of Google to change message formats. Since message translation is no longer needed, high-speed communication between endpoints is tractable.
HTTP, HTTP/1.1, and HTTP/2
HTTP is about semantics: headers, messages, and verbs.
HTTP/1.1 is a mix of a wire format, plus the semantics (RFCs 7231-7239). gRPC tries to keep the HTTP semantics, while upgrading the wire format. Around 2014-15, SPDY was being tested by Chrome and GFE as a workaround for problems with HTTP/1.1. Specifically:
- Most browsers limit connection counts to a domain to 2-6. This means there can be at most 2-6 active requests.
- Pipelining breaks many, many devices that neither the end-user nor the server can control. A failure in a pipelined request causes the entire connection to be severed.
- Head-of-line blocking. A slow response in a pipeline prevents the load of other, ready responses.
- Authentication tokens, cookies, and other headers have become enormous. The headers often exceed the size of the body.
Acting on the promising improvements seen in the SPDY experimentation, the protocol was formalized into HTTP/2. HTTP/2 only changes the wire format, but keeps the HTTP semantics. This allows newer devices to downgrade the wire format when speaking with older devices.
As an aside, HTTP/2 is technically superior to WebSockets. HTTP/2 keeps the semantics of the web, while WS does not. Additionally, WebSockets suffer from the same head-of-line blocking problem HTTP/1.1 does.
Those Contemptible Trailers
Most people do not know this, but HTTP has had trailers in the specification since 1.1. The reason they are so uncommonly used is that most user agents don't implement them, and don't surface them to the JS layer.
Several events happened around the same time, which led to the bet on requiring trailers:
- HTTP/1.1 had semantic support for trailers.
- HTTP/2 had just been minted, and had wire support for trailers.
- The fetch API had just added support for trailers.
The thinking went like this:
- Since we are using a new protocol, any devices that use it will need to upgrade their code.
- When they upgrade their code, they will need to implement trailer support anyway.
- Since HTTP/2 effectively mandates TLS (browsers only speak it over TLS), it is unlikely middleboxes will error on unexpected trailers.
Why Do We Need Trailers At All?
So far, we've only talked about whether it's possible to use trailers, not whether we should use them. It's been over two decades, and we haven't needed them yet, so why put such a big risk into the gRPC protocol?
The answer is that they solve an ambiguity. Consider the following HTTP conversation:
    GET /data HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Server: example.com

    abc123
In this flow, what was the length of the /data resource? Since we don't have a Content-Length, we are not sure the entire response came back. If the connection was closed, does it mean it succeeded or failed? We aren't sure.
Since streaming is a primary feature of gRPC, we often will not know the length of the response ahead of time. HTTP aficionados are probably feeling pretty smug right now: "Why don't you use Transfer-Encoding: chunked?" This too is insufficient, because errors can happen late in the response cycle. Consider this exchange:
    GET /data HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Server: example.com
    Transfer-Encoding: chunked

    6
    abc123
    0
Suppose that the server was in the middle of streaming a chat room message back to us, and there is a reverse proxy between our user agent and the server. The server sends chunks back to us, but after sending the first chunk of 6 bytes, the server crashes. What should the proxy send back to us? It's too late to change the response code from 200 to 503. If there were pipelined requests, all of them would need to be thrown away too. If this proxy wanted to keep the connection open (remember, connections cost a lot to make), it would not want to terminate it for an arguably recoverable scenario.
Hopefully this illustrates the ambiguity between successful, complete responses, and a mic-drop. What we need is a clear sign the response is done, or a clear sign there was an error.
Trailers are this final word, where the server can indicate success or failure in an unambiguous way.
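To make the ambiguity concrete, here is a minimal sketch of a chunked-body reader, assuming the HTTP/1.1 chunked wire format from the exchange above. The function name `parse_chunked` is made up for illustration, and the trailer name `grpc-status` is borrowed from gRPC (which actually runs over HTTP/2, where trailers travel in a final HEADERS frame rather than after a zero-length chunk). It is not how any real gRPC library parses responses.

```python
from typing import Dict, Tuple


def parse_chunked(raw: bytes) -> Tuple[bytes, Dict[str, str], bool]:
    """Return (body, trailers, complete) for a chunked HTTP/1.1 body.

    Hypothetical helper for illustration; malformed size lines simply raise.
    """
    body = b""
    trailers: Dict[str, str] = {}
    pos = 0
    # Read chunks until the zero-length "last chunk".
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            return body, trailers, False  # hung up before the next size line
        # Chunk size is hex; ignore any ";extension" after it.
        size = int(raw[pos:eol].split(b";")[0].decode("ascii"), 16)
        pos = eol + 2
        if size == 0:
            break
        chunk = raw[pos:pos + size]
        if len(chunk) < size or raw[pos + size:pos + size + 2] != b"\r\n":
            return body + chunk, trailers, False  # hung up mid-chunk
        body += chunk
        pos += size + 2
    # After the last chunk: optional trailer fields, then a blank line.
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            return body, trailers, False
        line = raw[pos:eol]
        pos = eol + 2
        if not line:
            return body, trailers, True
        name, _, value = line.partition(b":")
        trailers[name.decode("ascii").strip().lower()] = value.decode("ascii").strip()


crashed = parse_chunked(b"6\r\nabc123\r\n")
finished = parse_chunked(b"6\r\nabc123\r\n0\r\ngrpc-status: 0\r\n\r\n")
print(crashed)   # body arrived, but complete is False and there is no status
print(finished)  # same body, plus an explicit status in the trailers
```

In the crashed case the client holds the same six bytes as in the finished case, and only the missing terminator hints that something went wrong. The trailer is what lets the server state the outcome explicitly, and a non-zero status could report a late failure without touching the 200 that was already sent.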
Trailers for JSON vs. Protobuf
While gRPC is definitely not Protobuf specific, it was created by people who have been burned by Protobuf's encoding. The encoding of Protobuf probably had a hand in the need for trailers, because it's not obvious when a Proto is finished. Protobuf messages are a concatenation of Key-Length-Values. Because of this structure, it's possible to concatenate 2 Protos together and the result is still valid. The downside of this is that there is no obvious point at which the message is complete. An example of the problem:
    syntax = "proto3";

    message DeleteRequest {
      string id = 1;
      int32 limit = 2;
    }
The wire format for an example message looks like:
    Field 1: "zxy987"
    Field 2: 1
A program can override a value by adding another field on:
    Field 2: 1000
The concatenation would be:
    Field 1: "zxy987"
    Field 2: 1
    Field 2: 1000
Which would be interpreted as:
    Field 1: "zxy987"
    Field 2: 1000
This makes encoding messages faster, since there is no size field at the beginning of the message. However, there is now a (mis-)feature where Protos can be split or joined along KLV boundaries.
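Here is a rough sketch of that behavior with the bytes spelled out. It hand-rolls just enough of the Protobuf wire format (varints and length-delimited fields) to encode the DeleteRequest above, so it does not need the protobuf library. All function names are made up for illustration, and negative numbers, other wire types, and repeated fields are deliberately left out.

```python
from typing import Dict, Tuple, Union


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_string_field(number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    # Wire type 2: key, then length, then the bytes.
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def encode_int_field(number: int, value: int) -> bytes:
    # Wire type 0: key, then the varint value.
    return encode_varint((number << 3) | 0) + encode_varint(value)


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes) -> Dict[int, Union[int, str]]:
    """Decode fields until the buffer runs out; there is no end marker."""
    fields: Dict[int, Union[int, str]] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
            fields[number] = value  # a later value replaces an earlier one
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            fields[number] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
    return fields


original = encode_string_field(1, "zxy987") + encode_int_field(2, 1)
override = encode_int_field(2, 1000)

print(decode(original))             # id and limit as encoded
print(decode(original + override))  # the appended limit wins
print(decode(original[:8]))         # cut at a field boundary: still parses
```

The last line is the part that matters for trailers: a message cut off at a field boundary decodes without complaint, just with fewer fields, so the decoder alone cannot tell a short message from a truncated one.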
JSON has the upper hand here. With JSON, the message has to end with a curly } brace. If we haven't seen the final curly, and the connection hangs up, we know something bad has happened. JSON is self-delimiting, while Protobuf is not. It's not hard to imagine that trailers would be less of an issue if the default encoding were JSON.
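A quick way to see the difference, using only Python's standard json module (the helper name `is_complete_json` is made up for this sketch):

```python
import json


def is_complete_json(text: str) -> bool:
    """Return True if text parses as a whole JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


message = '{"id": "zxy987", "limit": 1000}'

print(is_complete_json(message))       # the closing brace arrived
print(is_complete_json(message[:-1]))  # connection hung up one byte early
print(is_complete_json(message[:16]))  # hung up right after the id value
```

One caveat: this only holds for objects and arrays. A bare top-level number such as 1000 cut down to 100 is still valid JSON, so the self-delimiting property comes from the brackets, not from JSON as a whole.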
The Final Nail in gRPC's Trailers
Trailers were officially added to the fetch API, and all major browsers said they would support them. The authors were part of the WHATWG, and worked at the companies that could actually put them into practice. However, Google is not one single company, but a collection of independent and distrusting companies. While the point of this post is not to point fingers, a single engineer on the Chrome team decided that trailers should not be surfaced up to the JS layer. You can read the arguments against it, but the short version is that there was some fear around semantic differences causing security problems. For example, if a Cache-Control header appears in the trailers, does it override the one in the headers?
I personally found this reason weak, and offered a compromise of treating them as semantic-less key-values surfaced up to the fetch layer. Whether it's because I was wrong, or because I failed to make the argument, I strongly suspect organizational boundaries had a substantial effect. The Area Tech Leads of Cloud also failed to convince their peers in Chrome, and as a result, trailers were ripped out.
Lessons for Designers
This post hopefully exposed why trailers were included, and why they ultimately didn't work. I left the gRPC team in 2019, but I still think fondly of what we created. There are gobs of things the team got right; unfortunately this one mistake ended up being the demise. Some takeaways:
- Organizational problems are harder than technological ones. Solve the harder problems first. If we had met with the Chrome team years earlier, we could have designed around this road block. As the saying goes, "Weeks of working can save hours of planning."
- Updating code is nearly impossible. Compatibility with the existing system matters more than all the features and performance improvements. The best protocol is the one you can use.
- Focus on customers. Despite locking horns with other orgs, our team had a more critical problem: we didn't listen to early customer feedback. We could have modified the servers and clients to speak an updated version of the protocol that obviated the need for trailers. (There's even room in the gRPC frame for it!) It was our lack of sympathy that sank us, ultimately.
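For the curious, here is a rough sketch of what "room in the gRPC frame" could look like. Each gRPC message is prefixed with 5 bytes: a 1-byte flag (of which only the lowest bit, "compressed", is used) and a 4-byte big-endian length. The sketch below assumes the top bit of the flag byte marks a frame that carries trailers in the body, which is the approach gRPC-Web takes. The function names are hypothetical, and this is an illustration of the idea, not the protocol change we would have shipped.

```python
import struct
from typing import List, Tuple

COMPRESSED_FLAG = 0x01  # defined by the gRPC length-prefixed message format
TRAILER_FLAG = 0x80     # assumption: top bit marks an in-body trailers frame


def encode_frame(payload: bytes, trailers: bool = False) -> bytes:
    """Prefix payload with a 1-byte flag and a 4-byte big-endian length."""
    flags = TRAILER_FLAG if trailers else 0
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(buf: bytes) -> List[Tuple[bool, bytes]]:
    """Split a byte stream into (is_trailers, payload) frames."""
    frames = []
    pos = 0
    while pos < len(buf):
        if pos + 5 > len(buf):
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", buf, pos)
        pos += 5
        payload = buf[pos:pos + length]
        if len(payload) < length:
            raise ValueError("truncated frame payload")
        pos += length
        frames.append((bool(flags & TRAILER_FLAG), payload))
    return frames


stream = encode_frame(b"abc123") + encode_frame(b"grpc-status: 0\r\n", trailers=True)
for is_trailers, payload in decode_frames(stream):
    print("trailers" if is_trailers else "message", payload)
```

Because the status rides inside the response body, a client only needs to read bytes, which every browser already exposes, and the stream is known to be complete once the trailers frame shows up.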