Queues
Over the weekend, a small storm brewed on Twitter: #queuegate. Rick Branson did some thought leader-ing and Tim Bray responded on his blog. It mostly looked like two people talking way past each other, but there is a kernel of truth in what each is pointing out. I’ve spent a good bit of time in the past year thinking about when and how to use queues, informed by a little theory, a little reading, and a lot of building systems with queues and watching how they fail. Queues are a powerful tool for availability, but they have limitations in what they can model.
A queue decouples message consumption from message production. By removing the temporal coupling between producer and consumer, messages that the consumer might temporarily reject can be persisted until the consumer is able to consume them. If exposing producers to transient failure is problematic, a queue offers an inexpensive way out. The alternatives involve extreme levels of availability that are costly and difficult to guarantee. In my experience, the two common cases in which you cannot expose a producer to transient failure are when that producer is a customer/partner and when the production comes from a generic piece of software infrastructure (e.g. an event bus).
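A minimal sketch of that decoupling, assuming an in-memory deque as a stand-in for a durable broker. The names BufferedChannel and FlakyConsumer are hypothetical and exist only for illustration. The point is that publish never fails because of the consumer; messages just wait until the consumer is back.

```python
from collections import deque


class FlakyConsumer:
    """Hypothetical consumer that can be switched off to mimic an outage."""

    def __init__(self):
        self.available = True
        self.processed = []

    def handle(self, message):
        if not self.available:
            raise ConnectionError("consumer unavailable")
        self.processed.append(message)


class BufferedChannel:
    """In-memory stand-in for a queue; a real broker would persist to disk."""

    def __init__(self, consumer):
        self.consumer = consumer
        self.pending = deque()

    def publish(self, message):
        # The producer only ever talks to the queue, so a consumer outage
        # is invisible to it.
        self.pending.append(message)

    def drain(self):
        # Deliver in order; stop at the first failure and keep the rest queued.
        delivered = 0
        while self.pending:
            try:
                self.consumer.handle(self.pending[0])
            except ConnectionError:
                break
            self.pending.popleft()
            delivered += 1
        return delivered
```

Publishing while the consumer is down simply grows pending; calling drain() after the consumer recovers delivers the backlog in order.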
To Branson’s credit, very few systems are modeled in such a way that dropping a queue in doesn’t really fuck them up, even if we ignore the operational issues. Queues work well if and only if the relationship between the consumer and the producer is a conformist relationship with no guarantees of synchronicity. More plainly, the producer can send whatever it wants and does not give a fuck what the consumer does with it. This relationship shows up mostly in event-driven systems where some service publishes an event (e.g. UserAdded) and any listener can process it as necessary. Practically speaking, most of those systems tolerate fairly limited asynchrony, but they tolerate enough to get the benefits of queueing. So if we add a user and then need to send them a welcome email, a service that provides a welcome campaign can have a 99th percentile latency of 10 minutes to send the welcome email. That kind of tolerance means the service can go down for over an hour per week (1% of a week is roughly 1.7 hours), which is pretty awesome.
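The back-of-the-envelope math behind that tolerance can be sketched as follows. This is a rough model, not a capacity plan: it assumes events arrive uniformly over the week, there is a single outage, and the backlog drains instantly on recovery. The function names are mine.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def late_fraction(outage_min, latency_budget_min, window_min=MINUTES_PER_WEEK):
    """Fraction of events that miss the latency budget during one outage.

    An event arriving t minutes into the outage waits (outage_min - t), so
    only events in the first (outage_min - latency_budget_min) minutes are late.
    """
    return max(0.0, outage_min - latency_budget_min) / window_min


def max_single_outage_minutes(latency_budget_min, percentile,
                              window_min=MINUTES_PER_WEEK):
    """Longest single outage that still meets the given latency percentile."""
    allowed_late = 1.0 - percentile / 100.0
    return allowed_late * window_min + latency_budget_min
```

Under these assumptions, max_single_outage_minutes(10, 99) comes out at a bit under two hours, and late_fraction(60, 10) stays below the 1% that a 99th percentile target allows.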
The problem is when our systems actually have to react quickly to messages. The eventual consistency of the asynchronous conformist relationship described above spreads virally and is effectively impossible to cover up. One common failure mode with queues is when the system as a whole hasn’t bought into the asynchronous conformist relationship, but things work well until the queue backs up or the messages fail to process. This will happen at the worst possible times: at peak load or during some other failure.
Systems with empty queues can appear to be synchronous shared kernel or customer/supplier systems. Even when we don’t need to return a synchronous API call, we might be relying on our queues being “mostly synchronous”. In the most exciting cases, I’ve seen things like access limitations or security policies enforced through asynchronous queues. Organizations will often rely on these “mostly synchronous” queues to pretend they’ve gotten the benefits of queueing without changing the semantics of their systems. Systems that behave in fundamentally different ways under heavy load or failure are inherently more error-prone. You’ll exercise those new code paths and interleavings rarely and you sure as shit won’t test for them.
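To make the “mostly synchronous” trap concrete, here is a toy simulation of an access revocation enforced through a queue. Everything in it (the event tuples, the user "alice", the tick-based consumer) is a made-up illustration, not a model of any real system. It just counts how many consumer ticks pass before the revocation takes effect for a given backlog.

```python
from collections import deque


def ticks_until_revoked(backlog_size, events_per_tick):
    """Count consumer ticks until a queued revocation is actually applied."""
    if events_per_tick < 1:
        raise ValueError("events_per_tick must be at least 1")

    # Unrelated events already in the queue, then the revocation at the back.
    queue = deque(("other", None) for _ in range(backlog_size))
    queue.append(("revoke", "alice"))

    allowed = {"alice"}
    ticks = 0
    while "alice" in allowed:
        ticks += 1
        # The consumer handles a fixed number of events per tick, in order.
        for _ in range(min(events_per_tick, len(queue))):
            kind, user = queue.popleft()
            if kind == "revoke":
                allowed.discard(user)
    return ticks
```

With an empty queue the revocation lands on the first tick, which is why the system looks synchronous in testing. With a backlog, "alice" keeps her access for backlog_size / events_per_tick extra ticks, and that only shows up when the queue is already in trouble.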
The operational critique here is valid too, but it’s technology-specific. Queues have spiky loads that stress CPU, RAM, and disk. The same wild shifts in behavior that show up in the system's semantics happen at the operational level as well. It’s not an unsolvable problem, but it requires the same level of finesse and specialized knowledge as running a database (or selecting a truly scalable option like SQS). For most organizations, queues still solve more problems than they create. Hard as they are to manage, achieving the same level of uptime without them is even more challenging. When you need to cheaply guarantee high uptime and can tolerate the asynchronous conformist relationship, use a queue.