NATS - A new nervous system for distributed cloud platforms
Designing multi-agent systems for scale creates a number of unique challenges for messaging infrastructure. A key pattern in a pooled multi-agent system is the broadcast request with a single response: a request goes out to every potential responder, and the requester uses only one reply. As the number of potential responders grows large, most messaging platforms cannot handle the load because of the way they are designed. Existing messaging platforms do have a lot of great tooling and resiliency features, but these often come at the expense of latency and throughput, and lead to designs that come to depend on those features.
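The broadcast request with single response pattern can be sketched in a few lines. This is a minimal in-process illustration using threads and a queue, not NATS client code; the names `broadcast_request` and `make_responder` are hypothetical, and the queue stands in for a private reply inbox.

```python
import queue
import threading
import time


def broadcast_request(responders, payload, timeout=1.0):
    """Send payload to every responder and return the first reply received."""
    inbox = queue.Queue()  # stands in for a private reply inbox

    def deliver(responder):
        inbox.put(responder(payload))

    # Fan the request out to every responder in the pool.
    for responder in responders:
        threading.Thread(target=deliver, args=(responder,), daemon=True).start()

    # Block until the first reply arrives; later replies are never read.
    # Raises queue.Empty if nobody answers within the timeout.
    return inbox.get(timeout=timeout)


def make_responder(name, delay):
    """Build a hypothetical agent that answers after `delay` seconds."""
    def respond(payload):
        time.sleep(delay)
        return f"{name} handled {payload}"
    return respond


responders = [make_responder("slow", 0.3), make_responder("fast", 0.01)]
first_reply = broadcast_request(responders, "job-1")
```

Every responder still does the work of receiving the request and replying, which is why the cost of this pattern grows with the size of the pool even though only one reply is used.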
As distributed systems scale to larger numbers of agents, a certain percentage of node failures is inevitable, and the platform has to handle this without a cascading sequence of faults bringing down the whole system. A single disk or message consumer failure shouldn't be able to ripple through the system.
A new low-level messaging backplane was needed, one that would protect itself to ensure availability and scale to millions of messages per second with minimal latency. The platform would also have to scale to very large numbers of clients while supporting a variety of messaging patterns. To do this, NATS does a number of novel things, including mapping message distribution as an interest graph and pruning it aggressively to minimize unnecessary traffic between servers and on the network.
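The idea of an interest graph can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not how the NATS server is actually implemented: the `Server` class, its fields, and the exact-match subjects are all hypothetical. Each server tells its peers which subjects it has local subscribers for, and a publishing server forwards a message only to peers that have registered interest.

```python
from collections import defaultdict


class Server:
    """Toy server that forwards messages only where interest exists."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.local_subs = defaultdict(list)       # subject -> local callbacks
        self.remote_interest = defaultdict(set)   # subject -> interested peers
        self.forwarded = 0                        # messages sent to other servers

    def connect(self, other):
        # Simplification: connect servers before subscribing, since this
        # sketch does not replay existing interest to new peers.
        self.peers.append(other)
        other.peers.append(self)

    def subscribe(self, subject, callback):
        first = not self.local_subs[subject]
        self.local_subs[subject].append(callback)
        if first:  # advertise interest once per subject
            for peer in self.peers:
                peer.remote_interest[subject].add(self)

    def unsubscribe(self, subject, callback):
        self.local_subs[subject].remove(callback)
        if not self.local_subs[subject]:  # last subscriber gone: prune the edge
            for peer in self.peers:
                peer.remote_interest[subject].discard(self)

    def publish(self, subject, msg):
        self._deliver_local(subject, msg)
        for peer in self.remote_interest.get(subject, ()):
            self.forwarded += 1
            peer._deliver_local(subject, msg)

    def _deliver_local(self, subject, msg):
        for callback in self.local_subs.get(subject, ()):
            callback(msg)


a, b, c = Server("a"), Server("b"), Server("c")
a.connect(b)
a.connect(c)
b.connect(c)

received = []
handler = received.append
b.subscribe("jobs", handler)
a.publish("jobs", "m1")   # forwarded to b only; c has no interest
a.publish("logs", "m2")   # no interest anywhere, so nothing leaves a
b.unsubscribe("jobs", handler)
a.publish("jobs", "m3")   # interest was pruned, so nothing leaves a
```

In this model, three publishes result in a single server-to-server message, because traffic follows interest instead of being flooded to every peer.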
NATS was initially designed to solve a very specific distributed systems problem, but because of its unique characteristics it is now used heavily by companies such as Apcera, Baidu, HTC, and Pivotal, as well as a number of small projects around the world.