As we've alluded to in previous blog posts, like our Docker series, we are in the process of bringing our systems into the brave new world of the microservice.
One of the pivotal technologies enabling this shift is the Apache Kafka message queue. It has not only become a critical part of our infrastructure, but it informs the architecture of the systems we are creating.
While the aim of this post isn't to sell you Kafka over any other queueing system, some parts are specific to it. For the uninitiated, Kafka is an open source distributed message broker. It was originally developed by LinkedIn, and is currently maintained by the Apache Software Foundation. As with any message broker, you can send messages to it, and receive these messages from it. In Kafka parlance a "producer" sends messages, and a "consumer" receives them.
What makes Kafka unique is that the humble file system is what bridges these two actions. A Kafka broker's fundamental job is to write messages to a log on disk, and to read messages from it, as fast as possible. Persistence of message queues isn't bolted on after the fact here; it's the core of the whole project.
Queues in Kafka are called "topics", which are sharded into 1 or more "partitions". Each message has an "offset" - a number representing its position within its partition. This allows consumers to keep track of where they have read up to, and ask the broker for the next message (or messages). Multiple consumers can read from the same partition at the same time, each reading from a position independent of the others. They can even jump forward or backwards through the partition at will.
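To make offsets a little more concrete, here is a minimal in-memory sketch of a single partition with two independent readers. It is a toy model only, not the Kafka client API: the `Partition` and `Reader` classes and their methods are names we've made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy stand-in for one partition: an append-only list of messages."""
    log: list = field(default_factory=list)

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset, max_messages=1):
        """Return up to max_messages starting at the given offset."""
        return self.log[offset:offset + max_messages]


@dataclass
class Reader:
    """A consumer that remembers its own position in a partition."""
    partition: Partition
    offset: int = 0

    def poll(self, max_messages=1):
        """Read the next batch and advance this reader's offset past it."""
        batch = self.partition.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Jump forwards or backwards to an arbitrary offset."""
        self.offset = offset


partition = Partition()
for event in ["a", "b", "c", "d"]:
    partition.append(event)

fast, slow = Reader(partition), Reader(partition)
fast_batch = fast.poll(max_messages=3)  # reads "a", "b", "c"
slow_batch = slow.poll()                # reads "a" only
fast.seek(0)                            # jump back to the start
replayed = fast.poll()                  # reads "a" again
```

Reading never removes anything from the log, which is why the two readers can sit at different positions and why rewinding is just a matter of changing a number.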
Multiple Kafka brokers are joined together in a cluster. Partitions are spread across the cluster, providing scalability. They are also replicated to multiple nodes in the cluster, giving high availability. Combine replication with partition persistence and you get high durability.
For more detail, have a read here.
Let’s Build a Microservice Architecture
To help us explore the uses and influence of Kafka, imagine a system that receives data from the outside via a REST API, transforms it in some way, and stores it in a database.
We want a microservice architecture, so let's split this system into two services - one to provide the external REST interface (Alpha service), and another to transform the data (Beta service). For simplicity’s sake, the Beta service will also be responsible for storing the transformed data.
A system of coupled microservices is little better than a monolithic service (often it's worse!), so we'll help to limit coupling by defining an interface for the Alpha service to send data to the Beta service. Let’s see how using Kafka to implement this interface would affect the design and operation of this system, as compared to using REST.
System Availability and Performance
When data is submitted to the Alpha service, it needs to respond to the client only when it is sure the data is safely stored somewhere, or has failed - the client needs to know if there has been an error so it can resend (or recover in some other way).
With a REST interface, the Alpha service would need to wait for a response from the Beta service before responding to its client, as data doesn't make it into a database until the Beta service has stored it.
There are a couple of problems with this. Firstly, the Alpha service now requires the Beta service to be up and responding to serve requests - its uptime is dependent on the Beta service. Secondly, the Alpha service can't respond until after the Beta service has responded - its performance is also dependent on the Beta service!
In both cases, the Alpha service is coupled to the Beta service. How often does the Beta service fail? How often is it down for maintenance? What's its peak performance? Does it actually only respond once the data is safely stored, or does it cheat and respond earlier? Does it depend on any other services, in turn, extending the chain of coupling further? A system like this is only as good as its weakest service.
If we use a Kafka queue as our interface instead we get quite a different story, as Kafka has a trick up its sleeve: a Kafka queue is persistent. The Alpha service can respond as soon as the data has made it safely onto a queue; we can be confident the data will eventually make its way into the database as the Beta service works through the queue. The Alpha service is now only dependent on the uptime and performance of Kafka - both are likely to be far higher than those of any other microservice in the system. It is so loosely coupled to the Beta service it doesn't even need to know it exists!
Of course, a queue is never a panacea for performance problems - it won't magically improve the overall performance of the system (hands up if you've ever had to sit in a meeting with someone who believed that). However, it does allow you to deal with variable load coming from external systems, or even from microservices in your own system (e.g. ones that have to do some form of batch processing). The queue absorbs load spikes, allowing downstream services to continue processing at a smooth rate. The performance of a given service only needs to be greater than the average load on the system, not the peak load.
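A rough way to see the "average, not peak" point is to simulate queue depth over time. The sketch below is purely illustrative: the load figures and the `simulate_backlog` helper are invented for this example, and nothing in it talks to Kafka.

```python
def simulate_backlog(arrivals_per_tick, service_rate):
    """Return the queue depth at the end of each tick.

    arrivals_per_tick: messages produced in each time step.
    service_rate: messages the consumer can process per time step.
    """
    depth = 0
    history = []
    for arrivals in arrivals_per_tick:
        depth += arrivals
        depth -= min(depth, service_rate)  # consumer drains what it can
        history.append(depth)
    return history


# Hypothetical load: mostly quiet, with one spike of 50 messages.
load = [2, 2, 50, 2, 2, 2, 2, 2, 2, 2]
average = sum(load) / len(load)
backlog = simulate_backlog(load, service_rate=10)
# backlog == [0, 0, 40, 32, 24, 16, 8, 0, 0, 0]
```

A consumer that handles 10 messages per tick is well short of the peak of 50, but comfortably above the average of 6.8, so the backlog from the spike drains within a few ticks. The price is latency: messages that arrive during the spike wait longer before they are processed.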
When using a REST interface you can break the dependency chain and achieve similar effects by storing data in each of the services. However, at that point you are effectively implementing a message broker yourself in each service. Using an existing service frees you from having to design, build and maintain one (or many) yourself.
Resisting Service Coupling
While we're designing our system we decide that some clients might find it handy if the Alpha service returned the transformed version of the data in its response.
This would be quite simple using the REST interface - the Beta service could return the transformed data, and the Alpha service would return it to the client in turn. However, we've just added a new dependency between the two services - the Alpha service now relies on the transformation functionality of the Beta service. This is a similar situation to the above section where the Alpha service depended on the Beta service to store data. The same kind of issues arise: if the Beta service can't perform its function in a timely and correct manner, neither can the Alpha service.
Unlike with data storage, Kafka doesn’t provide an easy solution to this issue. In fact, implementing this with a Kafka interface would be more complex (though by no means impossible). As a queue is a one-way communication channel, getting transformed data back from the Beta service requires a second queue in the opposite direction. Matching up these asynchronous responses to waiting clients would also be a bit of work.
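To give a feel for the extra work involved, here is one way the matching could be sketched: each request carries a correlation ID, and the Alpha service keeps a table of clients waiting on each ID. Everything here is an assumption made for illustration. The two queues are plain in-memory deques rather than Kafka topics, the function names are hypothetical, and upper-casing stands in for whatever the real transformation is.

```python
import uuid
from collections import deque

requests = deque()  # Alpha -> Beta
replies = deque()   # Beta -> Alpha (the second queue, in the opposite direction)
pending = {}        # correlation id -> client waiting for that reply


def alpha_submit(client, data):
    """Queue a request and remember which client is waiting on it."""
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = client
    requests.append({"id": correlation_id, "data": data})
    return correlation_id


def beta_process_one():
    """Transform one request and send the result back with the same id."""
    request = requests.popleft()
    transformed = request["data"].upper()  # placeholder transformation
    replies.append({"id": request["id"], "result": transformed})


def alpha_collect_replies():
    """Match each reply to its waiting client via the correlation id."""
    delivered = {}
    while replies:
        reply = replies.popleft()
        client = pending.pop(reply["id"])
        delivered[client] = reply["result"]
    return delivered


alpha_submit("client-1", "hello")
alpha_submit("client-2", "world")
beta_process_one()
beta_process_one()
delivered = alpha_collect_replies()
```

Even this toy version leaves out the hard parts: timing out clients whose reply never arrives, and routing replies to the right instance when there is more than one Alpha service.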
However, the fact that this is not easily done with queues gives us a hint that something is not well in our nascent system, and we should re-assess. Looking at how the extra feature we imposed on the system alters the data flow shows that no matter how it is implemented, the feature introduces coupling between the services. We can't architect around the fact that if the Alpha service has to return the transformed data, it must be dependent on the service that does the transformation.
This allows us to stop and examine if this feature is actually required. Is there another way to provide access to the transformed data that doesn't introduce this issue? Do clients require the transformed data at all? If the feature is non-negotiable it suggests our microservice split isn't optimal, and we should consider moving the transformation step to the Alpha service. If not, we've just prevented ourselves from adding an unnecessary or poorly designed feature that could have hamstrung the future development and performance of our system.
Features are often very hard to remove once added. Identifying problematic features (or feature designs) before they are implemented is essential to avoiding costly baggage in a system. The limitations imposed by Kafka queues help guide us to well-designed systems. While they can be frustrating when you’re trying to rush out an urgent feature, they are often appreciated in the long run.
Increasing Availability and Scaling
Our system should be highly available, and scalable as load increases. As part of achieving this we want to be able to run multiple instances of our Beta service. If one instance fails, others will (hopefully) still be operating. More instances can be added to deal with increased load.
With a REST interface we can put a load balancer in front of the Beta service instances, and point the Alpha service at the load balancer instead of directly at a Beta service instance. The load balancer will need to be able to detect that an instance has failed so that it can direct load elsewhere.
A Kafka queue supports a variable number of consumers (i.e. Beta services) by default, no extra infrastructure required. Multiple consumers can be joined together to form a "consumer group", simply by specifying the same group name when they connect. Kafka will spread the partitions of any topics they are listening to across the group's consumers. So long as the topic we are using for our interface has enough partitions we can keep adding Beta service instances, and the load will be spread across them. If some fail, their partitions will be picked up by the remaining instances.
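The effect of spreading partitions across a group can be illustrated with a simple round-robin assignment. This is a sketch of the idea only: Kafka's own assignment strategies are configurable and more sophisticated than this, and the `assign_partitions` function and instance names below are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across the consumers of one group, round-robin."""
    members = sorted(consumers)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


# Six partitions shared by three Beta service instances.
before = assign_partitions(range(6), ["beta-1", "beta-2", "beta-3"])
# {'beta-1': [0, 3], 'beta-2': [1, 4], 'beta-3': [2, 5]}

# beta-2 fails: its partitions are picked up by the remaining instances.
after = assign_partitions(range(6), ["beta-1", "beta-3"])
# {'beta-1': [0, 2, 4], 'beta-3': [1, 3, 5]}

# Too few partitions: the third instance has nothing to read.
idle = assign_partitions(range(2), ["beta-1", "beta-2", "beta-3"])
# {'beta-1': [0], 'beta-2': [1], 'beta-3': []}
```

The last case is why the partition count matters: within a group each partition is read by a single consumer, so instances beyond the number of partitions sit idle.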
Adding More Services
We need to add some new functionality (doesn’t really matter what) to our system, which we’ll put in a new Gamma service. It relies on data from the Alpha service in the same way as the Beta service, and this data is provided via the same interface. Data must be processed by the Gamma service to have been fully processed by the system.
With a REST interface, the Alpha service is now coupled to an additional service, compounding the previous problems. The more downstream services you add, the more dependencies there are.
Further, a set of data could be processed successfully by one downstream service, but fail to be processed on another. These situations can be difficult to communicate to clients or recover from automatically, particularly if they can leave a system in an inconsistent state.
Using a Kafka interface allows the new Gamma service to simply read from the same queue as the Beta service. So long as it uses a different consumer group to the Beta service, they won't interfere with each other. As the Alpha service doesn't need to know what services use the queue it writes to, we can keep adding services without affecting it at all.
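The reason the two services don't interfere is that the read position is tracked per consumer group, not per topic. A toy single-partition model shows the idea; the `Topic` class and the group names are made up for this sketch and are not part of any Kafka client.

```python
class Topic:
    """Toy single-partition topic that tracks one offset per consumer group."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # group name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, group, max_messages=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = Topic()
for record in ["r1", "r2", "r3"]:
    topic.publish(record)  # the Alpha service writes without knowing its readers

beta_batch = topic.poll("beta-service")                    # Beta reads everything
gamma_batch = topic.poll("gamma-service", max_messages=1)  # Gamma is slower
gamma_next = topic.poll("gamma-service", max_messages=1)   # and continues later
```

Each group sees every message exactly once in this model, regardless of how far ahead or behind the other group is.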
Data is safely stored in Kafka, so individual services can retry in the event of transient failures, such as database deadlocks or network issues. Non-transient failures, such as bad data, can still pose similar challenges to those seen with the REST interface, however. Checking that data conforms to system requirements as early as possible, e.g. in the Alpha service, is essential for minimising these issues.
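The split between the two kinds of failure might look something like the following sketch. The record shape, the validation rules, and the `TransientError`, `validate` and `process_with_retry` names are all assumptions for illustration; a real service would also wait between attempts rather than retrying immediately.

```python
class TransientError(Exception):
    """A failure worth retrying, e.g. a database deadlock or network timeout."""


def validate(record):
    """Hypothetical early check, run in the Alpha service before queueing."""
    problems = []
    if not isinstance(record.get("id"), int):
        problems.append("id must be an integer")
    if not record.get("payload"):
        problems.append("payload must not be empty")
    return problems


def process_with_retry(record, handler, max_attempts=3):
    """Retry the handler on transient errors; let anything else propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except TransientError:
            if attempt == max_attempts:
                raise  # give up; the message is still on the queue


calls = {"count": 0}


def flaky_store(record):
    """Stand-in for a database write that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("database deadlock")
    return "stored"


result = process_with_retry({"id": 1, "payload": "x"}, flaky_store)
rejected = validate({"id": "oops", "payload": ""})
```

Retrying helps with the flaky store, but no number of retries would fix the second record. Rejecting it in the Alpha service means the client hears about the problem straight away, and no downstream service ever has to deal with it.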
All of the above examples are taken directly from our own journey towards microservices at Movio. Kafka has proven itself an invaluable tool in the process, helping to produce quality architectures, and implement them simply and quickly. We continue to discover new ways to utilize it, and expect our usage to only increase as we develop our systems further.
The teams at LinkedIn and Apache have produced a great piece of kit.
Read part 2 in this series: Microservices: Judgement Day.