In our blog post Microservices: The rise of Kafka, we talked about how at Movio we prefer to use Kafka for communication between microservices. This post covers an approach to data validation when you use queues (Kafka) to communicate between your microservices.
We take in a lot of data from external sources, receiving or pulling it in through a few different channels to accommodate each source. This data is often interrelated; for example, an entity representing a person might be related to their favorite cinema. Since the incoming data is outside of our control, we can’t be sure these entity relations are valid. Valid here means that the related cinema has been sent to/received by our systems. If we don’t know about the related cinema, we can’t be sure that it actually exists; it could be a data consistency issue on the client side. Every microservice that wants to make use of this data has to account for these possible missing dependencies.
What if we could guarantee the external data is consistent, that all dependencies are satisfied? This is the job of what we call staging microservices. A staging microservice converts a stream of “raw” entities into a stream with this dependency guarantee. Raw in this context means there is no guarantee that the entity’s dependencies are valid. Downstream microservices can then work with the validated stream of data without having to perform their own checks.
Fig 1. Each microservice performs its own validation
Fig 2. Validation is performed once, each microservice is now smaller, focusing on its own task
For practical reasons the output stream from a staging microservice can’t be exactly the same as the input raw stream. What should happen when a “raw” entity is not valid, i.e. one of its dependencies does not exist? Do we block the whole stream of data until that entity can be validated? What if we never receive one of the dependencies?
To handle this we relax our definition of the output stream and make the following guarantees:
- For a given entity the order of updates to that entity in the raw stream will be preserved in the output stream
- Any entity in the raw stream that is valid will be put on the output stream eventually
- Any entity in the output stream has had all of its dependencies satisfied
Note the “eventually” in point 2; this allows us to get around the problem of receiving an entity we can’t validate. We save the raw entity, awaiting future validation; at this point we consider the entity “staged”. If/when the entity becomes valid, we can send its updates, in order, onto the output queue.
First, we must understand the inputs to a staging microservice. The obvious one is the raw stream of entities it is supposed to validate, but that alone is not enough. For validation to be performed, the service must be able to check for the existence of any dependencies. Take our favorite-cinema example: the staging microservice that is validating a member must be able to check that the cinema the member is linked to exists (within Movio systems).
We could synchronously contact another microservice as and when we need to check whether a dependency exists. If you read our Kafka blog post you’ll know that we try to avoid this kind of direct coupling between services. Fortunately, there is a much cleaner solution available. Assuming we have a stream of validated data for each dependency, our staging microservice can take these streams as an additional input and build a local cache of the dependencies. Now, instead of calling another service, the staging service can check locally whether dependencies have been fulfilled. Conveniently, our staging services themselves produce exactly these validated streams, and streams of entities that have no dependencies can be considered validated without needing to go through a staging service.
Note: For a simple foreign-key style relation we only need to store the dependency’s ID in our cache, making the storage overhead quite small. In exchange, our microservices are better decoupled and checks are much faster.
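To make the note above concrete, here is a minimal sketch of such a cache in Python (our actual implementation uses Scala and Elasticsearch; the class and method names here are purely illustrative). Because we only need existence checks, a set of IDs is all the state required:

```python
class DependencyCache:
    """Tracks which dependency IDs have been validated.

    Illustrative sketch: for foreign-key style relations, existence
    checks only need the IDs, so a plain set is sufficient.
    """

    def __init__(self):
        self._ids = set()

    def add(self, dep_id):
        """Record a validated dependency. Returns True if it was new."""
        if dep_id in self._ids:
            return False
        self._ids.add(dep_id)
        return True

    def exists(self, dep_id):
        """Check locally whether a dependency has been fulfilled."""
        return dep_id in self._ids
```

The `add` method reports whether the dependency was new, which is exactly the signal needed later to decide whether staged entities should be rechecked.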
So we have two sets of inputs to our staging microservice: the raw stream and the validated dependency streams. There is an algorithm for processing each:
The raw stream:
- Read in external data, receiving raw entities
- Check local cache to see if all the dependencies exist
- If they do, send the entity on to the output queue
- If they don’t, stage the entity
The dependency stream:
- Read in the dependency
- Save to local cache
- If the dependency wasn’t in the cache, trigger a recheck of any staged entities that depend on it
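The two handlers above can be sketched together in a few lines of Python. This is not Movio’s actual implementation (which uses Kafka, Akka Actors and Elasticsearch); it is a self-contained illustration of the staging logic and the three guarantees, with a plain list standing in for the output queue:

```python
from collections import defaultdict

class StagingService:
    """Illustrative sketch of the raw-stream and dependency-stream
    handlers. It upholds the three guarantees: per-entity update order
    is preserved, valid entities are eventually forwarded, and anything
    on the output has all of its dependencies satisfied."""

    def __init__(self, output):
        self.output = output               # stand-in for the output queue
        self.known = set()                 # local cache of validated dependency IDs
        self.pending = defaultdict(list)   # entity ID -> staged updates, in order
        self.waiting = defaultdict(set)    # dependency ID -> entity IDs blocked on it

    def on_raw(self, update):
        """Handler for the raw stream."""
        eid = update["id"]
        missing = [d for d in update["deps"] if d not in self.known]
        # Stage if a dependency is missing, or if earlier updates to the
        # same entity are already staged (to preserve per-entity order).
        if missing or self.pending.get(eid):
            self.pending[eid].append(update)
            for dep in missing:
                self.waiting[dep].add(eid)
        else:
            self.output.append(update)

    def on_dependency(self, dep_id):
        """Handler for a validated dependency stream."""
        if dep_id in self.known:
            return                         # already cached; no recheck needed
        self.known.add(dep_id)
        for eid in self.waiting.pop(dep_id, set()):
            self._flush(eid)

    def _flush(self, eid):
        """Forward staged updates for one entity, oldest first, stopping
        at the first update that is still missing a dependency."""
        queue = self.pending[eid]
        while queue:
            missing = [d for d in queue[0]["deps"] if d not in self.known]
            if missing:
                for dep in missing:
                    self.waiting[dep].add(eid)
                return
            self.output.append(queue.pop(0))
        del self.pending[eid]
```

Note how `on_raw` stages an update even when its own dependencies are satisfied, if earlier updates to the same entity are still staged; without that check, a later valid update could overtake an earlier staged one and break the ordering guarantee.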
Fig 3. The staging service architecture in more detail
The staging microservices move all dependency-checking logic into one place. This has several advantages, similar to those of moving repeated logic into a function rather than copy/pasting it across your code.
The most obvious is that the dependency-checking logic does not need to be repeated in each microservice that works with external data. This guarantees checking is done consistently across our microservices, while at the same time simplifying the downstream microservices so they can focus on the logic specific to them.
Our staging microservices have a single responsibility: to guarantee that dependencies are valid. As the functionality they perform is simple, the code is also simple, making bugs easy to find and fix.
Following on from this, it is also much easier to monitor and alert on the state of the staging microservices. We can track the time entities spend staged, along with the reasons they have failed validation, i.e. which dependencies are missing. This makes it easier to track down any inconsistencies in the external data.
You may have noticed a theme here; I consider the main advantage of this approach to be the simplification of our services, making them easier to reason about. The final bonus, which I feel is greatly helped by this, is the scalability and performance of the microservices. Both performance optimization and concurrency are hard problems, so the easier it is to reason about the underlying logic of the code, the less likely you are to make mistakes and the more you can optimize and parallelize your solution. We handle quite large volumes of data, so optimizing and parallelizing our processing is of great importance. We can, and in fact have, made our staging microservices capable of scaling at the granularity of a single entity’s ID.
Our current implementation of the staging microservices uses Kafka for our streams, Akka Actors for the processing and Elasticsearch for our local store/cache. We don’t lean too heavily on Akka or Elasticsearch and could easily swap them out. Our use of Kafka, however, is central to the services’ ability to scale and recover from failure: scaling is nearly transparent through Kafka topic partitioning, and recovery is just as simple through offset committing.
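The key to scaling at the granularity of a single entity’s ID is keying each Kafka message by that ID: the producer’s partitioner hashes the key, so all updates for one entity land on one partition and per-entity ordering survives parallel consumption. A hedged sketch of that mapping, with a simple stand-in hash rather than Kafka’s actual partitioner:

```python
import hashlib

def partition_for(entity_id: str, num_partitions: int) -> int:
    """Deterministically map an entity ID to a partition.

    Illustrative only: Kafka's default partitioner uses its own key
    hash, but the property demonstrated here is the same - equal keys
    always map to the same partition.
    """
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is deterministic, adding consumers to the group spreads partitions (and therefore disjoint sets of entities) across instances without any coordination in our own code.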
Hopefully this has been an interesting look into how we at Movio are using Kafka and microservices architecture to ingest, validate and process external data. This concept of using a microservice to add guarantees to a stream is not just limited to dependency checking, it can also be applied to transforming the data or internally validating it, e.g. a field can’t be longer than a certain length.
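The internal-validation variant mentioned above follows the same shape: a staging-style service enforcing per-field rules instead of dependency checks. A minimal sketch, where the field name and length limit are purely illustrative:

```python
MAX_NAME_LENGTH = 64  # illustrative limit, not a real Movio constraint

def validate_fields(entity: dict) -> list:
    """Return a list of validation errors; an empty list means the
    entity passes and can go onto the validated output stream."""
    errors = []
    name = entity.get("name", "")
    if len(name) > MAX_NAME_LENGTH:
        errors.append(f"name longer than {MAX_NAME_LENGTH} characters")
    return errors
```

Entities that fail such checks can be staged or rejected with a recorded reason, feeding the same monitoring and alerting described earlier.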
I’d also like to end on a more general note. Using a microservices architecture often seems to add a certain level of complexity to your solutions. There is a reasonable amount of infrastructure overhead added by microservices; in our case, Kafka and Elasticsearch. Other complexities mainly arise from being forced to handle concurrency and to better decouple your solution. If you need the scalability microservices can provide, you will have to deal with these concurrency issues (and some of the infrastructure overhead) anyway. But by applying the same techniques as you would to normal code (e.g. the single responsibility principle, DRY and code reuse) you can reduce the complexity of your overall system in the same ways, gaining similar benefits.