Hi, I’m Nicolas Maquet, one of the engineering leads at Movio. As we’ve written previously, we are big fans of Apache Kafka at Movio. We use it for many applications, from metrics aggregation and alerting to SMS and email sending, change data capture (CDC), and real-time extract-transform-load (ETL) pipelines. In a few short years, Kafka has become the central communication platform for most services in our company.
Stream Processing at Movio and Apache Samza
A lot of our services have a relatively simple usage pattern for Kafka: single-topic consumers or producers that use Kafka as a communication link. Our ETL and CDC services, on the other hand, make more involved use of Kafka: they typically read many topics and join them together, which requires maintaining state, then write the joined results to output topics. For this use case, we have made extensive use of Apache Samza, a stream-processing framework originally developed at LinkedIn and now a top-level Apache project.
Looking back, Samza has worked really well for us. It has been over a year since our first Apache Samza experiments, and we are now running about 25 Samza applications in production and have not had any major issues with the framework. Samza has many traits that make it an excellent choice for a stream processing framework:
- It’s a top-level Apache project.
- Its documentation and API are absolutely stellar. One of the best we have seen.
- It was originally developed by the same people that created Apache Kafka.
- It is written in Scala, and most Movio devs know Scala really well.
- Given sufficient hardware, its performance is excellent.
For those reasons and others, and after lengthy comparisons with competing frameworks such as Apache Storm, Apache Flink and Spark Streaming, Samza felt like the best choice for us in early 2016 and we have stuck with it.
At Movio, our engineering culture is changing rapidly. We used to be known as an FP/Scala shop; now many of us hang out at Golang meetups. We used to develop monolithic applications on the JVM; now we run hordes of Golang microservices on Kubernetes. One of our company mottos is ‘never satisfied’ and it sums up our engineering culture pretty well. In that spirit, we recently revisited our choice of Apache Samza and listed all the pain points we’ve encountered (a sobering exercise, particularly if you’re a fan of the technology in question):
- Samza is hungry for CPU and RAM. The simplest Samza app requires two JVM processes, which will typically take up at least 100MB of RAM each. Samza is particularly RAM-intensive when restoring RocksDB stores, frequently requiring allocations of 6GB to 8GB of RAM in our experience.
- Samza’s philosophy of doing stateful stream processing by having each processing node require its own local state is excellent for performance but it comes at a large cost. Samza uses RocksDB key-value stores that require super-fast SSDs and are very hard to tune properly (RocksDB developers themselves lament this fact). Also, Samza doesn’t clean up after itself when moving processes around in the cluster. Carefully monitoring disk usage and having to manually clean up hundreds of gigabytes of RocksDB files on a regular basis is no fun.
- Samza applications require a Hadoop cluster to run. While Hadoop is a fantastic piece of technology, all our other clustered applications run on Kubernetes, and having to operate a Hadoop cluster just for Samza is a bit of a sore point.
- Samza’s development pace is relatively slow. Samza has seen only two releases in the 12 months since we started using it. We’ve contributed a few bug fixes to Samza, but we have had to run our own builds to benefit from them, which isn’t ideal.
- Samza apps are written in Java or Scala, which we are slowly moving away from with a preference for Go.
Building our own
Today we’re announcing the first beta release of Kasper, a Go library for Kafka stream processing, available under the open-source MIT licence. Kasper is heavily inspired by our experience with Apache Samza, but we made some different design choices along the way. Let’s start by outlining what Samza and Kasper have in common:
Single-threaded concurrency model
Kasper runs a single-threaded event loop, much like Samza, and always calls back the user’s application code from a single thread. This means that you don’t need to worry about thread safety or race conditions when interacting with Kasper.
Declarative Kafka interaction
Much like Samza, Kasper abstracts away most of the interaction with Kafka, requiring you only to configure a few options (broker addresses, retry count, required number of acks, etc.). Kasper deals with reading and writing messages to and from topics (we use Shopify’s excellent sarama Kafka library for that) and handles the marking of offsets.
Scaling out to multiple processes
In a similar fashion to Apache Samza, Kasper applications scale out to multiple processes by assigning a subset of each input topic’s partitions to each process.
For all its similarity with Apache Samza, we did make a number of different design choices when writing Kasper. Here are the most important ones:
Samza includes a “stream” abstraction, which can be a Kafka topic but can also be other things. We personally found this abstraction rather redundant and more confusing than helpful, so we’ve made Kasper Kafka-specific. The Kasper API explicitly references Kafka concepts such as topic, partition and offset and doesn’t hide them behind an abstraction layer.
Samza uses a distributed state-management strategy based on RocksDB stores that keep a changelog on Kafka for fault tolerance. While performant, this design is expensive as it requires super-fast SSDs on each compute node and a lot of disk space on the Kafka cluster. Kasper takes the opposite approach and is designed to work with a centralized key-value store instead. Kasper currently supports Elasticsearch and Redis, and additional support for Cassandra is planned.
Micro-batch processing model
Samza processes Kafka messages one at a time, which is ideal for latency-sensitive applications and provides a simple and elegant processing model. However, since Kasper uses a centralized key-value store, processing messages one at a time would be prohibitively slow. Kasper therefore uses a micro-batch processing model. This typically adds a few seconds of latency to the processing pipeline but allows for high-throughput processing. From a developer’s perspective, writing a Kasper application is thus a bit more involved than writing the equivalent Samza application, since you need to process each batch of messages efficiently by using bulk operations on the remote key-value store.
Kasper doesn’t provide (or depend on) an execution environment
Kasper is agnostic about its execution environment. We run our services in Docker on Kubernetes, but you’re free to run Kasper applications standalone on a single machine or to use other container orchestration tools. Note that this means that, unlike Samza, Kasper applications aren’t fault-tolerant out of the box and rely on external tools for that.
Just the beginning
Our first results with Kasper have been really encouraging. We use Elasticsearch as a centralized key-value store, and yet our Kasper applications are only about 30% slower in terms of throughput than what we would expect with Samza (using the same number of processes). At Movio, a typical stream-joining application that used to take multiple gigabytes of RAM with Samza now runs comfortably in under 40MB with Kasper. The additional Elasticsearch cluster does take a few gigabytes of RAM, but it can be shared across multiple applications. As an additional benefit, Kasper applications can be shut down and restarted with virtually no downtime, since no expensive state copying or restoring is needed. This is especially useful when scaling an application up or down.
Kasper has just entered beta and its API isn’t fully stable yet. Expect development to move fast, with a first stable release in the coming months. We welcome contributions and advice on the API stabilization effort (the easiest way is to open a GitHub issue).