We’ve recently posted a number of blogs about Audience Insights, propensity and the future of movie marketing. As the Team Lead behind the Insights modules, I thought this would be the right time to dive into some of the technology underpinning these advancements.
The Insights modules have been designed to bring a new level of customer intelligence to the data analytics we provide to exhibitors. We have an active team of Data Scientists at Movio consistently researching new algorithms and the value they could bring to our exhibition partners, and the Insights modules have really allowed us to put these advanced data science techniques into the hands of our customers.
The latest development is focused on movie propensity. Propensity describes the likelihood of a moviegoer performing a certain action. In the Audience Insights module, the Propensity Algorithm™ utilizes individual and collective behavioral inputs to determine the relative likelihood that a moviegoer will see a particular movie.
We use many different technologies to power the Insights modules: MariaDB for the input data, Apache Spark for processing, S3 and Elasticsearch for storage, and finally Go microservices for the APIs with a React-based front end. Here I want to highlight the analytics, so we will be focusing on Spark.
Diving Into The Key Player: What Is Spark?
The Spark website does a very good job of describing Spark as a “fast and general-purpose cluster computing system”.
At heart, this is all Spark is: a system for distributing computation across multiple nodes, and for expressing that computation in a high-level, developer-friendly manner. Any time you have computation that you want to scale well (i.e. not just vertically), Spark should definitely be a contender. Writing asynchronous code is one of the harder problems in software engineering, even more so when you cross machine/process boundaries. Spark provides abstractions for manipulating and processing data, so that the nitty-gritty details of how to actually parallelize and distribute the execution and the data don't get in your way so much.[1]
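To make that concrete, here is a toy illustration in plain Python (not Spark code) of the partition/map/reduce pattern that Spark abstracts away. The function names and data are made up for this sketch; a real cluster would run the map step in separate processes on separate machines, where here threads stand in for workers.

```python
# Toy word count: partition the data, map over each partition in
# parallel, then reduce the partial results. Spark automates all of
# this (plus shuffling, retries and data locality) across machines.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(lines):
    """Map step: word counts for one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(lines, num_partitions=4):
    # Partition the input, as Spark does when it distributes an RDD.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = pool.map(count_partition, partitions)
    # Reduce step: merge the per-partition results.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The point of Spark is that you write something close to the map and reduce functions above, and the framework supplies everything around them.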
Awesome Parts Of Spark
As covered above, one of the main benefits of Spark is that it takes care of most execution details for you, although once you get into the realm of optimization you have to start paying more attention to them. Spark handles moving data from node to node, retrying units of computation on failure and managing data locality, all while providing logging and monitoring of the system.
If you use Datasets/DataFrames, then Spark will build an optimized execution plan out of the basic (RDD) operations. This allows you to focus on your intent, what you want to do with your data, rather than the specifics of how to manipulate the data to achieve that intent.
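One simple flavor of such optimization is operation fusion. The sketch below is plain Python, not Spark's actual optimizer, but it shows the idea: several logical map steps, declared separately, get composed into a single function and applied in one pass with no intermediate collections, which is roughly what Spark does when it pipelines narrow transformations within a stage.

```python
# Fuse several declared map steps into one pass over the data,
# avoiding a materialized intermediate collection after each step.
from functools import reduce

def fuse(*functions):
    """Compose map functions so the data is traversed only once."""
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)

# Three "transformations" declared separately...
steps = [str.strip, str.lower, lambda s: s.replace(" ", "_")]

# ...executed as one fused pass over the data.
fused = fuse(*steps)
cleaned = [fused(s) for s in ["  Action Movie ", " Sci Fi  "]]
```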
Spark is fast. Even from its early days, Spark has been outperforming a lot of the other major distributed computational frameworks for most use cases. Since then, even more work has gone into optimizing it and we are very happy with the performance.
Debugging / Shell
Developing in Spark is also made easier thanks to a few features. With a little set-up you can attach a debugger to Spark just as you would to any JVM (take note of the lazy nature of transformations, though!). In general, though, we've found we rarely need full-on debugging: Spark comes with built-in support for a REPL, the spark-shell, which lets you easily test and explore your data (you can run the shell locally or on a cluster). There are also notebook implementations (e.g. Databricks or Jupyter).
With these three approaches available, it is easy to step through/introspect your program and data to prototype, explore and debug.
For monitoring your cluster and jobs, you have a set of rich UIs available to you. We run our clusters using YARN, which comes with a web UI for monitoring the cluster state and the applications on it. If you dive deeper into a Spark application you get the Spark UI, which gives you detailed information about running and completed Spark applications. You can see the jobs that are part of the application, even the line numbers in code that triggered them (again: watch out for laziness). You can view the state of the storage (and cache), and most importantly you can drill down into a job and see its stages and the DAG of the stage execution, which is very useful for optimizing.
All this introspection and monitoring allows for much easier development, debugging and optimization of Spark programs. I believe it is a core strength of the framework.
Another, seemingly more ancillary strength of the framework is the rich ecosystem surrounding it. Spark fits into the Hadoop ecosystem: for example, it can run standalone but also under either the YARN or Mesos cluster managers. For us the major benefit is that we can easily interoperate (both reading and writing) with many different data sources such as Elasticsearch, Kafka, JDBC and Parquet files.
We want to bring the power of machine learning techniques and algorithms to our customers. Movio’s Insights modules empower users with intelligent tools for finding the best audience for each movie now showing or coming soon. We were tasked with implementing the advanced algorithms from our data science team. This is helped greatly by Spark’s machine learning library (MLlib). We apply a few of our own custom methods but also make good use of some of the pre-existing functionality available in MLlib. There are three advantages here:
- Most obviously, using existing code saves time
- The code has been well tested
- The algorithms have been optimized beyond a naive implementation, both with engineering optimizations that exploit how Spark works and with algorithmic improvements
I'd also like to comment on how surprised we were by the ease and speed of development in Spark. It's obviously a large system and we thought it would be hard to get something production-ready quickly. However, we blew our estimates out of the water and completed the initial implementation well ahead of schedule. Even though Spark is a large, complex system, it takes care of so many things for you that you might be surprised by your development speed.
The distributed computing functionality of Spark might sound familiar, and it probably is to most programmers. Spark has a lot of similarities with databases, especially MPP databases. For the most part, each Spark job is declarative: a description of a set of transformations that you want to apply to some data. Those transformations are lazy, and the actual way they are executed is abstracted away from the user. This is similar to database queries, e.g. SQL. SQL queries are declarative; you don't say how to perform the query, only the abstract transformations that you want to apply to the data. The query is then run through a query optimizer to produce an execution plan, which is then run on the node(s) of the database. Since Spark 1.6, if you use Datasets/DataFrames, and even more so if you use Spark SQL, Spark does exactly the same thing: it takes your declarative description of the computation, optimizes it, and then works out how to execute it across your hardware.
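The lazy, declarative model is easy to demonstrate outside Spark. The minimal `LazyPipeline` class below is hypothetical, plain Python rather than Spark's API, but it mirrors the behavior: `map` and `filter` only record intent into a plan, and nothing executes until an "action" such as `collect()` is called, at which point the whole plan runs at once.

```python
# A tiny lazy pipeline: transformations record a plan; an action
# (collect) executes the whole plan in one go, as Spark does.
class LazyPipeline:
    def __init__(self, data, steps=None):
        self._data = data
        self._steps = steps if steps is not None else []  # the recorded "plan"

    def map(self, f):
        # No work happens here: we just extend the plan.
        return LazyPipeline(self._data, self._steps + [("map", f)])

    def filter(self, pred):
        return LazyPipeline(self._data, self._steps + [("filter", pred)])

    def collect(self):
        """Action: only now is the recorded plan actually executed."""
        out = iter(self._data)
        for kind, f in self._steps:
            out = map(f, out) if kind == "map" else filter(f, out)
        return list(out)

# Declaring transformations does no work yet...
pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# ...the action triggers execution of the whole plan.
result = pipeline.collect()
```

Because execution is deferred until `collect()`, an optimizer gets to see the whole plan before running it, which is exactly what makes the query-optimizer analogy work.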
Unlike a lot of MPP databases, though, Spark runs on commodity hardware and is open source. It has a rich ecosystem around it and multiple modern APIs to interface with it.
It's great to be working with a Data Science team who are performing active research, then using some of the latest and best tech around to actually put that data science into practice, and finally seeing how all of this comes together for a real impact in the cinema industry worldwide. Spark has some deep concepts, but they are inherent in the problem space and well worth taking the time to properly learn. Once you have, the speed of development and the actual speed in production make for a really good experience as an engineer who, at the end of the day, wants to produce something valuable.
1. For example, you might hit memory issues unless you manually repartition your data; https://spark.apache.org/docs/latest/tuning.html has more details on common areas of tuning.
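To illustrate the repartitioning point above, here is a plain-Python sketch (not Spark code, and the data is invented): partitioning on a skewed key piles most records into one partition, which in Spark can overload a single executor, while redistributing round-robin (roughly what an explicit `repartition()` achieves) evens out the load.

```python
# Partition skew in miniature: key-based partitioning on a skewed key
# vs round-robin redistribution across the same number of partitions.
def hash_partition(records, key, n):
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r[key]) % n].append(r)  # all identical keys land together
    return parts

def round_robin(records, n):
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)             # spread records evenly
    return parts

# 90% of records share one key: a realistically skewed distribution.
records = [{"movie": "blockbuster"}] * 90 + [{"movie": f"indie_{i}"} for i in range(10)]

skewed = hash_partition(records, "movie", 4)  # one partition holds at least 90 records
balanced = round_robin(records, 4)            # each partition holds exactly 25
```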