After two years of development, we have created a data platform that has enabled the following:
- Data scientists own and maintain their insights running in our products end to end
- The turnaround time for getting insights into our products is on the order of hours/days rather than weeks/months
- Our engineering and data science teams are able to interact and collaborate over a shared catalog of datasets and machine learning models using a common set of tooling
In my role as Lead Data Engineer, I'm tasked with bridging the divide between our data science team and the engineering unit running our products. The journey of an insight from idea into the hands of our customers is often labored and drawn out because of this divide.
The source of this divide is a difference in concerns. Engineering needs to ensure the product is reliable, stable, and secure, and that the data is handled with the highest level of governance and care. Data science needs to be able to quickly iterate, measure, and explore our data to maximize its value to the customer. Because of this difference in concerns, engineering can be seen as a bottleneck to innovation.
So how was our new workflow achieved? We had to focus on two key aspects of the divide. First, enhancing the boundary between engineering and data science to encourage specialization. Second, streamlining the interaction over that boundary with tooling.
Specialization is key to quality insights
Quality insights and predictive models come from dedicated data scientists focused on data science. Data science is an incredibly cross-functional discipline. In addition to staying on top of emerging technologies and algorithms in the data science community at large, data scientists must also remain immersed in the business domain they operate in, and maintain a working knowledge of all of the tooling at their disposal. Unsurprisingly, there is a high degree of pragmatism in the field. The exploratory nature of data science research, paired with this pragmatic perspective, presents a very different culture to that of a typical engineering team. Modelling the present to predict the future requires rapid iteration, throwaway code, and constant evaluation and tuning.
Compare and contrast this culture to that required to run a production system serving customers. Systems like that need to meet a slew of base requirements before getting anywhere near the business specific ones:
Reliability: customers expect our services will be available and ready to meet their needs;
Security: a non-negotiable, especially in a position as custodians of our customers' data;
Data usage: ensuring our data is used as our data providers intend (amplified with multi-tenancy);
Maintainability: production systems have long lifespans, and should be designed to cope with engineer churn;
Cost optimization: long running systems/processes incur an ongoing cost, and so must be as efficient as possible.
Now my point isn't that data scientists aren't capable of running systems like that. In fact, in many organisations they do. There's a growing list of tools and third-party services aiming to facilitate exactly that (e.g. AWS SageMaker). The point is that they shouldn't be required to. Research requires time and space: separation from other problems for a time. It shouldn't be surprising that a data science unit has little output or innovation if it is burdened with running production systems itself.
How do we let data scientists be data scientists? I would argue that we first need to remove the burden of production and data preparation as much as possible. A cleaner separation goes something like this:
- Data scientists are responsible for the correctness of their predictive models and datasets;
- Data engineers are responsible for reliably executing those models and datasets in production.
In this separation, data scientists are still accountable for the quality of their datasets and predictive models. If a model drifts or a dataset is backed by an erroneous algorithm/approach, they take responsibility for the fix. However, they are no longer burdened with the responsibility of reliably provisioning computing resources, ensuring correct access to data, cost management, or security. That responsibility falls on a data engineering team. Data engineers also serve another key function: propagating best practice and design for datasets and models back to data science.
At Movio we have a dedicated data engineering team that acts as an interface between the data science team and the rest of engineering. This frees up our data science team to focus on exploring new insights and predictive models, which has allowed us to stay at the cutting edge of moviegoer behavior prediction.
Streamlining the interaction between data science and engineering is the key to rapid innovation
A lurking danger in enabling specialization is that this may also lead to a communication breakdown between the two units. Drivers that can often lead to a sense of ‘siloism’ include:
Tooling: data engineers and data scientists will naturally gravitate to different sets of tooling to solve their respective problems. Data engineers will typically use standard software engineering tools (e.g. compilers, version control, IDEs). Data scientists will typically use notebooks (e.g. Jupyter notebooks) running a data-science-friendly ecosystem like Python or R;
Perspective: data scientists are focused on what can be done with the data, iterating rapidly and experimenting. Data engineers are focused on the governance of data, and efficient, reliable pipelines and services;
Pace: data engineers will typically operate slowly when making changes to pipelines and services. Data scientists will often move quickly, frequently writing throwaway code.
Siloing can have devastating effects on the pace of innovation. Consider the life-cycle of an insight or predictive model that makes it into a production system. Most will follow some variation of the following phases:
1. Obtain access to the raw source data required, clean up and prepare the data for use (ETL);
2. Experiment and iterate with the computation and data, testing hypotheses and hunches;
3. Loop back to 1 until the insight and data look like they might be good;
4. Refactor/re-engineer the insight or predictive model to meet the requirements of production;
5. Deploy the insight/model into production; and
6. Measure the efficacy of the insight/model as it’s used in the wild.
Our previous workflow at Movio followed a somewhat typical script for the productionization of insights/models. The data engineering team managed a data warehouse and also hosted services that allowed other engineering teams to query datasets. From the perspective of a data scientist, the workflow looked something like this:
0. (Sometimes needed) request that a new data source be added to the data warehouse;
1. Using a Jupyter notebook, read from the data warehouse and iterate on the outcome they were aiming for;
2. Once the dataset/model was ready, reach out to the data engineering team to have it productionized.
There were a number of pain points with this workflow:
- The data engineering team often had a large backlog of work, so scheduling time for implementation work would already introduce a significant delay;
- Once data science had a notebook containing their candidate for production, the data engineering team would need to spend time understanding the technologies used;
- The data engineering team would rewrite the algorithm to meet the requirements of production (and often use their own tool stack in the process);
- The data science team would lose connection with their insight/model in production.
By focussing on these pain points, we have been able to streamline our workflow. Augmenting the interaction between our teams with tooling has made it as effortless as possible, which in turn has encouraged the teams to interact more.
We have developed a platform that supports both specialization and streamlined interaction. This has been achieved by:
- Standardizing on shared tooling for dataset and model creation (Jupyter notebooks, Python ecosystem)
- Introducing a common ground for engineering and data science through a shared catalog
- Abstracting the storage and execution of datasets/models from their definition so that the responsibilities of data engineers and data scientists are clearly defined
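To make the last point concrete, here is a minimal sketch of what separating a definition from its storage and execution could look like. It is not our platform's actual code: `DatasetDefinition`, `Storage`, `InMemoryStorage`, `execute`, and the `frequent_visitors` example are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

Rows = List[dict]


@dataclass(frozen=True)
class DatasetDefinition:
    """Hypothetical: the part data science owns (a name, an owner, the logic)."""
    name: str
    owner: str
    compute: Callable[[Rows], Rows]


class Storage(Protocol):
    """Hypothetical: the part data engineering owns (where results are kept)."""
    def write(self, name: str, rows: Rows) -> None: ...
    def read(self, name: str) -> Rows: ...


class InMemoryStorage:
    """Toy storage backend; a real one might be a warehouse or object store."""
    def __init__(self) -> None:
        self._tables: Dict[str, Rows] = {}

    def write(self, name: str, rows: Rows) -> None:
        self._tables[name] = list(rows)

    def read(self, name: str) -> Rows:
        return self._tables[name]


def execute(definition: DatasetDefinition, source: Rows, storage: Storage) -> None:
    """Run a definition and persist its output; the definition never touches storage."""
    storage.write(definition.name, definition.compute(source))


# The data scientist writes only this function and the definition around it.
def frequent_visitors(rows: Rows) -> Rows:
    return [row for row in rows if row["visits"] >= 3]


definition = DatasetDefinition("frequent_visitors", "data-science", frequent_visitors)
storage = InMemoryStorage()
source = [{"member": "a", "visits": 5}, {"member": "b", "visits": 1}]
execute(definition, source, storage)
result = storage.read("frequent_visitors")
```

In this sketch, swapping `InMemoryStorage` for another backend requires no change to the definition, which is the property the separation is meant to provide.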
Our new workflow is as follows:
1. Data science creates new dataset/model definitions within a shared catalog (using Jupyter notebooks and minimal metadata files)
2. They request their ‘branch’ of work within the catalog be merged into the production branch
3. Data engineering reviews the request using the same tooling (providing feedback if required)
4. Once ready, the request is merged into the production branch, and the dataset/model is immediately executed in production
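The four steps above can be sketched as a toy model. This is an illustration under stated assumptions, not a description of our implementation: `CatalogEntry`, `MergeRequest`, `Catalog`, the branch name, and the notebook path are all hypothetical, and "execution" is reduced to recording the entry's name.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry: a notebook plus minimal metadata."""
    name: str
    notebook: str  # path to the Jupyter notebook holding the definition
    owner: str


@dataclass
class MergeRequest:
    """Hypothetical request to merge a branch of work into production."""
    branch: str
    entries: List[CatalogEntry]
    approved: bool = False
    feedback: List[str] = field(default_factory=list)


@dataclass
class Catalog:
    production: Dict[str, CatalogEntry] = field(default_factory=dict)
    executed: List[str] = field(default_factory=list)

    def review(self, request: MergeRequest, approve: bool, comment: str = "") -> None:
        """Step 3: data engineering reviews and optionally leaves feedback."""
        if comment:
            request.feedback.append(comment)
        request.approved = approve

    def merge(self, request: MergeRequest) -> None:
        """Step 4: merging an approved request executes its entries immediately."""
        if not request.approved:
            raise PermissionError("merge request has not been approved")
        for entry in request.entries:
            self.production[entry.name] = entry
            self._execute(entry)

    def _execute(self, entry: CatalogEntry) -> None:
        # Stand-in for handing the notebook to a production runner.
        self.executed.append(entry.name)


# Steps 1 and 2: data science defines an entry on a branch and requests a merge.
catalog = Catalog()
request = MergeRequest(
    branch="feature/churn-model",
    entries=[CatalogEntry("churn_model", "models/churn_model.ipynb", "data-science")],
)
catalog.review(request, approve=False, comment="Please document the input dataset.")
catalog.review(request, approve=True)
catalog.merge(request)
```

The design choice the sketch highlights is that review is the only gate: once a request is approved and merged, no separate hand-off or rewrite stands between the definition and production.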
Maintaining separate data engineering and data science teams has allowed us to innovate in both areas. Pairing this with the augmented interaction between the two units has allowed us to increase the pace at which we deliver new datasets/models into production. We've already had success with this approach in developing Movio Metrics in record time. As our tooling improves, the quality and quantity of new insights and predictive models will only increase.
In a future post we will deep dive into the technology of our platform. It's an exciting time to be in data science and data engineering here at Movio! Keep an eye on our Careers page for upcoming opportunities to join the Movio Crew.