We have encountered a similar setup in some of our ML projects: the product engineers have a core Java stack that provides a series of RESTful APIs over the data, which is stored in NoSQL (MongoDB), and the Data Engineers/Data Scientists author ML scripts in Python for training & feature engineering. Here are some of the things that have helped us:
- The Java product engineering team build RESTful APIs that allow the data engineers to pull data (JSON/CSV format) from MongoDB on a regular cadence; we use these extracts for model development/training/feature engineering (see the first sketch after this list).
- We use Quilt for managing our datasets like code (versioned, reusable data packages) and Git for versioning pickled ML models. With this approach you can revert or jump to any saved model state or epoch (for neural networks, for example) using versions, tags, and/or hashes (see the second sketch after this list).
- The data scientists use Jupyter notebooks for reviewing and working with the data & models. You can pull the data into a Pandas DataFrame really easily (it loads lazily) using Quilt -- it's just a Python package (see the third sketch after this list):
from quilt.data.username import my_data_package
- The Java engineering team use the exec() method of the Java Runtime class to process new/fresh data in batch via command-line Python scripts (authored by the Python data engineers -- these are executables with command-line arguments for selecting the right trained model & data inputs) that load the pickled models and generate predictions (see the last sketch after this list). See this SO answer.
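To make the first bullet concrete, here is a rough sketch of pulling a JSON extract from one of the Java team's REST endpoints into Pandas. The endpoint URL, query parameters, and auth token are made-up placeholders, not our real API:

```python
# Sketch: pull a JSON extract from a (hypothetical) REST endpoint into Pandas.
import pandas as pd
import requests

API_URL = "https://internal.example.com/api/v1/training-data"  # placeholder endpoint
TOKEN = "REPLACE_ME"  # placeholder auth token

response = requests.get(
    API_URL,
    params={"from": "2018-01-01", "to": "2018-01-31"},  # hypothetical date window
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
response.raise_for_status()

# Assume the endpoint returns a JSON array of records.
df = pd.DataFrame(response.json())
df.to_csv("training_extract.csv", index=False)
```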
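Second, a sketch of how we pin data and model versions. The package name, hash, and Git paths are placeholders, and the hash/version/tag keyword arguments reflect the Quilt 2.x API as I remember it -- verify against the Quilt docs for the version you run:

```python
# Sketch (Quilt 2.x): pin the exact data snapshot used to train a given model.
import quilt

# Latest snapshot of the shared package:
quilt.install("username/my_data_package", force=True)

# Reproducible snapshot, pinned by content hash (or use version=... / tag=...):
quilt.install("username/my_data_package", hash="0123abcd...", force=True)  # placeholder hash

# The pickled models themselves sit in Git, so jumping to the matching model
# state is an ordinary Git operation, e.g.:
#   git checkout <tag-or-commit> -- models/my_model.pkl
```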
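Third, what the notebook side typically looks like. The package and node names (my_data_package, transactions) are placeholders; in Quilt 2.x a data node is materialized as a DataFrame by calling it:

```python
# Sketch: lazily load a Quilt data node into Pandas inside a Jupyter notebook.
from quilt.data.username import my_data_package  # placeholder package

# Nodes load lazily; calling a data node returns it as a pandas DataFrame
# (Quilt 2.x behaviour -- check the docs for your version).
df = my_data_package.transactions()  # placeholder node name

df.head()
df.describe()
```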
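Last, a minimal sketch of the kind of command-line scoring script the Java service invokes. The argument names, file formats, and the assumption of a scikit-learn-style model are illustrative, not our exact production script:

```python
#!/usr/bin/env python
"""Batch scoring script invoked by the Java service via Runtime exec().

All paths and arguments below are illustrative placeholders.
"""
import argparse
import pickle

import pandas as pd


def main():
    parser = argparse.ArgumentParser(description="Score a batch of records with a pickled model.")
    parser.add_argument("--model", required=True, help="Path to the pickled, trained model")
    parser.add_argument("--input", required=True, help="CSV file of fresh records to score")
    parser.add_argument("--output", required=True, help="Where to write predictions (CSV)")
    args = parser.parse_args()

    # Load the trained model that the data scientists pickled.
    with open(args.model, "rb") as f:
        model = pickle.load(f)

    # Load the fresh batch and generate predictions (assumes a scikit-learn-style
    # estimator whose expected features match the input columns).
    batch = pd.read_csv(args.input)
    batch["prediction"] = model.predict(batch)

    batch.to_csv(args.output, index=False)


if __name__ == "__main__":
    main()
```

The Java side simply assembles the equivalent command line (python script, --model, --input, --output), passes it to exec(), and reads back the output CSV.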
I don't think this is a canonical solution or best practice per se, but it has helped us balance the needs of the Java engineering team and the Python DE/DS.
Some of the benefits of Quilt:
- Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
- Collaboration & transparency - Data likes to be shared. Quilt offers a unified catalog for finding and sharing data.
- Auditing - the registry tracks all reads and writes so that admins know when data are accessed or changed.
- Less data prep - the registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
- De-duplication - Data are identified by their SHA-256 hash. Duplicate data are written to disk once, for each user. As a result, large, repeated data fragments consume less disk and network bandwidth.
- Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
Source
I hope that helps! Good luck, and I'd like to hear how it goes.