We have encountered a similar setup in some of our ML projects: the product engineers have a core Java stack that provides a series of RESTful APIs over the data, which is stored in NoSQL (MongoDB), and the Data Engineers/Data Scientists author ML scripts in Python for training & feature engineering. Here are some of the things that have helped us:
- The Java product engineering team build RESTful APIs that allow the data engineers to pull data (JSON/CSV format) from MongoDB on a regular cadence; we use these extracts for model development/training/feature engineering (see the first sketch after this list).
- We use Quilt for managing our datasets like code (versioned, reusable data packages) and Git for versioning pickled ML models. With this approach you can revert or jump to any saved model state or epoch (for neural networks, for example) using versions, tags, and/or hashes (see the second sketch after this list).
- The data scientists use Jupyter notebooks for reviewing and working with the data & models. You can pull the data into a Pandas DataFrame really easily (it loads lazily) using Quilt -- it's just a Python package (see the third sketch after this list):
from quilt.data.username import my_data_package
- The Java engineering team use the exec() method of the Java Runtime class to process new/fresh data in batch via command-line Python scripts (authored by the Python data engineers -- these are executables with command-line arguments for selecting the right trained model & data inputs) that load the pickled models and generate predictions (see the last sketch after this list). See this SO answer.
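To make the first bullet concrete, here is a rough sketch of pulling a JSON extract from one of the Java team's REST endpoints into Pandas. The endpoint URL, query parameters, and auth token are made-up placeholders, not our real API:

```python
# Sketch: pull a JSON extract from a (hypothetical) REST endpoint into Pandas.
import pandas as pd
import requests

API_URL = "https://internal.example.com/api/v1/training-data"  # placeholder endpoint
TOKEN = "REPLACE_ME"  # placeholder auth token

response = requests.get(
    API_URL,
    params={"from": "2018-01-01", "to": "2018-01-31"},  # hypothetical date window
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
response.raise_for_status()

# Assume the endpoint returns a JSON array of records.
df = pd.DataFrame(response.json())
df.to_csv("training_extract.csv", index=False)
```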
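Second, a sketch of how we pin data and model versions. The package name, hash, and Git paths are placeholders, and the hash/version/tag keyword arguments reflect the Quilt 2.x API as I remember it -- verify against the Quilt docs for the version you run:

```python
# Sketch (Quilt 2.x): pin the exact data snapshot used to train a given model.
import quilt

# Latest snapshot of the shared package:
quilt.install("username/my_data_package", force=True)

# Reproducible snapshot, pinned by content hash (or use version=... / tag=...):
quilt.install("username/my_data_package", hash="0123abcd...", force=True)  # placeholder hash

# The pickled models themselves sit in Git, so jumping to the matching model
# state is an ordinary Git operation, e.g.:
#   git checkout <tag-or-commit> -- models/my_model.pkl
```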
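Third, what the notebook side typically looks like. The package and node names (my_data_package, transactions) are placeholders; in Quilt 2.x a data node is materialized as a DataFrame by calling it:

```python
# Sketch: lazily load a Quilt data node into Pandas inside a Jupyter notebook.
from quilt.data.username import my_data_package  # placeholder package

# Nodes load lazily; calling a data node returns it as a pandas DataFrame
# (Quilt 2.x behaviour -- check the docs for your version).
df = my_data_package.transactions()  # placeholder node name

df.head()
df.describe()
```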
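Last, a minimal sketch of the kind of command-line scoring script the Java service invokes. The argument names, file formats, and the assumption of a scikit-learn-style model are illustrative, not our exact production script:

```python
#!/usr/bin/env python
"""Batch scoring script invoked by the Java service via Runtime exec().

All paths and arguments below are illustrative placeholders.
"""
import argparse
import pickle

import pandas as pd


def main():
    parser = argparse.ArgumentParser(description="Score a batch of records with a pickled model.")
    parser.add_argument("--model", required=True, help="Path to the pickled, trained model")
    parser.add_argument("--input", required=True, help="CSV file of fresh records to score")
    parser.add_argument("--output", required=True, help="Where to write predictions (CSV)")
    args = parser.parse_args()

    # Load the trained model that the data scientists pickled.
    with open(args.model, "rb") as f:
        model = pickle.load(f)

    # Load the fresh batch and generate predictions (assumes a scikit-learn-style
    # estimator whose expected features match the input columns).
    batch = pd.read_csv(args.input)
    batch["prediction"] = model.predict(batch)

    batch.to_csv(args.output, index=False)


if __name__ == "__main__":
    main()
```

The Java side simply assembles the equivalent command line (python script, --model, --input, --output), passes it to exec(), and reads back the output CSV.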
I don't think this is a canonical solution or best practice per se, but it has helped us balance the needs of the Java engineering team and the Python DE/DS.
Some of the benefits of Quilt:
- Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
- Collaboration & transparency - Data likes to be shared. Quilt offers a unified catalog for finding and sharing data.
- Auditing - the registry tracks all reads and writes so that admins know when data are accessed or changed.
- Less data prep - the registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
- De-duplication - Data are identified by their SHA-256 hash. Duplicate data are written to disk once, for each user. As a result, large, repeated data fragments consume less disk and network bandwidth.
- Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
Source
I hope that helps! Good luck, and I'd like to hear how it goes.