I have trained a scikit-learn model (~70MB) which I want to use to make predictions with Apache Beam.

However, I am wondering whether using ParDo will load the model for each row, thereby using an enormous amount of resources:

import apache_beam as beam
import joblib
from subprocess import call

class PredictClass(beam.DoFn):
    def process(self, row):
        call([...])  # copy the model from a remote location, once per row
        model = joblib.load('model_path.pkl')
        yield model.predict([row])  # assumes each row is a feature vector

In my pipeline:

...
predict_p = (query_dbs | 'PredictClasses' >> beam.ParDo(PredictClass()))
...

Is there a better way to do it? Where should I load the trained classifier?

Alex

1 Answer

If you want to load a resource once for your whole DoFn, either implement the start_bundle method of the beam.DoFn class and load your model there, or implement lazy initialization yourself. This lets you load the model once* and then reuse it each time Apache Beam calls the process method of your implementation.

* it will not be exactly once, but you can reason about it that way.
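A minimal sketch of both approaches (the class names are illustrative, the model is assumed to already be available at the local path model_path.pkl, and each row is assumed to be a feature vector the model can consume):

import apache_beam as beam
import joblib

class PredictWithStartBundle(beam.DoFn):
    def start_bundle(self):
        # Runs once per bundle of rows, not once per row.
        self._model = joblib.load('model_path.pkl')  # assumed local path

    def process(self, row):
        yield self._model.predict([row])

class PredictWithLazyInit(beam.DoFn):
    def __init__(self):
        self._model = None

    def process(self, row):
        # Load on first use; the cached attribute can survive across
        # bundles because Beam reuses DoFn instances on a worker.
        if self._model is None:
            self._model = joblib.load('model_path.pkl')  # assumed local path
        yield self._model.predict([row])

Note the trade-off: start_bundle runs again for every bundle, while the lazily initialized attribute persists for as long as the same DoFn instance is reused.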

There is a great explanation, with examples and some performance tests, of initializing reusable and expensive-to-load objects in the Apache Beam Python SDK here: Apache Beam: DoFn.Setup equivalent in Python SDK

Marcin Zablocki