I have trained a scikit-learn model (~70MB) which I want to use to make predictions with Apache Beam.

However, I am wondering whether using ParDo will load the model for each row, thereby using an enormous amount of resources:

import apache_beam as beam
import joblib
from subprocess import call

class PredictClass(beam.DoFn):
    def process(self, row):
        call([...])  # copy the model from a remote location, once per row
        model = joblib.load('model_path.pkl')
        yield model.predict([row])  # assumes each row is a feature vector

In my pipeline:

...
predict_p = (query_dbs | 'PredictClasses' >> beam.ParDo(PredictClass()))
...

Is there a better way to do it? Where should I load the trained classifier?

Alex

1 Answer

If you want to load a resource once for your whole DoFn, either implement the start_bundle method of the beam.DoFn class and load your model there, or implement lazy initialization yourself. This lets you load the model once* and then reuse it each time Apache Beam calls the process method of your implementation.

* it will not be exactly once, but you can reason about it that way.
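A minimal sketch of both approaches (the class names are illustrative, the model is assumed to already be available at the local path model_path.pkl, and each row is assumed to be a feature vector the model can consume):

import apache_beam as beam
import joblib

class PredictWithStartBundle(beam.DoFn):
    def start_bundle(self):
        # Runs once per bundle of rows, not once per row.
        self._model = joblib.load('model_path.pkl')  # assumed local path

    def process(self, row):
        yield self._model.predict([row])

class PredictWithLazyInit(beam.DoFn):
    def __init__(self):
        self._model = None

    def process(self, row):
        # Load on first use; the cached attribute can survive across
        # bundles because Beam reuses DoFn instances on a worker.
        if self._model is None:
            self._model = joblib.load('model_path.pkl')  # assumed local path
        yield self._model.predict([row])

Note the trade-off: start_bundle runs again for every bundle, while the lazily initialized attribute persists for as long as the same DoFn instance is reused.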

There is a great explanation, with examples and some performance tests, of initializing reusable and expensive-to-load objects in the Apache Beam Python SDK here: Apache Beam: DoFn.Setup equivalent in Python SDK

Marcin Zablocki