I have trained a scikit-learn model (~70 MB) that I want to use to make predictions with Apache Beam.
However, I am wondering whether using a ParDo will load the model for each row, and hence use an enormous amount of resources:
```python
import apache_beam as beam
import joblib
from subprocess import call

class PredictClass(beam.DoFn):
    def process(self, row):
        call([...])  # copy the model from a remote location
        model = joblib.load('model_path.pk1')
```
In my pipeline:
```python
...
predict_p = (query_dbs | 'PredictClasses' >> beam.ParDo(PredictClass()))
...
```
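One alternative I have considered (just a sketch on my part, assuming the model file has already been copied onto the worker, and using Beam's `DoFn.setup` hook) is to load the classifier once per DoFn instance rather than once per element:

```python
import apache_beam as beam
import joblib

class PredictClass(beam.DoFn):
    def setup(self):
        # Runs once per DoFn instance (per worker), not once per element,
        # so the ~70 MB model is loaded into memory only once.
        self.model = joblib.load('model_path.pk1')

    def process(self, row):
        # Reuse the already-loaded model for every element.
        # Assumes `row` is a single feature vector.
        yield self.model.predict([row])
```

But I am not sure whether `setup()` is the right place for this, or whether something like `start_bundle()` would be more appropriate.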
Is there a better way to do it? Where should I load the trained classifier?