
We train machine learning models offline and persist them as Python pickle files.

We are wondering about the best way to embed those pickled models into a stream (e.g. sensorInputStream > PredictionJob > OutputStream).

Apache Flink ML seems to be the right choice for training a model on streaming data, but not for referencing an existing model.

Thanks for your response.

Kind Regards Lomungo

Green Lomu
  • Potential workaround: I guess we could use Flink's Async I/O (https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html) to send requests from the stream to an ML API (e.g. Flask) – Green Lomu Jan 02 '20 at 12:36

1 Answer


There are two possible solutions, depending on the model you are using:

  1. Possibly the simplest idea is to create an external service that calls the model and returns the results, and then call that service from an AsyncFunction.
  2. Use a library, again depending on your model, to load the pre-trained model inside a ProcessFunction's open method, and then call the model for each record that arrives.
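For the first option, the external service could look roughly like the sketch below. It uses only the Python standard library so that it runs without extra dependencies; a framework such as Flask would work just as well. The `MeanModel` class, the `/`-agnostic POST handler, the `{"features": ...}` request shape, and port 8080 are all assumptions made for illustration, not something prescribed by Flink or by your models.

```python
import json
import pickle
from http.server import BaseHTTPRequestHandler, HTTPServer


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def load_model(path):
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_from_json(model, body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)["features"]
    predictions = model.predict(features)
    # Assumes numeric predictions; float() also covers NumPy scalar types.
    return json.dumps({"predictions": [float(p) for p in predictions]})


def make_handler(model):
    class PredictHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            response = predict_from_json(model, self.rfile.read(length))
            payload = response.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    return PredictHandler


def serve(model_path, port=8080):
    # The model is loaded once at startup and reused for every request.
    model = load_model(model_path)
    HTTPServer(("0.0.0.0", port), make_handler(model)).serve_forever()
```

The Flink job would then POST a batch of feature rows to this service from its AsyncFunction and emit the returned predictions downstream.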

The second solution has two disadvantages: first, you need a Java version of the specific library, and second, you need to somehow externalize the model's metadata if you want to be able to update it over time.
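The load-once pattern behind the second option, together with one way to pick up an updated model, can be sketched in plain Python. This is deliberately not the Flink API: `PredictionOperator` is a hypothetical stand-in whose `open` and `process_element` methods only mirror the operator lifecycle, and the pickle is assumed, purely for illustration, to hold a dict of linear-regression weights.

```python
import os
import pickle


class PredictionOperator:
    """Plain-Python stand-in for a stream operator; not the Flink API."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None
        self.loaded_mtime = None

    def open(self):
        # Called once before any record is processed: load the model here,
        # not once per record.
        self._load()

    def _load(self):
        # Only unpickle files you trust: pickle can execute arbitrary code.
        with open(self.model_path, "rb") as f:
            self.model = pickle.load(f)
        self.loaded_mtime = os.path.getmtime(self.model_path)

    def process_element(self, features):
        # Reload if the model file changed since the last load. Checking on
        # every record keeps the sketch short; a timer would be cheaper.
        if os.path.getmtime(self.model_path) != self.loaded_mtime:
            self._load()
        weights = self.model
        return weights["intercept"] + sum(
            c * x for c, x in zip(weights["coef"], features)
        )
```

Keeping the model in an external file (or any other external store) is what makes the update possible without redeploying the job; the modification-time check is just one simple way to notice a new version.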

Dominik Wosiński
  • Thanks for your response. To answer your question: there are several models, from LinearReg to more complex ones. For (2), are you thinking about something like JEP (https://github.com/ninia/jep)? I understand the disadvantages you pointed out, but isn't handling everything in the job itself, as in (2), most likely more performant? – Green Lomu Jan 02 '20 at 12:46
  • Yes, you are right. If you have the model embedded directly in the job itself, it will certainly be faster, since you don't have the additional latency of waiting for the external service. How much faster depends on the setup and on where the job and the service are located. I didn't really mean embedding the whole of Python in Java. For example, the TensorFlow library is available for Java too, so you can train your model in Python, save the weights, add them to your Flink project as a resource, and then read them using Java TensorFlow – Dominik Wosiński Jan 02 '20 at 13:03
  • I think 1 and 2 are both good options. Thanks a lot - I'll set this to solved. – Green Lomu Jan 02 '20 at 18:13
  • https://github.com/FlinkML/flink-jpmml is sometimes used as a way to load pre-trained models into Flink operators. – David Anderson Jan 03 '20 at 13:33
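The "train in Python, save the weights, read them in the job" idea from the comments can be illustrated with the simplest case mentioned in the thread, a linear regression. The sketch below is an assumption-laden simplification: it writes the weights to JSON (a format any JVM library can parse) rather than a TensorFlow format, and `export_weights` and `predict` are hypothetical helper names. The `predict` function shows the arithmetic the Java side would have to reproduce.

```python
import json


def export_weights(coef, intercept):
    """Serialize linear-model weights to a language-neutral JSON string."""
    return json.dumps(
        {"coef": [float(c) for c in coef], "intercept": float(intercept)}
    )


def predict(weights_json, features):
    """Reference implementation of what the consuming side computes."""
    weights = json.loads(weights_json)
    if len(features) != len(weights["coef"]):
        raise ValueError("feature count does not match number of coefficients")
    return weights["intercept"] + sum(
        c * x for c, x in zip(weights["coef"], features)
    )


# Example: weights as they might come from a fitted model (values made up).
weights_json = export_weights([0.5, -1.0], 2.0)
print(predict(weights_json, [4.0, 1.0]))
```

Unlike a pickle, such a file carries no Python-specific objects, so it can be bundled with the Flink project as a resource and parsed there.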