I'd like to know if there is a way (or some sort of code example) to load an encoded pre-trained model (written in Python) inside a Flink streaming application, so that I can fit the model using the weights loaded from the file system and the data coming from the stream.

Thank you in advance

elton stafa

1 Answer

You can do this in several ways. Generally, the simplest is to download the model from some external storage (S3, for example) in the open method of your function. You can then use the library of your choice to load the pre-trained weights and process the data. You can look for some inspiration here; that code loads a protobuf-serialized model read from Kafka, but you can use it to understand the principles.
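
As a rough illustration of the "load once in open, reuse per record" idea, here is a minimal Python sketch. It deliberately does not import Flink so that it runs anywhere: `ModelScorer` only mimics the open/map lifecycle of a rich function, and in PyFlink I believe a class like this would subclass the DataStream API's `MapFunction` instead. The names `ModelScorer`, `fetch_model` and `fake_fetch`, the JSON weight format, and the toy linear model are all assumptions made for the example, not part of any library.

```python
import json
from typing import Callable, Dict, Optional


class ModelScorer:
    """Stand-in for a Flink rich function: open() runs once, map() runs per record."""

    def __init__(self, fetch_model: Callable[[], bytes]):
        # fetch_model is a hypothetical callable returning the serialized model,
        # for example by downloading it from object storage.
        self._fetch_model = fetch_model
        self._weights: Optional[Dict[str, float]] = None

    def open(self, runtime_context=None) -> None:
        # Load the model once, before any record is processed.
        raw = self._fetch_model()
        self._weights = json.loads(raw.decode("utf-8"))

    def map(self, features: Dict[str, float]) -> float:
        if self._weights is None:
            raise RuntimeError("open() must be called before map()")
        # Toy linear model: bias plus the weighted sum of known features.
        score = self._weights.get("bias", 0.0)
        for name, value in features.items():
            score += self._weights.get(name, 0.0) * value
        return score


def fake_fetch() -> bytes:
    # Hypothetical replacement for a real download, so the sketch runs offline.
    return json.dumps({"bias": 0.5, "x1": 2.0, "x2": -1.0}).encode("utf-8")


scorer = ModelScorer(fake_fetch)
scorer.open()
print(scorer.map({"x1": 1.0, "x2": 3.0}))
```

The point of the split is that the expensive fetch and deserialization happen once per function instance, while `map` only does the cheap per-record scoring.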

Normally I wouldn't recommend reading the model from the file system, as it's much less flexible and more troublesome to maintain. But that is possible too, depending on your infrastructure setup. The only requirement in that case is to make sure that the model file is available on the machines that the pipeline will run on.
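
If you do go the file system route, one way to check availability is to fail fast when the file is missing, so a misconfigured machine is noticed at startup rather than on the first record. The sketch below is only an assumption about how that could look: `load_local_model` is a hypothetical helper, the weights are assumed to be stored as JSON, and a temporary file stands in for the deployed model.

```python
import json
import os
import tempfile
from typing import Dict


def load_local_model(path: str) -> Dict[str, float]:
    """Load JSON weights from a local path, failing fast if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Model file {path!r} is not available on this machine"
        )
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Demo: a temporary file stands in for the model shipped to the worker.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.json")
    with open(model_path, "w", encoding="utf-8") as handle:
        json.dump({"bias": 0.5}, handle)
    weights = load_local_model(model_path)
    print(weights)
```

Calling a helper like this from the function's open method would surface a missing file as a clear startup error.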

Dominik Wosiński
  • Just to add: if the model is small enough and static, you could even bundle it in your jar. – Arvid Heise Oct 08 '20 at 08:32
  • Thank you for the answer. I was thinking of training the model in Python, saving it somewhere or bundling it in my jar, then loading the model and applying the prediction on the Flink datastream. To do this I was looking at xgboost4j-flink or flink-jpmml. Do you suggest any other libraries, maybe something you have already used? – elton stafa Oct 08 '20 at 10:46
  • I guess it largely depends on the type of model you want to use. You don't necessarily need to look for a Flink version of the ML libs; xgboost4j seems like a good choice if you have a model of that type. – Dominik Wosiński Oct 08 '20 at 11:11
  • xgboost4j-flink and flink-jpmml seem like good choices, if they meet your needs. But fwiw, stateful functions and pyflink provide ways to execute python code directly. See https://github.com/ververica/flink-statefun-workshop for an example using stateful functions. https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/python/table-api-users-guide/conversion_of_pandas.html covers using numpy and pandas with pyflink -- not sure about other libraries, but I believe it should be possible. – David Anderson Oct 08 '20 at 18:05