Assume this scenario:
We analyze the data, train some machine learning models using whatever tool we have at hand, and save those models. This is done in Python, using the Apache Spark Python shell and API. We know Apache Spark is good at batch processing, hence a good choice for the above scenario.
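To make the scenario concrete, here is a minimal sketch of what I mean by training and saving a model. The file paths, column names, and the choice of scikit-learn's LogisticRegression are just placeholders for whatever tool happens to be at hand:

```python
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression
import joblib

spark = SparkSession.builder.appName("training").getOrCreate()

# Prepare the data in batch with Spark, then pull it into the driver
# so that an arbitrary Python library can be used for training.
df = spark.read.parquet("/data/training.parquet")  # hypothetical path
pdf = df.toPandas()

model = LogisticRegression()
model.fit(pdf[["feature_a", "feature_b"]], pdf["label"])  # hypothetical columns

# Persist the trained model so it can be reused at serving time.
joblib.dump(model, "model.joblib")
```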
Now, going into production, for each incoming request we need to return a response that also depends on the output of the trained model. This is, I assume, what people call stream processing, and Apache Flink is usually recommended for it. But how would you use the same models, trained with tools available in Python, in a Flink pipeline?
The micro-batch mode of Spark wouldn't work here, since we really need to respond to each request individually, not in batches.
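For illustration, the per-request behaviour I'm after is essentially the following (assuming the model was saved with joblib as above; the handler and feature names are hypothetical):

```python
import joblib

# Load the trained model once, at service start-up.
model = joblib.load("model.joblib")

def handle_request(request):
    # Score a single incoming request immediately, not as part of a batch.
    features = [[request["feature_a"], request["feature_b"]]]
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}
```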
I've also seen some libraries trying to do machine learning on Flink, but that doesn't satisfy the needs of people whose tools are in Python rather than Scala, and who may not even be familiar with Scala.
So the question is, how do people approach this problem?
This question is related, but not a duplicate, since the author there explicitly mentions using Spark MLlib. That library runs on the JVM and has more potential to be ported to other JVM-based platforms. But here the question is how people would approach it if, say, they were using scikit-learn, GPy, or whatever other method/package they use.