I need to migrate a Java program to Apache Spark. The current Java program heavily relies on java.util.concurrent and runs on a single machine. Since initializing a worker (a Callable) is expensive, the workers are reused again and again, i.e. a worker reinserts itself into the pool once it has terminated and returned its result.
More precisely:
- The current implementation works on small data sets, in the range of 10^6 entries / a few GB.
- The data consists of entries that can be processed independently. That is, one could fire up one worker per entry and submit it to the Java thread pool.
- However, setting up a worker to process an entry involves loading additional data, building graphs, and so on; all together this costs several GB of memory AND CPU time in the range of minutes.
- Some data can indeed be shared among the workers, e.g. some look-up tables, but it does not have to be. Other data is private to a worker and thus not shared. A worker may modify its private data while processing an entry and only reset it afterwards in a fast manner, e.g. caches specific to the entry currently being processed. Thus, a worker can reinsert itself into the pool and start working on the next entry without going through the expensive initialization.
- Runtime per worker and entry is in the range of seconds.
- The workers hand their results back via an ExecutorCompletionService, i.e. the results are later retrieved by calling pool.take().get() in a central part of the program (sketched below).
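For illustration, here is a stripped-down sketch of the current design. Entry, Result and the processing logic are placeholders, not the real code:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.*;

// Placeholders for the real types.
class Entry {}
class Result {}

// Expensive to construct; after each entry it reinserts itself into the
// pool so that the next entry skips the initialization.
class HeavyWorker implements Callable<Result> {
    private final CompletionService<Result> pool;
    private final Queue<Entry> entries;

    HeavyWorker(CompletionService<Result> pool, Queue<Entry> entries) {
        this.pool = pool;
        this.entries = entries;
        // one-time init: load GBs of data, build graphs (minutes of CPU)
    }

    @Override
    public Result call() {
        Entry e = entries.poll();     // non-blocking: null once the queue is drained
        if (e == null) return null;
        Result r = process(e);        // seconds of work per entry
        resetEntryCaches();           // fast reset of entry-specific caches
        pool.submit(this);            // reinsert this already-initialized worker
        return r;
    }

    private Result process(Entry e) { return new Result(); }
    private void resetEntryCaches() { }
}

class CentralPart {
    static void run(List<Entry> allEntries, int nThreads) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(nThreads);
        CompletionService<Result> pool = new ExecutorCompletionService<>(exec);
        Queue<Entry> entries = new ConcurrentLinkedQueue<>(allEntries);
        // one expensive worker per thread; each keeps reinserting itself
        for (int i = 0; i < nThreads; i++) pool.submit(new HeavyWorker(pool, entries));

        int collected = 0;
        while (collected < allEntries.size()) {
            Result r = pool.take().get();  // central retrieval of results
            if (r != null) collected++;    // null = a worker found no more entries
        }
        exec.shutdown();
    }
}
```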
Getting to know Apache Spark, I find that most examples just use standard transformations and actions. I also find examples that add their own functions to the DAG by extending the API. Still, those examples all stick to simple, lightweight calculations and come without initialization costs.
I now wonder what the best approach is to design a Spark application that reuses some kind of "heavy worker". The executors seem to be the only persistent entities that could possibly hold a pool of such workers. However, being new to the world of Spark, I am most likely missing some point...
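From what I have read so far, one way to keep expensive state alive inside an executor seems to be a lazily initialized singleton: static fields are not serialized with the tasks and live as long as the executor JVM, so the initialization would run at most once per executor. A sketch of that idea, where WorkerHolder and HeavyState are hypothetical names:

```java
// Hypothetical placeholder for the expensively initialized worker state.
class HeavyState {
    HeavyState() { /* load GBs of data, build graphs: minutes of CPU */ }
}

// Static state is not serialized with tasks and lives as long as the
// executor JVM, so get() pays the initialization cost at most once per executor.
final class WorkerHolder {
    private static HeavyState instance;

    private WorkerHolder() {}

    static synchronized HeavyState get() {
        if (instance == null) {
            instance = new HeavyState();
        }
        return instance;
    }
}
```

A task would then call WorkerHolder.get() inside its function instead of constructing the state itself. Whether this is idiomatic Spark is exactly what I am unsure about.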
Edit 2016-10-07:
I found an answer that points to a (possible) solution using Functions. So the question is: can I
- Split my data into as many partitions as there are executors,
- Give each executor exactly one partition to work on,
- Have my Function (called setup in the linked solution) create a thread pool and reuse the workers,
- Merge the results later via a separate combine function?
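Something like the following sketch is what I have in mind (Spark 2.x Java API assumed; Entry, Result, HeavyState and loadEntries are placeholder names, not real APIs):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class HeavyWorkerSketch {
    static class Entry implements Serializable {}
    static class Result implements Serializable {}

    // Expensive to construct, cheap to reset between entries.
    static class HeavyState {
        HeavyState() { /* load GBs of data, build graphs: minutes of CPU */ }
        Result process(Entry e) { /* seconds per entry */ return new Result(); }
        void resetEntryCaches() { /* fast */ }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("heavy-worker-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            int numExecutors = 8;                       // match the cluster setup
            // one partition per executor, so the expensive init runs once each
            JavaRDD<Entry> entries = sc.parallelize(loadEntries(), numExecutors);

            JavaRDD<Result> results = entries.mapPartitions(it -> {
                HeavyState state = new HeavyState();    // "setup": once per partition
                List<Result> out = new ArrayList<>();
                while (it.hasNext()) {
                    out.add(state.process(it.next()));
                    state.resetEntryCaches();           // fast per-entry reset
                }
                return out.iterator();
            });

            List<Result> merged = results.collect();    // the "combine" step
        }
    }

    static List<Entry> loadEntries() { return new ArrayList<>(); }
}
```

For simplicity, the sketch processes each partition sequentially with a single heavy worker; the thread-pool variant I describe above would instead create several HeavyState instances inside the same mapPartitions call and feed them from the partition's iterator.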