I am working on code which uses an ExecutorService to parallelize tasks (think machine learning computations done over a small dataset, over and over again). My goal is to execute some code as fast as possible, many times, and store the results somewhere (the total will be on the order of 100M executions at least).
The logic looks something like this (it's a simplified example):
dbconn = new dbconn() // this is reused by all threads
for a in listOfSize1000:
    for b in listOfSize10:
        for c in listOfSize2:
            taskCompletionExecutorService.submit(new runner(a, b, c, dbconn))
At the end, taskCompletionExecutorService.take() is called and I store the Result from each Future in a DB. But this approach stops scaling after a point.
So this is what I am doing right now in Spark (it is a brutal hack, but I am looking for suggestions on how best to structure it):
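In actual Java, the executor version is roughly the following (a minimal sketch; Runner, Result, DbConn and the element types A, B, C are placeholders for my real classes, and Runner is assumed to implement Callable<Result>):

import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

void runAll(List<A> listOfSize1000, List<B> listOfSize10, List<C> listOfSize2) throws Exception {
    ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    CompletionService<Result> completion = new ExecutorCompletionService<>(pool);

    DbConn dbconn = new DbConn(); // reused by all tasks
    int submitted = 0;
    for (A a : listOfSize1000)
        for (B b : listOfSize10)
            for (C c : listOfSize2) {
                completion.submit(new Runner(a, b, c, dbconn)); // Runner implements Callable<Result>
                submitted++;
            }

    // take() blocks until some task has finished; store each Result as it arrives
    for (int i = 0; i < submitted; i++) {
        Result r = completion.take().get();
        dbconn.store(r);
    }
    pool.shutdown();
}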
sparkContext.parallelize(listOfSize1000).filter(a -> {
    dbconn = new dbconn() // cannot init it outside parallelize since it's not serializable
    for b in listOfSize10:
        for c in listOfSize2:
            Result r = new runner(a, b, c, dbconn)
            dbconn.store(r)
    return true // the return value serves no purpose
}).count();
This approach looks inefficient to me since it's not truly parallelizing on the smallest unit of work, although the job works alright. Also, count() is not really doing anything for me; I added it only to trigger the execution. It was inspired by the Pi estimation example here: http://spark.apache.org/examples.html
So, any suggestions on how I can better structure my Spark runner so that I can use the Spark executors efficiently?
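For completeness, here is roughly what that hack looks like against the Java Spark API (again only a sketch; Runner, Result, DbConn and the element types A, B, C are the same placeholders as above):

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("runner"));

sc.parallelize(listOfSize1000)
  .filter(a -> {
      DbConn dbconn = new DbConn();  // created per element; DbConn is not serializable
      for (B b : listOfSize10) {     // the two small lists are captured in the closure
          for (C c : listOfSize2) {
              Result r = new Runner(a, b, c, dbconn).call();
              dbconn.store(r);
          }
      }
      return true;                   // the boolean serves no purpose
  })
  .count();                          // count() only exists to trigger the job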