I am trying to run 1000+ embarrassingly parallel jobs with PySpark on a cluster. I instantiate 5 executors with 20 cores each, so as far as I understand the cluster should be able to execute 100 tasks concurrently.
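For context, this is roughly how I set things up (just a sketch of my configuration; the executor-count setting may work differently depending on the cluster manager, and the app name is made up):
from pyspark import SparkConf, SparkContext
# 5 executors x 20 cores = 100 task slots, as far as I understand
conf = (SparkConf()
        .setAppName("parallel-jobs")
        .set("spark.executor.instances", "5")
        .set("spark.executor.cores", "20"))
sc = SparkContext(conf=conf)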
This is what I have so far, where "values" is a list of tuples and each tuple represents one job:
values = [(something, something_else),..., (howdy, partner)]
rdd_values = sc.parallelize(values)
results = rdd_values.map(my_func).collect()
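If it's relevant: I know parallelize takes a numSlices argument that controls how many partitions (and therefore tasks) the tuples are split into, so I've considered something like the following sketch (the factor of 2 is just a guess on my part, not something I've benchmarked):
# split the work into ~2x the number of cores so all 100 slots stay busy
num_partitions = 5 * 20 * 2
rdd_values = sc.parallelize(values, numSlices=num_partitions)
results = rdd_values.map(my_func).collect()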
A few questions:
- Is this really the recommended way of doing this in PySpark?
- The Spark UI is really unhelpful once I run more tasks than I have cores available. What am I missing?
- Why do people use PySpark? It seems so cumbersome
I know this is probably not what PySpark is best suited for, but it is the tool I have available.
Rant
Furthermore, it's really frustrating that there seems to be no way of getting a simple progress bar out of this (or is there?). The Spark UI only gets me so far.
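The closest thing to a progress bar I can think of is polling sc.statusTracker() from a second thread while collect() blocks. statusTracker, getActiveStageIds and getStageInfo do exist in PySpark, but the wiring below is my own rough, untested sketch:
import threading
import time

# run collect() in a background thread and poll task counts from the driver;
# assumes sc, rdd_values and my_func from above
out = {}

def run_job():
    out["results"] = rdd_values.map(my_func).collect()

worker = threading.Thread(target=run_job)
worker.start()

tracker = sc.statusTracker()
while worker.is_alive():
    for stage_id in tracker.getActiveStageIds():
        info = tracker.getStageInfo(stage_id)
        if info:  # can be None once the stage has finished
            print(f"stage {stage_id}: {info.numCompletedTasks}/{info.numTasks} tasks done")
    time.sleep(5)

worker.join()
results = out["results"]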