I am trying to run 1000+ embarrassingly parallel jobs with PySpark on a cluster. I instantiate 5 executors with 20 cores each, so as far as I understand the cluster should be able to execute 100 tasks concurrently.
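For context, this is roughly how I set things up (just a sketch of my configuration; the executor-count setting may work differently depending on the cluster manager, and the app name is made up):
from pyspark import SparkConf, SparkContext
# 5 executors x 20 cores = 100 task slots, as far as I understand
conf = (SparkConf()
        .setAppName("parallel-jobs")
        .set("spark.executor.instances", "5")
        .set("spark.executor.cores", "20"))
sc = SparkContext(conf=conf)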
This is what I have so far, where "values" is a list of tuples and each tuple represents one job:
values = [(something, something_else),..., (howdy, partner)]
rdd_values = sc.parallelize(values)
results = rdd_values.map(my_func).collect()
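If it's relevant: I know parallelize takes a numSlices argument that controls how many partitions (and therefore tasks) the tuples are split into, so I've considered something like the following sketch (the factor of 2 is just a guess on my part, not something I've benchmarked):
# split the work into ~2x the number of cores so all 100 slots stay busy
num_partitions = 5 * 20 * 2
rdd_values = sc.parallelize(values, numSlices=num_partitions)
results = rdd_values.map(my_func).collect()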
A few questions:
- Is this really the recommended way of doing this in PySpark?
- The Spark UI is really unhelpful once I run more tasks than I have cores available. What am I missing?
- Why do people use PySpark? It seems so cumbersome
I know this is probably not what PySpark is best suited for, but it is the tool I have available.
Rant
Furthermore, it's really frustrating that there seems to be no way of getting a simple progress bar out of this (or is there?). The Spark UI only gets me so far.
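The closest thing to a progress bar I can think of is polling sc.statusTracker() from a second thread while collect() blocks. statusTracker, getActiveStageIds and getStageInfo do exist in PySpark, but the wiring below is my own rough, untested sketch:
import threading
import time

# run collect() in a background thread and poll task counts from the driver;
# assumes sc, rdd_values and my_func from above
out = {}

def run_job():
    out["results"] = rdd_values.map(my_func).collect()

worker = threading.Thread(target=run_job)
worker.start()

tracker = sc.statusTracker()
while worker.is_alive():
    for stage_id in tracker.getActiveStageIds():
        info = tracker.getStageInfo(stage_id)
        if info:  # can be None once the stage has finished
            print(f"stage {stage_id}: {info.numCompletedTasks}/{info.numTasks} tasks done")
    time.sleep(5)

worker.join()
results = out["results"]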