
I have PySpark code with 3 functions. The first function loads some data and prepares it for the other two functions. The other two functions take this output, do their work, and generate their respective outputs.

So the code will look something like this:

```python
def first_function():
    # load data
    # pre-process
    # return pre-processed data
    pass

def second_function(output_of_first_function):
    # tasks for second function
    # return output
    pass

def third_function(output_of_first_function):
    # tasks for third function
    # return output
    pass
```

And these functions are called from a main function like this:

```python
def main():
    output_from_first_function = first_function()
    output_from_second_function = second_function(output_from_first_function)
    output_from_third_function = third_function(output_from_first_function)
```

There is no interdependence between second_function and third_function. I'm looking for a way to run these two functions in parallel at the same time. There are some transforms happening inside these functions, so it may help to run them in parallel.

How can I run second_function and third_function in parallel? Should each of these functions create its own Spark context, or can they share one?

Sreeram TP
  • I'm not a Spark expert, but I would have expected Spark to be smart here when it builds up its DAG since it should be able to see that there is no dependency between `output_from_second_function` and `output_from_third_function`. So I'd be surprised if these are not already being parallelized automatically. – 0x5453 Jul 15 '20 at 13:33
  • @0x5453 Spark actions are blocking, so if two actions in separate functions are called one after the other, they won't run in parallel with each other (though each action is itself executed in a parallel fashion on the cluster). In short: it's possible to run jobs in parallel on the same Spark context through threads, but whether they really run at the same time depends on the scheduling and the cluster; see: https://stackoverflow.com/questions/49568940/how-to-run-multiple-spark-jobs-in-parallel – Daniel Jul 15 '20 at 15:21
  • Does this answer your question? [How to run multiple Spark jobs in parallel?](https://stackoverflow.com/questions/49568940/how-to-run-multiple-spark-jobs-in-parallel) – Daniel Jul 15 '20 at 15:22
  • @Daniel the answer in the link partially answers my question. I want to know how to do the parallelization inside PySpark code. I can't do separate spark-submits, as my code is part of a bigger module. I was looking for an example of how to do it from PySpark – Sreeram TP Jul 15 '20 at 16:31

1 Answer


From your problem, it doesn't seem like you really need anything PySpark-specific here. I think you should consider using Python's threading library, as described in this post: How to run independent transformations in parallel using PySpark?
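Here is a minimal sketch of what that could look like, assuming second_function and third_function only trigger jobs on the one SparkSession/SparkContext that is already running. The concurrent.futures.ThreadPoolExecutor wiring is my own illustration (not code from the linked post); the function names mirror the ones in the question.

```python
from concurrent.futures import ThreadPoolExecutor

def main():
    # Prepare the shared input once; both downstream functions reuse it.
    output_from_first_function = first_function()

    # Run the two independent functions in driver-side threads. Both
    # threads submit jobs to the same SparkContext, so the cluster can
    # schedule them concurrently (subject to scheduler mode and resources).
    with ThreadPoolExecutor(max_workers=2) as executor:
        second_future = executor.submit(second_function, output_from_first_function)
        third_future = executor.submit(third_function, output_from_first_function)

        # result() blocks until the corresponding function finishes
        # and re-raises any exception thrown inside it.
        output_from_second_function = second_future.result()
        output_from_third_function = third_future.result()

    return output_from_second_function, output_from_third_function
```

Both functions should share the existing SparkSession/SparkContext; PySpark only supports one active SparkContext per application, so there is no need for each function to create its own. Whether the two jobs actually overlap depends on the scheduler configuration (e.g. spark.scheduler.mode=FAIR) and on free executors, as Daniel pointed out in the comments. If output_from_first_function is an expensive DataFrame, caching it before starting the threads avoids recomputing the preprocessing in both jobs.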