I have a PySpark job with three functions. The first function loads some data and prepares it for the other two. The other two functions take this output, perform their respective tasks, and generate their own outputs.
So the code looks something like this:
def first_function():
    # load data
    # pre-process
    # return pre-processed data

def second_function(output_of_first_function):
    # tasks for second function
    # return output

def third_function(output_of_first_function):
    # tasks for third function
    # return output
And these functions are called from a main function like this:
def main():
    output_from_first_function = first_function()
    output_from_second_function = second_function(output_from_first_function)
    output_from_third_function = third_function(output_from_first_function)
There is no interdependence between second_function and third_function. I'm looking for a way to run these two functions in parallel at the same time. There are some transformations happening inside these functions, so running them in parallel may speed things up.
How can I run second_function and third_function in parallel? Should each of these functions create its own Spark context, or can they share one?
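For context, this is the kind of approach I was considering: driving the two independent functions from Python threads on the driver while sharing the single SparkSession (and SparkContext) created in main(). It is only a minimal sketch; the spark.range data and the transformations inside the functions are placeholders standing in for my actual load/pre-process/task logic.

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def first_function(spark):
    # placeholder: load and pre-process some data, then cache it
    # because both downstream functions will reuse it
    df = spark.range(0, 1000).withColumn("value", F.col("id") * 2)
    return df.cache()

def second_function(df):
    # placeholder task: an aggregation over the shared input
    return df.groupBy((F.col("id") % 10).alias("bucket")).count().collect()

def third_function(df):
    # placeholder task: a different, independent transformation
    return df.filter(F.col("value") > 100).count()

def main():
    # one SparkSession / SparkContext shared by everything
    spark = SparkSession.builder.appName("parallel-branches").getOrCreate()

    shared_df = first_function(spark)

    # run the two independent branches concurrently from driver-side
    # threads; both submit their jobs to the same SparkContext
    with ThreadPoolExecutor(max_workers=2) as pool:
        second_future = pool.submit(second_function, shared_df)
        third_future = pool.submit(third_function, shared_df)
        second_result = second_future.result()
        third_result = third_future.result()

    print(second_result)
    print(third_result)

if __name__ == "__main__":
    main()

Would sharing one SparkSession across driver threads like this be safe and actually run the two branches' jobs concurrently, or is there a more idiomatic way to do it in Spark?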