I am using PySpark to process data and generate a set of metrics (around 25-30). The generation of each metric is independent of the others. Due to company constraints I am not able to paste the actual code, but my code flow is outlined below:
def metric1_job(df1, df2, df3, df4, df5):
    # some operations
    # write data from the above df

def metric2_job(df1, df2, df3, df4, df5):
    # some operations
    # write data from the above df

def metric3_job(df1, df2, df3, df4, df5):
    .
    .
    .

# metric4_job through metric24_job follow the same pattern

def metric25_job(df1, df2, df3, df4, df5):
    # some operations
    # write data from the above df

if __name__ == "__main__":
    # read df1, df2, df3, df4, df5
    # some operations on the above DataFrames
    metric1_job(df1, df2, df3, df4, df5)
    metric2_job(df1, df2, df3, df4, df5)
    metric3_job(df1, df2, df3, df4, df5)
    .
    .
    .
    metric25_job(df1, df2, df3, df4, df5)
Right now Spark blocks at the write inside each function and only then starts processing the DAG of the next function. All of these DAGs are independent of each other. One obvious solution is to split them into separate files and run them as separate jobs, but that option is not available to me. Can someone tell me how I can make Spark run these DAGs in parallel, and write in parallel as well?
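To make the question concrete, the kind of parallel submission I have in mind is something like the sketch below, using driver-side threads to trigger each metric's write concurrently against the same SparkSession (metricN_job and df1..df5 are just the placeholder names from the flow above, and I am not sure this is the right approach, which is part of what I am asking):

from concurrent.futures import ThreadPoolExecutor

# Sketch only: each thread triggers the write (an action) of one metric job,
# all sharing the same SparkSession, so the independent DAGs can run concurrently.
jobs = [metric1_job, metric2_job, metric3_job]  # ... through metric25_job

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(job, df1, df2, df3, df4, df5) for job in jobs]
    for f in futures:
        f.result()  # block until each write finishes and surface any failure

I am also unsure whether I would need to cache the shared DataFrames, or set spark.scheduler.mode=FAIR so the concurrent jobs share executors fairly, hence the question.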
I would deeply appreciate any help; because of the serial processing, the job is taking far too long.
Thanks in advance
Manish