We have a scenario where we read data from multiple source tables, join them according to business rules, and apply mappings. In some cases, the data read from a few of the tables can be reused for multiple target loads. To avoid reading the same data multiple times when running through different modules, is there any option to reuse the same DataFrame output across different PySpark modules?
df1 = spark.sql("SELECT * FROM table1")
df2 = spark.sql("SELECT * FROM table2")
df_out = df1.join(df2, ['customer_id'], 'inner')
I want to use df_out in pyspark_module1.py and also in pyspark_module2.py. Is there any way to achieve this without reading the same data multiple times, given that the programs run in parallel through a scheduling tool?
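For context, this is roughly what both modules do today (the table, column, and module names are just the ones from my example above); each module rebuilds df_out independently, and that duplicate read/join is what I am trying to avoid:

# pyspark_module1.py  (pyspark_module2.py currently does the same thing)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark_module1").getOrCreate()

# Each module re-reads and re-joins the same source tables
df1 = spark.sql("SELECT * FROM table1")
df2 = spark.sql("SELECT * FROM table2")
df_out = df1.join(df2, ['customer_id'], 'inner')

# ... module-specific mapping and target load continue from df_out here ...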