
We have a scenario where we read data from multiple source tables, join them according to business rules, and apply mappings. In some cases, the data read from a few tables can be used for multiple target loads. To avoid reading the same data multiple times when running through different modules, is there any option to reuse the same DataFrame output in different PySpark modules?

df1 = spark.sql("select * from table1")
df2 = spark.sql("select * from table2")

df_out = df1.join(df2, ['customer_id'], 'inner')

I want to use df_out in pyspark_module1.py and also in pyspark_module2.py. Is there any way to achieve this without reading the same data multiple times, given that we are running the programs in parallel through a scheduling tool?

Rocky1989

2 Answers


This is where cache() and persist() come into the picture. cache() holds your data in memory (the default) for as long as the Spark application is executing, while persist() lets you extend that choice to other storage levels (disk, memory, or a combination). Full read here and here.
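
For illustration, a minimal sketch of the difference, reusing the table names from the question (the app name and commented storage level are arbitrary choices):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache_persist_demo").getOrCreate()

df1 = spark.sql("select * from table1")
df2 = spark.sql("select * from table2")
df_out = df1.join(df2, ['customer_id'], 'inner')

# cache() keeps the joined result at the default storage level for the
# lifetime of this Spark application
df_out.cache()

# persist() lets you pick the storage level explicitly, e.g. memory plus disk
# df_out.persist(StorageLevel.MEMORY_AND_DISK)

# the first action materialises the cache; subsequent actions reuse it
df_out.count()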

Now, coming to your question - you might need to revisit your application logic based on how you are implementing cache or persist.

If you write the join in a main function, and your module-1 and module-2 functions each invoke that main function, then even after caching into memory it might not be beneficial, because every function call will re-run the underlying logic for that particular call. So try, if you can, to keep everything in the same code (the same Spark application) and take the benefit of caching.
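
A minimal sketch of that idea, assuming hypothetical load_target1 / load_target2 functions exposed by pyspark_module1.py and pyspark_module2.py that accept a DataFrame: do the read and join once in a single driver, cache the result, and pass the same DataFrame to both modules so the cached data is reused instead of being re-read per run.

# driver.py - one Spark application feeding both target loads
from pyspark.sql import SparkSession

# hypothetical functions that take a DataFrame and write their target table
from pyspark_module1 import load_target1
from pyspark_module2 import load_target2

spark = SparkSession.builder.appName("shared_join_driver").getOrCreate()

df1 = spark.sql("select * from table1")
df2 = spark.sql("select * from table2")
df_out = df1.join(df2, ['customer_id'], 'inner').cache()

# both loads run inside the same SparkSession, so the second call reads
# df_out from the cache instead of re-reading table1 and table2
load_target1(df_out)
load_target2(df_out)

df_out.unpersist()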

dsk

You could join and pre-process the data to a degree, save it via bucketBy, and then run the downstream loads in parallel on this pre-joined and pre-processed data.

This guide, https://luminousmen.com/post/the-5-minute-guide-to-using-bucketing-in-pyspark, provides guidance, as do the Spark docs.
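
A rough sketch of that approach (the bucket count and the table name staging.customer_joined are only illustrative): one preparation job writes the pre-joined result as a bucketed table, and each downstream module then reads that table instead of redoing the read and join.

# prepare_step.py - run once: join, pre-process and save a bucketed table
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("prepare_joined_data")
         .enableHiveSupport()
         .getOrCreate())

df1 = spark.sql("select * from table1")
df2 = spark.sql("select * from table2")
df_out = df1.join(df2, ['customer_id'], 'inner')

(df_out.write
    .mode("overwrite")
    .bucketBy(50, "customer_id")   # bucket on the join key
    .sortBy("customer_id")
    .saveAsTable("staging.customer_joined"))

# pyspark_module1.py / pyspark_module2.py - scheduled in parallel later;
# each one just reads the pre-joined bucketed table
df_ready = spark.table("staging.customer_joined")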

thebluephantom