Why does my Spark job perform two database reads, and how can I avoid it?

Question

My data workflow is:

rawDf -> modifiedDf -> rollUpDf -> union(modifiedDf, rollUpDf) -> save

The performance was not good enough.

I found two database read actions: one in stage 60 (which generates rollUpDf) and one in stage 61. I don't understand why it needs to read the database twice, since both modifiedDf and rollUpDf come from the same source.

score 1 · Accepted Answer · answered Mar 22 '18 at 14:54

One way to improve performance is to call `rawDf.cache()` so that the data is retrieved from the database only once, and then derive the modified DataFrame and the roll-up DataFrame from it. This helps you avoid reading the data from the database twice. Reference: "(Why) do we need to call cache or persist on a RDD".
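A minimal PySpark sketch of that idea, assuming `raw_df` is the DataFrame read from the database. The function name `build_output`, the column names `region` and `amount`, and the filter and aggregation steps are hypothetical stand-ins for the real modify and roll-up logic, which the question does not show.

```python
def build_output(raw_df, group_col="region", value_col="amount"):
    """Derive both branches from one cached source and union them.

    raw_df is assumed to be a Spark DataFrame read from the database.
    group_col and value_col are hypothetical column names.
    """
    # Mark the source for caching. This is lazy: the rows are stored the
    # first time an action runs, and later branches reuse the stored copy.
    cached_df = raw_df.cache()

    # Hypothetical "modify" step, standing in for the real transformation.
    modified_df = cached_df.where(f"{value_col} IS NOT NULL").select(
        group_col, value_col
    )

    # Hypothetical roll-up, renamed so both branches share the same columns.
    roll_up_df = (
        modified_df.groupBy(group_col)
        .agg({value_col: "sum"})
        .withColumnRenamed(f"sum({value_col})", value_col)
    )

    # union() matches columns by position, so both sides use the same order.
    return cached_df, modified_df.union(roll_up_df)
```

The caller would then save the returned union (the action that triggers the single read) and call `unpersist()` on the returned cached frame afterwards. Since both branches start from the modified DataFrame, caching that frame instead of `raw_df` should also avoid repeating the modify step.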

Rumesh Krishnan

Does `df.cache()` store the DataFrame even for future jobs? In my use case, it's possible that multiple users send aggregation requests at the same time (from the same table, but possibly with different filters). `df.cache()` will then be called many times; does that mess up the stored data? – LN.EXE Mar 22 '18 at 15:23
Caching happens for every batch; as soon as the batch process completes, the cache disappears. – Rumesh Krishnan Mar 22 '18 at 15:26
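To make the cache's lifetime explicit for each request rather than rely on when it is dropped, one option is to scope it to a block. This is a sketch only: the helper name `cached` is hypothetical, while `cache()` and `unpersist()` are the standard DataFrame methods.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it."""
    df.cache()  # lazy: data is stored when the first action runs
    try:
        yield df
    finally:
        # Runs even if the block raises, so the cache is always released.
        df.unpersist()
```

A request handler could then wrap its work as `with cached(rawDf) as df: ...`, so the stored data is released as soon as that request's save finishes.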
