My data workflow is:
rawDf -> modifiedDf -> rollUpDf -> union(modifiedDf, rollUpDf) -> save
The performance was not good enough.
I found two database read actions: one in stage 60 (which generates rollUpDf) and one in stage 61. I don't understand why the database is read twice, since both modifiedDf and rollUpDf are derived from the same source.
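
One possible explanation is lazy evaluation: each branch of the union walks its own lineage back to the source, so the source is read once per branch unless an intermediate result is cached. The sketch below is a toy model in plain Python, not Spark code. `LazyFrame`, `read_database`, and `run_pipeline` are hypothetical names made up for illustration, and the three-row dataset stands in for the real table.

```python
# Toy model of lazy evaluation. This is NOT Spark; every name here is hypothetical.

class LazyFrame:
    """Holds a recipe (a function) instead of data, and runs it on collect()."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        # The child frame re-runs the parent's recipe every time it is collected.
        return LazyFrame(lambda: [fn(row) for row in self.collect()])

    def union(self, other):
        # Each side of the union is evaluated independently.
        return LazyFrame(lambda: self.collect() + other.collect())

    def cache(self):
        # Mark this frame so its first computed result is kept and reused.
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


reads = {"count": 0}


def read_database():
    # Stand-in for the database read; counts how often it is triggered.
    reads["count"] += 1
    return [1, 2, 3]


def run_pipeline(use_cache):
    reads["count"] = 0
    raw_df = LazyFrame(read_database)
    modified_df = raw_df.map(lambda x: x * 10)
    if use_cache:
        modified_df.cache()
    # The roll-up is derived from modified_df, as in the workflow above.
    roll_up_df = LazyFrame(lambda: [sum(modified_df.collect())])
    result = modified_df.union(roll_up_df).collect()
    return result, reads["count"]


if __name__ == "__main__":
    print(run_pipeline(use_cache=False))
    print(run_pipeline(use_cache=True))
```

In this model the uncached pipeline triggers the read twice (once for each side of the union), while the cached one triggers it once. If the real job behaves the same way, calling `cache()` or `persist()` on modifiedDf before deriving rollUpDf might remove the second read, though that depends on memory and on how the plan is actually executed.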