Compute the final dataframe ready for use

Asked Apr 16 '20 at 13:25

Active Apr 16 '20 at 13:25

Viewed 15 times

I have a big job that joins several tables together and eventually produces an aggregated table for a final report. But every time I fetch the final summary table with some filters, the job takes a really long time to finish, and I believe this is because of Spark's lazy evaluation. Is there a way to evaluate the final summary table first, so that each later filter on the summary is faster?

I know that writing the summary table to storage and reading it back would solve the problem, but if I don't want to write it out and read it back, is there any other way?

Yi Du

Does this answer your question? [What is the difference between cache and persist?](https://stackoverflow.com/questions/26870537/what-is-the-difference-between-cache-and-persist) – Grzegorz Skibinski Apr 16 '20 at 13:28
In the simple case, ```df = df.cache()``` will do, but depending on your environment, ```persist``` with a specific storage level might be a better idea... – Grzegorz Skibinski Apr 16 '20 at 13:29

0 Answers