In our PySpark applications, we run iterative algorithms on very large dataframes (many billions of data points) and need to persist() a lot. Spark's automatic cache eviction doesn't seem to work for us (the wrong data gets evicted first), so we are trying to manually unpersist() certain intermediate dataframes. I would appreciate answers to the following two questions:
How do we know at what point a particular intermediate dataframe can be unpersisted? For instance, say there is a previously persisted dataframe df1 and I derive another dataframe df2 from it (or from it and something else via join()). Is it safe to unpersist df1 right after calling df2.persist()? If not (which I suspect is the case), can toDF() or createDataFrame() be used to decouple df2 from df1?
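To make the first question concrete, here is roughly the pattern I have in mind (the dataframes, sizes and the join key are purely illustrative, not our real pipeline):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # df1 is an intermediate result that several later steps reuse
    df1 = spark.range(10**9).withColumnRenamed("id", "key")
    df1.persist(StorageLevel.MEMORY_AND_DISK)
    df1.count()  # action that actually materializes df1's cache

    # df2 is derived from df1 (here via a join with a small lookup table)
    lookup = spark.range(1000).withColumnRenamed("id", "key")
    df2 = df1.join(lookup, on="key")
    df2.persist(StorageLevel.MEMORY_AND_DISK)

    # Question 1: is it safe to do this right here, or only after an action
    # such as df2.count() has populated df2's cache?
    df1.unpersist()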
Sometimes, we would like to release the memory of all but a few previously persisted dataframes. Our guessed solution is to call
spark.catalog.clearCache()
and immediately follow it with a new persist() on the dataframes we'd like to keep in memory, but this is just a guess, and I don't have any solid justification for it.
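In code, that guess would look roughly like this (keep_us is a hypothetical list holding the few dataframes we want to stay cached; as far as I understand, persist() is lazy, so I also trigger an action to re-materialize them):

    # keep_us is a hypothetical list of the few dataframes we want to keep cached
    keep_us = [df2]

    # drop every cached dataframe/table in this Spark session
    spark.catalog.clearCache()

    # immediately re-mark the keepers for caching; since persist() is lazy,
    # an action (count() here) is needed before the data is back in memory,
    # and it gets recomputed from its lineage
    for df in keep_us:
        df.persist(StorageLevel.MEMORY_AND_DISK)
        df.count()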
So, what is the right way?
Thanks in advance