I wrote a while loop that updates a PySpark DataFrame in each iteration. Unfortunately, if I let the loop run for enough iterations, I eventually hit an out-of-memory error.
from pyspark import StorageLevel

condition = True
while condition:
    # Calculate the next state of the DataFrame
    df = <Do a ton of joins, groupBys and filters here>
    # Persist so that the newest `df` doesn't have to be recomputed from scratch
    df.persist(StorageLevel.DISK_ONLY).count()
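For reference, here is a minimal self-contained version of the pattern that I believe reproduces the behaviour. The toy data, the modulo column, and the fixed iteration count are stand-ins for my real inputs, joins/groupBys/filters, and stopping condition:

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Toy starting DataFrame; the real one is built from several source tables
df = spark.range(0, 100000).withColumn("value", F.col("id") % 7)

for i in range(50):  # stand-in for `while condition:`
    # Stand-in for the real joins, groupBys and filters that redefine `df`
    counts = df.groupBy("value").agg(F.count("*").alias("cnt"))
    df = df.join(counts, on="value").filter(F.col("cnt") > 0).drop("cnt")
    # Persist so the newest `df` doesn't have to be recomputed from scratch
    df.persist(StorageLevel.DISK_ONLY).count()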
A fellow engineer told me that although only the latest version of df is persisted, its previous states are still held in memory, and because the object is still referenced, the garbage collector never frees them. Is this true? If so, how can I delete the previous states of df?
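In case it clarifies what I mean by "deleting the previous states": would something like the following be the right approach, i.e. keeping a handle to the previous iteration's DataFrame and unpersisting it once the new one has been materialized? This is only a sketch of the idea (continuing the toy example above), not something I have been able to verify yet:

prev_df = None
for i in range(50):
    counts = df.groupBy("value").agg(F.count("*").alias("cnt"))
    df = df.join(counts, on="value").filter(F.col("cnt") > 0).drop("cnt")
    # Materialize the new state first, then release the previous one
    df.persist(StorageLevel.DISK_ONLY).count()
    if prev_df is not None:
        prev_df.unpersist()  # free the cached blocks from the previous iteration
    prev_df = df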
I realize that while loops and PySpark don't mix well. However, I am transcribing an algorithm that is iterative by nature, and I'm convinced this is the only way to implement it.
I will post the error in a later edit (my session crashed and I cannot retrieve it).