I wrote a while loop that updates a PySpark DataFrame in each iteration. Unfortunately, if I let the loop run for enough iterations, I eventually hit an out-of-memory error.
from pyspark import StorageLevel

condition = True
while condition:
    # Calculate the next state of the DataFrame
    df = <Do a ton of joins, groupBys and filters here>
    # Persist so that the newest `df` doesn't have to be recomputed from scratch
    df.persist(StorageLevel.DISK_ONLY).count()
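For reference, here is a minimal self-contained version of the pattern that I believe reproduces the behaviour. The toy data, the modulo column, and the fixed iteration count are stand-ins for my real inputs, joins/groupBys/filters, and stopping condition:

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Toy starting DataFrame; the real one is built from several source tables
df = spark.range(0, 100000).withColumn("value", F.col("id") % 7)

for i in range(50):  # stand-in for `while condition:`
    # Stand-in for the real joins, groupBys and filters that redefine `df`
    counts = df.groupBy("value").agg(F.count("*").alias("cnt"))
    df = df.join(counts, on="value").filter(F.col("cnt") > 0).drop("cnt")
    # Persist so the newest `df` doesn't have to be recomputed from scratch
    df.persist(StorageLevel.DISK_ONLY).count()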
A fellow engineer told me that although only the latest version of df is persisted, its previous states are still held in memory, and because the object is still referenced, the garbage collector never frees them. Is this true? If so, how can I delete the previous states of df?
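In case it clarifies what I mean by "deleting the previous states": would something like the following be the right approach, i.e. keeping a handle to the previous iteration's DataFrame and unpersisting it once the new one has been materialized? This is only a sketch of the idea (continuing the toy example above), not something I have been able to verify yet:

prev_df = None
for i in range(50):
    counts = df.groupBy("value").agg(F.count("*").alias("cnt"))
    df = df.join(counts, on="value").filter(F.col("cnt") > 0).drop("cnt")
    # Materialize the new state first, then release the previous one
    df.persist(StorageLevel.DISK_ONLY).count()
    if prev_df is not None:
        prev_df.unpersist()  # free the cached blocks from the previous iteration
    prev_df = df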
I realize that while loops and PySpark don't mix well. However, I am transcribing an algorithm that is iterative by nature, and I'm convinced this is the only way to implement it.
I will post the error in a later edit (my session crashed and I cannot retrieve it).