I am doing something like this:
import pandas as pd

# `spark` is an existing SparkSession (e.g. from the pyspark shell or a notebook)
pdf = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['a', 'b', 'c']
})

parent_df = spark.createDataFrame(pdf)
parent_df.cache().count()               # cache the parent and materialize it with an action

child_df = parent_df.replace('c', 'x')
child_df.cache().count()                # cache the child and materialize it as well

parent_df.unpersist()                   # release the parent from the cache
Essentially, I want to cache parent_df because in the next steps I am doing some heavy transformations on it. Once I finish those and get child_df back, I no longer need parent_df and want to release it from the cache. However, doing this also unpersists the freshly cached child_df!
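For reference, this is how I checked the cache state right after that last unpersist() call, continuing in the same session as the snippet above (is_cached is a standard DataFrame property):

# run right after parent_df.unpersist()
print(parent_df.is_cached)   # False - expected, I just unpersisted it
print(child_df.is_cached)    # False - unexpected, the child's cache is gone too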
So obviously, the questions are:
- why does this happen?
- how can I accomplish what I want (releasing parent_df from the cache while keeping the new child_df in the cache)?
Interestingly, the opposite scenario works: if I unpersist child_df instead of parent_df on the last line, parent_df remains cached as expected, while child_df is released.
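In other words, this variant behaves as I would expect (re-running the setup above, but replacing the final parent_df.unpersist() line):

# opposite order: release the child instead of the parent
child_df.unpersist()
print(parent_df.is_cached)   # True  - the parent's cache survives
print(child_df.is_cached)    # False - only the child was released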
PS: I found a similar question here: Understanding Spark's caching. However, the answer there does not seem to apply in this case, since here we are already calling an action (.count()) right after caching.