When we create a DataFrame from Python objects (e.g. a dict or a list), the underlying Python data would normally be collected by the GC once it goes out of scope. But since a DataFrame is lazy, it presumably has to keep a reference to the source data, so every such Python object stays alive in the interpreter's memory. If those objects are really big, this could crash the Spark job because the Python interpreter can't get enough memory. Am I right?
def get_df():
    # d should be eligible for garbage collection after the function returns
    d = [{'a': 1}]  # createDataFrame expects an iterable of rows, not a bare dict
    df = spark.createDataFrame(d)
    return df

df = get_df()
# is d still kept alive in the interpreter's memory because df is lazy?
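One way to check this empirically is with a weak reference: if the weakref dies after the function returns (and a GC pass), Spark no longer holds the Python source object. This is only a minimal sketch assuming a local SparkSession; TrackedList is just a hypothetical helper, needed because a plain list cannot be weakly referenced.

import gc
import weakref
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class TrackedList(list):
    """list subclass so we can attach a weak reference to the source data."""
    pass

def get_df():
    d = TrackedList([{'a': 1}])   # local source data
    ref = weakref.ref(d)          # becomes dead only when d is garbage-collected
    df = spark.createDataFrame(d)
    return df, ref

df, ref = get_df()
gc.collect()
print(ref() is None)  # True  -> the Python source object was released
                      # False -> something (e.g. the DataFrame) still references it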