Let's say I have this code:
from pyspark import StorageLevel
from pyspark.sql import DataFrame

def func1():
    # some code to create a dataframe df
    df.persist(StorageLevel.MEMORY_AND_DISK)
    return df.repartition("col1", "col2")

def func2(df: DataFrame):
    df = (
        df.select("col1", "col2")
        .groupby("col1")
        .count()
        .withColumnRenamed("count", "count_col1")
    )
    return df
So, when I pass the variable 'df' to func2, is it passed by reference or by value? The repartition() that I apply in func1 should help performance for the groupBy in func2, right? Similarly, if I call persist() in func1, the DataFrame is saved in memory, and when I refer to df in func2() it is read from that same location, where it was saved only once in func1(). Is that correct?
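To make the first question concrete, here is a minimal sketch in plain Python with no Spark involved. FakeFrame is a hypothetical stand-in class, not a PySpark type, and the function names are made up for illustration. It only checks that a function receives the same object the caller holds rather than a copy; whether that carries over to the persisted data in Spark is the part I am unsure about.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame handle (not a PySpark class)."""

    def __init__(self, name):
        self.name = name


def make_frame():
    # Plays the role of func1: builds an object and returns it
    return FakeFrame("df")


def use_frame(frame):
    # Plays the role of func2: keep a handle on what was passed in,
    # then rebind the local name, as func2 does with `df = (...)`
    received = frame
    frame = FakeFrame("other")  # only the local name changes
    return received


def same_object(candidate, original):
    # True when both names refer to the very same Python object
    return candidate is original


original = make_frame()
received = use_frame(original)

print(same_object(received, original))  # identity check, no copy was made
print(original.name)                    # the caller's object is untouched by the rebinding
```

In this sketch the argument is the same object inside and outside the function, and reassigning the parameter inside the function does not change what the caller's variable points to.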
Thanks!