Suppose I have a dataframe with numerous columns and one of the columns is id.

Suppose that, in a single function, I perform several pandas groupby("id") operations, e.g.:

def func(df):
  df["val1_cumsum"] = df.groupby("id")["val1"].cumsum()
  df["val2_cumsum"] = df.groupby("id")["val2"].cumsum()
  df["val3_cumsum"] = df.groupby("id")["val3"].cumsum()

Do the second and third groupby calls actually do a full groupby like the first one, or is there some native caching in pandas that says "we just did this, let's use the previous result"?

In other words, is the above less performant than:

def func(df):
  df_groupby_id = df.groupby("id")
  df["val1_cumsum"] = df_groupby_id["val1"].cumsum()
  df["val2_cumsum"] = df_groupby_id["val2"].cumsum()
  df["val3_cumsum"] = df_groupby_id["val3"].cumsum()
Ynjxsjmh
24n8

1 Answer

pandas does not cache groupby results between calls. Every call to df.groupby("id") creates a new groupby object, so in your first example the grouping by id is worked out again for each of the three cumsum calls.

The second example, where you store the groupby object in a variable, is the more efficient of the two: the grouping is set up once and the same object is reused for all three column selections, so it should be faster and less resource-intensive. The saving is the repeated grouping work only; the cumsum itself still runs once per column, so how large the difference is depends on your data.
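
If you want to check this on your own data, here is a rough sketch that times both versions and confirms they give the same result. The synthetic frame (row count, number of ids, seed) and the helper names make_df, repeated_groupby, reused_groupby and single_call are my own assumptions for illustration, not anything from your code. The third variant, which selects all three columns and calls cumsum once, is an extra option I added for comparison.

```python
import timeit

import numpy as np
import pandas as pd

COLS = ["val1", "val2", "val3"]


def make_df(n_rows=200_000, n_ids=1_000, seed=0):
    """Build a synthetic frame; the sizes and seed are arbitrary assumptions."""
    rng = np.random.default_rng(seed)
    data = {"id": rng.integers(0, n_ids, size=n_rows)}
    for col in COLS:
        data[col] = rng.random(n_rows)
    return pd.DataFrame(data)


def repeated_groupby(df):
    """First pattern from the question: a fresh groupby for every column."""
    out = df.copy()
    for col in COLS:
        out[f"{col}_cumsum"] = df.groupby("id")[col].cumsum()
    return out


def reused_groupby(df):
    """Second pattern from the question: one groupby object, reused."""
    out = df.copy()
    grouped = df.groupby("id")
    for col in COLS:
        out[f"{col}_cumsum"] = grouped[col].cumsum()
    return out


def single_call(df):
    """Extra variant: one groupby and one cumsum over all three columns."""
    cums = df.groupby("id")[COLS].cumsum().add_suffix("_cumsum")
    return df.join(cums)


if __name__ == "__main__":
    df = make_df()
    for fn in (repeated_groupby, reused_groupby, single_call):
        seconds = timeit.timeit(lambda: fn(df), number=5)
        print(f"{fn.__name__}: {seconds:.3f} s for 5 runs")
```

The timings will vary with the number of rows, the number of distinct ids, and your pandas version, so treat the printed numbers as indicative only.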

Boy Nandi