I have a dataframe that contains 100,000 products and their time series (row-wise, not 100,000 columns). I need to calculate 40 columns that offer various time lags, differences, or rolling averages for each product. Here, for simplicity, just a single "result" column that calculates the difference between the current "value" and the value at lag 1. For dates earlier than a given deadline (2022-03-01), the value of "result" should come from column "diffold". I'm using groupby, combined with transform and lambda. This gets the job done, but is slow as a snail when it deals with something larger than a toy dataset: 1000s of products, 100s of dates, and 40 different lambda functions, one for each of the columns "result01" to "result40", take hours to process. Is there a faster alternative to transform(lambda) for filling the column "result" with shifted, differenced or rolling values, which "result" should take from column "values"? Here my code:
df = pd.DataFrame(
{"product":["prod1", "prod1", "prod1", "prod2", "prod2", "prod2", "prod3", "prod3", "prod3"],
"date": [pd.Timestamp("20220301"), pd.Timestamp("20220302"), pd.Timestamp("20220303") ] *3,
"value": range(9),
"diffold": range(9)
}
)
df["result"] = np.where(
df["date"].values > pd.Timestamp("20220301"),
df.groupby("product")["value"].transform(lambda v: v.diff(1)),
df["diffold"]
)