0

I have a dataframe that contains 100,000 products and their time series (row-wise, not 100,000 columns). I need to calculate 40 columns that offer various time lags, differences, or rolling averages for each product. Here, for simplicity, just a single "result" column that calculates the difference between the current "value" and the value at lag 1. For dates earlier than a given deadline (2022-03-01), the value of "result" should come from column "diffold". I'm using groupby, combined with transform and lambda. This gets the job done, but is slow as a snail when it deals with something larger than a toy dataset: 1000s of products, 100s of dates, and 40 different lambda functions, one for each of the columns "result01" to "result40", take hours to process. Is there a faster alternative to transform(lambda) for filling the column "result" with shifted, differenced or rolling values, which "result" should take from column "values"? Here my code:

df = pd.DataFrame(
            {"product":["prod1", "prod1", "prod1",  "prod2", "prod2", "prod2",  "prod3", "prod3", "prod3"], 
            "date": [pd.Timestamp("20220301"), pd.Timestamp("20220302"), pd.Timestamp("20220303") ] *3,
            "value": range(9), 
            "diffold": range(9)
            }
                    )


df["result"] = np.where( 
                        df["date"].values > pd.Timestamp("20220301"),  
                        df.groupby("product")["value"].transform(lambda v: v.diff(1)),   
                        df["diffold"] 
                        )
kerfuffle
  • 45
  • 5
  • Wlecome here. [groupby.transform](https://stackoverflow.com/questions/63406167/pandas-transform-method-performing-slow) is unfortunately slower than other methods. I believe you will have to rethink your operations so that you don't need `transform`. For example, the operation here can be done faster with `df.groupby("product")["value"].diff(1)` – Ben.T Mar 09 '22 at 20:08
  • Also, it seems that `diff` is not that efficient. you can do faster with shift and a substraction: `df['value'] - df.groupby("product")["value"].shift(1)`, especially if you already created the shift column before, you can reuse the result to calculate the difference – Ben.T Mar 09 '22 at 20:44
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Mar 20 '22 at 19:39

0 Answers0