
I would like to group my df by the variable "cod_id" and then apply this function to each group:

[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days=7)), 'quantity'].sum()
 for d in df['dt_op']]

Moving from this df:

print(df)
dt_op      quantity      cod_id
20/01/18      1            613
21/01/18      8            611
21/01/18      1            613 
...
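
For reference, a minimal reconstruction of this sample frame (a sketch; the day-first date format and dtypes are assumptions):

import pandas as pd

# Hypothetical reconstruction of the sample df above; dates parsed day-first
df = pd.DataFrame({
    "dt_op": pd.to_datetime(["20/01/18", "21/01/18", "21/01/18"], dayfirst=True),
    "quantity": [1, 8, 1],
    "cod_id": [613, 611, 613],
})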

To this one, where Final_Quantity for each row is the sum of quantity over the next n = 7 days (the row's own date included) within the same cod_id:

print(final_df)
n = 7

dt_op      quantity   product_code     Final_Quantity
20/01/18      1            613               2
21/01/18      8            611               8
25/01/18      1            613               1
...

I tried with:

def lookforward(x):
    # For each row, sum 'quantity' over the window [dt_op, dt_op + 7 days]
    # within the group; itertuples keeps this a slow Python-level loop
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)),
               'quantity'].sum()
         for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)

s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)

df['Final_Quantity'] = s

print(df)

       dt_op  quantity  cod_id  Final_Quantity
0 2018-01-20         1     613               2
1 2018-01-21         8     611               8
2 2018-01-21         1     613               1

But it is not an efficient solution: it is computationally slow.

How can I improve its performance? I would be happy with new code or a new function that leads to the same result.

EDIT:

A subset of the original dataset, with just one product (cod_id == 2), on which I tried to run the code provided by w-m:

print(df)

    cod_id  dt_op          quantita  final_sum
0        2 2017-01-03         1       54.0
1        2 2017-01-04         1       53.0
2        2 2017-01-13         1       52.0
3        2 2017-01-23         2       51.0
4        2 2017-01-26         1       49.0
5        2 2017-02-03         1       48.0
6        2 2017-02-27         1       47.0
7        2 2017-03-05         1       46.0
8        2 2017-03-15         1       45.0
9        2 2017-03-23         1       44.0
10       2 2017-03-27         2       43.0
11       2 2017-03-31         3       41.0
12       2 2017-04-04         1       38.0
13       2 2017-04-05         1       37.0
14       2 2017-04-15         2       36.0
15       2 2017-04-27         2       34.0
16       2 2017-04-30         1       32.0
17       2 2017-05-16         1       31.0
18       2 2017-05-18         1       30.0
19       2 2017-05-19         1       29.0
20       2 2017-06-03         1       28.0
21       2 2017-06-04         1       27.0
22       2 2017-06-07         1       26.0
23       2 2017-06-13         2       25.0
24       2 2017-06-14         1       23.0
25       2 2017-06-20         1       22.0
26       2 2017-06-22         2       21.0
27       2 2017-06-28         1       19.0
28       2 2017-06-30         1       18.0
29       2 2017-07-03         1       17.0
30       2 2017-07-06         2       16.0
31       2 2017-07-07         1       14.0
32       2 2017-07-13         1       13.0
33       2 2017-07-20         1       12.0
34       2 2017-07-28         1       11.0
35       2 2017-08-06         1       10.0
36       2 2017-08-07         1        9.0
37       2 2017-08-24         1        8.0
38       2 2017-09-06         1        7.0
39       2 2017-09-16         2        6.0
40       2 2017-09-20         1        4.0
41       2 2017-10-07         1        3.0
42       2 2017-11-04         1        2.0
43       2 2017-12-07         1        1.0
Alessandro Ceccarelli

2 Answers


Edit 2018-10-17: this approach doesn't work, because forward-rolling functions on sparse time series are not currently supported by pandas; see the comments.

Using for loops can be a performance killer when doing pandas operations.

The for loop over the rows plus their 7-day timedelta can be replaced with a `.rolling("7D")`. To get a forward-rolling window (current date + 7 days), we reverse the df by date, as shown here.

Then no custom function is required anymore, and you can just take `.quantity.sum()` from the groupby.

# Reversing by date turns the trailing 7-day window into a forward-looking one
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
              .rolling("7D", on="dt_op").quantity.sum()

cod_id  dt_op     
611     2018-01-21    8.0
613     2018-01-21    1.0
        2018-01-20    2.0
Name: quantity, dtype: float64

result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum  # aligns on the (cod_id, dt_op) MultiIndex
result.reset_index()

   cod_id      dt_op  quantity  final_sum
0     613 2018-01-20         1        2.0
1     611 2018-01-21         8        8.0
2     613 2018-01-21         1        1.0
w-m
  • @jpp I've added a solution which requires index setting, as `'RollingGroupby' object has no attribute 'transform'`. Can you think of a nicer way to do this? – w-m Oct 16 '18 at 09:36
  • No, I was trying to use `GroupBy` + `transform` with `rolling` but failed. Fyi, I don't think your result matches OP now for some reason. – jpp Oct 16 '18 at 09:37
  • @w-m it does not seem to be working; have a look at "final_df": it should sum - in advance - the next 7 days. So, for cod_id 613, the sum should be 2 on the 20th and 1 on the 21st – Alessandro Ceccarelli Oct 16 '18 at 09:49
  • Best way to do the forward-rolling seems to be just reversing the date index. I've added the part (using `sort_values` instead of `.iloc[::-1]`). – w-m Oct 16 '18 at 10:07
  • @w-m, That seems to work. Happens so often, I wish Pandas would just allow negative windows! – jpp Oct 16 '18 at 10:15
  • @w-m I added a subset of the dataset, on which I'm trying to run your code, because it does not seem to be working properly. – Alessandro Ceccarelli Oct 17 '18 at 16:34
  • After looking into it - this actually can't work. `.rolling("7D")` on the inverted series (descending dates) is not supported by pandas - https://github.com/pandas-dev/pandas/issues/6772. When done without the `groupby`, pandas throws a `ValueError: dt_op must be monotonic`. The `groupby` seems to hide this error and pandas computes something that is wrong (which is a bug, I think). cc @jpp – w-m Oct 17 '18 at 17:46
  • I'll leave the answer up to refer to, but to solve your original question this approach won't work. I think you will need some filling of the missing days, or a resampling into weeks. – w-m Oct 17 '18 at 17:47
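
A minimal sketch of the failure mode described in the comments above, using the sample df from the question (behavior of the pandas versions current at the time of this thread):

# The reversed dates make the `on` column non-monotonic for the
# offset-based window, so pandas rejects it outright:
rev = df.sort_values("dt_op", ascending=False)
rev.rolling("7D", on="dt_op").quantity.sum()
# -> ValueError: dt_op must be monotonic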

Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling over sparse dates is currently implemented (see the other answer for more details).

This answer attempts to work around both by resampling the data, filling in all days, and then joining the `quant_sum` column back onto the original data.

# Create a temporary df with all in between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
           .resample("D").asfreq().fillna(0) \
           .quantity.to_frame()

# Reverse (descending dates) and take a 7-row rolling sum; with every
# day filled in, 7 rows correspond to a forward-looking 7-day window
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
                            .iloc[::-1] \
                            .groupby("cod_id") \
                            .rolling(7, min_periods=1) \
                            .quantity.sum().astype(int)

# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
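
As a side note, pandas 1.0+ ships `pd.api.indexers.FixedForwardWindowIndexer`, which allows a forward-looking window without the date-reversal trick. A sketch on the day-filled `filled` frame from above (assumptions: pandas >= 1.0, and a window of 7 rows covering the current row plus the next 6):

import pandas as pd

# Forward-looking window of 7 rows; on the daily-resampled `filled` frame
# this is the current day plus the following 6 days
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=7)

filled["quant_sum"] = (
    filled.groupby("cod_id")["quantity"]
          .transform(lambda s: s.rolling(window=indexer, min_periods=1).sum())
          .astype(int)
)

This keeps the dates in ascending order, sidestepping the monotonicity problem entirely.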
w-m