Pandas groupby apply is taking too much time

Question

I have the following code:

import pandas as pd

xactions = pd.DataFrame({'user_wid': {0: 3305613,  1: 57,  2: 80,  3: 31,  4: 38,  5: 12,  6: 35,  7: 25,  8: 42,  9: 16}, 'user_name': {0: 'Ter',  1: 'Am',  2: 'Wi',  3: 'Ma',  4: 'St',  5: 'Ju',  6: 'De',  7: 'Ri',  8: 'Ab',  9: 'Ti'}, 'user_age': {0: 41,  1: 34,  2: 45,  3: 47,  4: 70,  5: 64,  6: 64,  7: 63,  8: 32,  9: 24}, 'user_gender': {0: 'Male',  1: 'Female',  2: 'Male',  3: 'Male',  4: 'Male',  5: 'Female',  6: 'Female',  7: 'Female',  8: 'Female',  9: 'Female'}, 'sale_date': {0: '2018-05-15',  1: '2020-02-28',  2: '2020-04-02',  3: '2020-05-09',  4: '2020-11-29',  5: '2020-12-14',  6: '2020-04-21',  7: '2020-06-15',  8: '2020-07-03',  9: '2020-08-10'}, 'days_since_first_visit': {0: 426,  1: 0,  2: 0,  3: 8,  4: 126,  5: 283,  6: 0,  7: 189,  8: 158,  9: 270}, 'visit': {0: 4, 1: 1, 2: 1, 3: 2, 4: 4, 5: 3, 6: 1, 7: 2, 8: 4, 9: 2}, 'num_user_visits': {0: 4,  1: 2,  2: 1,  3: 2,  4: 10,  5: 7,  6: 1,  7: 4,  8: 4,  9: 2}, 'product': {0: 13, 1: 2, 2: 2, 3: 2, 4: 5, 5: 5, 6: 1, 7: 8, 8: 5, 9: 4}, 'sale_price': {0: 10.0,  1: 0.0,  2: 41.3,  3: 41.3,  4: 49.95,  5: 74.95,  6: 49.95,  7: 5.0,  8: 0.0,  9: 0.0}, 'whether_member': {0: 0,  1: 0,  2: 0,  3: 0,  4: 0,  5: 0,  6: 0,  7: 0,  8: 0,  9: 0}})

def f(x):
   d = {}
   
   d['user_name'] = x['user_name'].max()
   d['user_age'] = x['user_age'].max()
   d['user_gender'] = x['user_gender'].max()
   
   d['last_visit_date'] = x['sale_date'].max()
   d['days_since_first_visit'] = x['days_since_first_visit'].max()
   d['num_visits_window'] = x['visit'].max()
   
   d['num_visits_total'] = x['num_user_visits'].max()
   d['products_used'] = x['product'].max()
   d['user_total_sales'] = (x['sale_price'].sum()).round(2)
   
   d['avg_spend_visit'] = (x['sale_price'].sum() / x['visit'].max()).round(2)
   d['membership'] = x['whether_member'].max()
   
   
   return pd.Series(d) 

users = xactions.groupby('user_wid').apply(f).reset_index()

It takes too long to execute, and I want to optimize the function above. Any suggestions would be appreciated.

Thanks in advance.

Can you update your post with a plain text sample of your dataframe, please? (and the data length :)) — Corralien, Jul 25 '21 at 20:10
Welcome to stackoverflow, please read [tour] and [mre] and in this case also: [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Andreas, Jul 25 '21 at 20:13

score 0 · Accepted Answer · answered Jul 25 '21 at 20:48

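Try named aggregation with agg instead of apply. It uses pandas' built-in aggregations rather than calling a Python function once per group: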

users2 = xactions.groupby("user_wid", as_index=False).agg(
    user_name=("user_name", "max"),
    user_age=("user_age", "max"),
    user_gender=("user_gender", "max"),
    last_visit_date=("sale_date", "max"),
    days_since_first_visit=("days_since_first_visit", "max"),
    num_visits_window=("visit", "max"),
    num_visits_total=("num_user_visits", "max"),
    products_used=("product", "max"),
    user_total_sales=("sale_price", "sum"),
    membership=("whether_member", "max"),
)
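# Compute the average from the unrounded total, as f() in the question does.
users2["avg_spend_visit"] = (
    users2["user_total_sales"] / users2["num_visits_window"]
).round(2)
# f() also rounds the total itself to 2 decimals.
users2["user_total_sales"] = users2["user_total_sales"].round(2)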
print(users2)

Prints:

   user_wid user_name  user_age user_gender last_visit_date  days_since_first_visit  num_visits_window  num_visits_total  products_used  user_total_sales  membership  avg_spend_visit
0        12        Ju        64      Female      2020-12-14                     283                  3                 7              5             74.95           0            24.98
1        16        Ti        24      Female      2020-08-10                     270                  2                 2              4              0.00           0             0.00
2        25        Ri        63      Female      2020-06-15                     189                  2                 4              8              5.00           0             2.50
3        31        Ma        47        Male      2020-05-09                       8                  2                 2              2             41.30           0            20.65
4        35        De        64      Female      2020-04-21                       0                  1                 1              1             49.95           0            49.95
5        38        St        70        Male      2020-11-29                     126                  4                10              5             49.95           0            12.49
6        42        Ab        32      Female      2020-07-03                     158                  4                 4              5              0.00           0             0.00
7        57        Am        34      Female      2020-02-28                       0                  1                 2              2              0.00           0             0.00
8        80        Wi        45        Male      2020-04-02                       0                  1                 1              2             41.30           0            41.30
9   3305613       Ter        41        Male      2018-05-15                     426                  4                 4             13             10.00           0             2.50
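
As a sanity check that the two approaches agree, here is a minimal sketch on a made-up three-row frame (the values and the reduced column set are assumptions for illustration, not the asker's data). The helper name f_small is hypothetical. It mirrors the shape of f from the question, while the agg version mirrors the answer and also rounds the total so it matches f.

```python
import pandas as pd

# Hypothetical miniature of the question's data: two users, three rows.
xactions = pd.DataFrame({
    "user_wid": [12, 12, 16],
    "sale_date": ["2020-12-01", "2020-12-14", "2020-08-10"],
    "visit": [1, 2, 1],
    "sale_price": [10.0, 20.5, 0.0],
})

def f_small(x):
    # Slow path: one Python call and one new Series per group.
    return pd.Series({
        "last_visit_date": x["sale_date"].max(),
        "num_visits_window": x["visit"].max(),
        "user_total_sales": round(x["sale_price"].sum(), 2),
    })

cols = ["sale_date", "visit", "sale_price"]
slow = xactions.groupby("user_wid")[cols].apply(f_small).reset_index()

# Fast path: named aggregation, then round the total once for all groups.
fast = xactions.groupby("user_wid", as_index=False).agg(
    last_visit_date=("sale_date", "max"),
    num_visits_window=("visit", "max"),
    user_total_sales=("sale_price", "sum"),
)
fast["user_total_sales"] = fast["user_total_sales"].round(2)

# The apply result has object dtype (mixed-type Series), so compare values.
same = all(slow[c].tolist() == fast[c].tolist() for c in fast.columns)
print(same)
```

Comparing values rather than whole frames sidesteps the dtype difference: apply with a mixed-type Series yields object columns, while agg keeps each column's native dtype.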

Yes, it is definitely faster by 6 times at least for my data. — Chinmay Das, Jul 25 '21 at 20:59
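
To measure the speed-up on your own machine, a rough timing sketch follows. It assumes synthetic data (the row and user counts are arbitrary, and only two of the columns are reproduced), so the ratio it prints will not necessarily match the figure reported in the comment above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real transactions table (sizes are arbitrary).
rng = np.random.default_rng(0)
n_rows, n_users = 5_000, 1_000
xactions = pd.DataFrame({
    "user_wid": rng.integers(0, n_users, n_rows),
    "visit": rng.integers(1, 5, n_rows),
    "sale_price": (rng.random(n_rows) * 100).round(2),
})

def per_group(x):
    # Same pattern as the question's f: build one Series per group.
    return pd.Series({
        "num_visits_window": x["visit"].max(),
        "user_total_sales": x["sale_price"].sum(),
    })

def with_apply():
    return xactions.groupby("user_wid")[["visit", "sale_price"]].apply(per_group)

def with_agg():
    return xactions.groupby("user_wid").agg(
        num_visits_window=("visit", "max"),
        user_total_sales=("sale_price", "sum"),
    )

# Total wall-clock time for three runs of each variant.
t_apply = timeit.timeit(with_apply, number=3)
t_agg = timeit.timeit(with_agg, number=3)
print(f"apply: {t_apply:.3f}s  agg: {t_agg:.3f}s")
```

The gap generally widens as the number of groups grows, because the apply variant pays the per-group Python overhead once for every user.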
