
I have a dataframe that can be generated with the code below:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'subject_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                    'colum': ['L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea', 'L3CreaDate',
                              'L3Crea', 'L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea'],
                    'dates': ['2016-10-30 00:00:00', 2.3, '2016-10-30 00:00:00', 2.5, np.nan,
                              np.nan, '2016-10-30 00:00:00', 12.3, '2016-10-30 00:00:00', 12.3]})

I am trying to run the operations below on this dataframe. The code works absolutely fine, but the issue is the groupby statement: it's quick on the sample dataframe, but on the real data, with over 1 million records, it runs for a very long time.

    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)

I do the groupby to get the count column shown in the output below, so that I can reject records with a count of 0. There is logic involved behind dropping the NAs; it's not just about dropping all of them. If you would like to know more about that, refer to this post: retain few NA's and drop rest of the NA's logic.

In the real data one person might have more than 10,000 rows, so a single dataframe has more than 1 million rows.

Is there a better, more efficient way to do the groupby or to get the count column?

The output I am expecting (rows with `count` 0 are the ones to reject):

       subject_ID       colum                dates    col2  col3  count
    0           1  L1CreaDate  2016-10-30 00:00:00  L1Crea     1      2
    1           1      L1Crea                  2.3  L1Crea     1      2
    2           1  L2CreaDate  2016-10-30 00:00:00  L2Crea     2      2
    3           1      L2Crea                  2.5  L2Crea     2      2
    4           1  L3CreaDate                  NaN  L3Crea     3      0
    5           1      L3Crea                  NaN  L3Crea     3      0
    6           2  L1CreaDate  2016-10-30 00:00:00  L1Crea     1      2
    7           2      L1Crea                 12.3  L1Crea     1      2
    8           2  L2CreaDate  2016-10-30 00:00:00  L2Crea     2      2
    9           2      L2Crea                 12.3  L2Crea     2      2

The Great

1 Answer


The idea is to use a list comprehension with split to improve performance, then not assign the output to a new count column but use it for filtering directly, and finally sort by the extracted integers:

df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()

df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID','col3'])
print(df2)
   subject_ID       colum                dates    col2  col3
0           1  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
1           1      L1Crea                  2.3  L1Crea     1
2           1  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
3           1      L2Crea                  2.5  L2Crea     2
6           2  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
7           2      L1Crea                 12.3  L1Crea     1
8           2  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
9           2      L2Crea                 12.3  L2Crea     2
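
Why the filter works: `transform('count')` counts the non-null `dates` in each `(subject_ID, col2)` group and broadcasts that count back onto every row, so `.ne(0)` keeps only rows whose group has at least one non-missing value. A minimal illustration on the unfiltered sample:

# count of non-null 'dates' per (subject_ID, col2) group, one value per row
print(df2.groupby(['subject_ID','col2'])['dates'].transform('count').tolist())
# [2, 2, 2, 2, 0, 0, 2, 2, 2, 2] -> the two all-NaN L3 rows get 0 and are dropped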

If you get the error:

AttributeError: 'float' object has no attribute 'split'

it means there are probably missing values, so the list comprehension should be changed:

df2['col2'] = [x.split("Date")[0] if x == x else np.nan for x in df2['colum']]
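
This works because NaN is the only value that is not equal to itself, so `x == x` is False exactly for the missing values. An equivalent, arguably more explicit guard (a sketch using `pd.isna`):

# skip missing values explicitly instead of relying on the NaN != NaN trick
df2['col2'] = [np.nan if pd.isna(x) else x.split("Date")[0] for x in df2['colum']]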

To check performance:

def new(df2):
    df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
    df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    return df2.sort_values(by=['subject_ID','col3'])


def orig(df2):
    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)
    return df2[df2['count'].ne(0)]

In [195]: %timeit (orig(df2))
10.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [196]: %timeit (new(df2))
6.11 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
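
The timings above are on the 10-row sample. To gauge the gap at a realistic size, you could benchmark on a scaled-up copy, a sketch below (the names `n` and `big` are illustrative):

n = 100000  # copies of the 10-row sample -> 1 million rows
big = pd.concat([df2] * n, ignore_index=True)
# offset the subject IDs so every copy holds its own pair of subjects
big['subject_ID'] += np.repeat(np.arange(n), len(df2)) * 2

%timeit orig(big.copy())
%timeit new(big.copy())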
jezrael
  • Will try now. Thanks for your response – The Great Aug 01 '19 at 09:00
  • btw, is it possible to find the execution time of each cell in a Jupyter notebook? When I use `%timeit`, it doesn't return anything – The Great Aug 01 '19 at 09:01
  • @AVLES - hmmm, try a custom function; give me some time for a solution. – jezrael Aug 01 '19 at 09:04
  • Your solution is better than mine in terms of performance. Your knowledge of Python is fantastic and admirable; I have benefited from your answers on this forum. Thank you – The Great Aug 01 '19 at 09:12
  • Just a quick question: is using the `.copy()` function recommended over just assigning to a new name, e.g. is `df3 = df2.copy()` recommended/faster than `df3 = df2`? – The Great Aug 01 '19 at 09:21
  • @AVLES - It depends on what you need. If you need a new dataframe without changing the original, use copy; you also need copy if you want to avoid [this error](https://stackoverflow.com/questions/55204404/settingwithcopywarning-after-using-copy/55204409#55204409). – jezrael Aug 01 '19 at 09:22
  • Can you help me with this post? https://stackoverflow.com/questions/57347377/transform-wide-to-long-but-with-repetition-of-a-specific-column – The Great Aug 04 '19 at 14:31