I have a dataframe which can be generated using the code below
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'subject_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                    'colum': ['L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea', 'L3CreaDate', 'L3Crea',
                              'L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea'],
                    'dates': ['2016-10-30 00:00:00', 2.3, '2016-10-30 00:00:00', 2.5, np.nan, np.nan,
                              '2016-10-30 00:00:00', 12.3, '2016-10-30 00:00:00', 12.3]})
I am trying to do the operations below on the above dataframe. The code works fine, but the issue is the groupby statement: it is quick on the sample dataframe, yet on my real data with over 1 million records it runs for a very long time.
# Strip the "Date" suffix so a measurement row and its date row share a key (e.g. "L1CreaDate" -> "L1Crea")
df2['col2'] = df2['colum'].str.split("Date").str[0]
# Extract the numeric part of the label for sorting
df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID', 'col3'])
# Count the non-null dates in each (subject_ID, col2) group
df2['count'] = df2.groupby(['subject_ID', 'col2'])['dates'].transform(pd.Series.count)
I do the groupby to get the count column so that I can reject records with a count of 0. There is a logic behind dropping the NA's; it's not just about dropping all of them. If you would like to know more about that, refer to this post: retain few NA's and drop rest of the NA's logic.
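For context, the rejection step itself is just a boolean filter on the count column once it has been computed. A minimal sketch of that step (the name df2_filtered is only illustrative):

# Keep only the rows whose (subject_ID, col2) group has at least one non-null date,
# i.e. reject the rows where count is 0
df2_filtered = df2[df2['count'] != 0]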
In the real data one person can have more than 10,000 rows, so a single dataframe has more than 1 million rows.
Is there a better, more efficient way to do the groupby or to get the count column?
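One variant that might be faster is to pass the built-in aggregation name 'count' instead of pd.Series.count, since pandas can then dispatch to its cythonized groupby path rather than calling the function on each group. A minimal sketch (I have not benchmarked this on data of this size):

# Same count of non-null dates per (subject_ID, col2) group, but using the
# built-in 'count' aggregation so pandas can use its optimized groupby code
df2['count'] = df2.groupby(['subject_ID', 'col2'])['dates'].transform('count')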