I need to create a new dataframe that has various summary columns based on a unique user id. As a simple example I have something like this:
df = pd.DataFrame({
'date': [date(2019,10,1), date(2019,10,2), date(2019,10,2), date(2019,10,4)],
'user': ['bob', 'bob', 'sally', 'bob']
})
That I need to transform into a dataframe like this that has totals for number of days we saw the user and the first time we saw them:
df_new = pd.DataFrame({
'user': ['bob', 'sally'],
'first_date': [date(2019,10,1), date(2019,10,2)],
'total_days': [3, 1]
})
What I'm doing is creating interim dataframes based on sorts and pivot tables and then merging together for a final large one (and in my real dataset we have about 15 columns so I'm doing a lot of this!). What my code looks like is:
df = pd.DataFrame({
'date': [date(2019,10,1), date(2019,10,2), date(2019,10,2), date(2019,10,4)],
'user': ['bob', 'bob', 'sally', 'bob']
})
df_new = df.copy()
df_new = df_new.sort_values('date', ascending=True).drop_duplicates(['user']).reset_index(drop=True)
df_new.rename(columns={'date' : 'first_date'}, inplace=True)
df_count = pd.pivot_table(df, values=['date'], index=['user'], aggfunc='count')
df_count.rename(columns={'date' : 'total_days'}, inplace=True)
df_new = pd.merge(df_count, df_new, how='left', on='user')
That works but I'm certain there is more efficient way to do this. Please help!