How can I improve this pandas data transformation to be more efficient?

Question

I need to create a new dataframe that has various summary columns based on a unique user id. As a simple example I have something like this:

df = pd.DataFrame({
    'date': [date(2019,10,1), date(2019,10,2), date(2019,10,2), date(2019,10,4)],
    'user': ['bob', 'bob', 'sally', 'bob']
})

That I need to transform into a dataframe like this that has totals for number of days we saw the user and the first time we saw them:

df_new = pd.DataFrame({
    'user': ['bob', 'sally'],
    'first_date': [date(2019,10,1), date(2019,10,2)],
    'total_days': [3, 1]
})

What I'm doing is creating interim dataframes based on sorts and pivot tables and then merging together for a final large one (and in my real dataset we have about 15 columns so I'm doing a lot of this!). What my code looks like is:

df = pd.DataFrame({
    'date': [date(2019,10,1), date(2019,10,2), date(2019,10,2), date(2019,10,4)],
    'user': ['bob', 'bob', 'sally', 'bob']
})

df_new = df.copy()
df_new = df_new.sort_values('date', ascending=True).drop_duplicates(['user']).reset_index(drop=True)
df_new.rename(columns={'date' : 'first_date'}, inplace=True)

df_count = pd.pivot_table(df, values=['date'], index=['user'], aggfunc='count')
df_count.rename(columns={'date' : 'total_days'}, inplace=True)

df_new = pd.merge(df_count, df_new, how='left', on='user')

That works but I'm certain there is more efficient way to do this. Please help!

The solutions in that duplicate, especially cs95's, should give you a lot of information. If using a newer version of `pandas` you can rename everything at once with `named` aggregations. The key here is you want to use both `min` and `pd.Series.nunique` (or maybe `'count'`) to aggegate the `'date'` column: `df.groupby('user').agg(first_date = ('date', 'min'), total_days = ('date', pd.Series.nunique))` — ALollz, Dec 17 '19 at 18:56

How can I improve this pandas data transformation to be more efficient?

0 Answers0