Faster Alternative to Aggregating and Pivoting a Dataframe?

Question

I have a dataframe of population data that looks like this, where every row is the population of a county taken at a specific date

County Date     Population
Alba   1900-1-1   1094
Alba   1900-2-1   1107
Alba   1900-3-1   1120
Belfor 1900-1-1   9756
Belfor 1900-3-1   9976
...

I want to get a dataframe where the index is the year, the columns are the county names, and the values are the average population in a given year.

This is what my code looks like currently

#year
df['Year'] = pd.DatetimeIndex(df['date']).year.fillna(0).astype(int)

##create aggregation on year, county of population
new_df =  df.groupby(['Year','County']).mean()
#turn county into column
new_df.reset_index(level=1, inplace=True)
##pivot dataframe
new_df.pivot(columns='Country',values='Population')

This has proven extraordinarily slow since there are several entries per year for hundreds of years. What can I do instead to make this run faster

Use `df.pivot_table(index='Year', columns='County', values='Population')`, default aggregation is `mean`, so `aggfunc` is not necessary — jezrael, Sep 22 '20 at 05:30

Shubham Sharma · Accepted Answer · 2020-09-22T05:28:55.680

1

Let us try crosstab to create a cross tabulation:

df['Date'] = pd.to_datetime(df['Date'])
pd.crosstab(df['Date'].dt.year, df['County'], df['Population'], aggfunc='mean')

Alternatively you can use pivot_table:

df.pivot_table(index=df['Date'].dt.year, columns='County', values='Population', aggfunc='mean')

Result:

County  Alba  Belfor
Date                
1900    1107    9866

edited Sep 22 '20 at 05:28

answered Sep 22 '20 at 05:22

Shubham Sharma

68,127
6
24
53

What is the performance of cross_tab and pivot_table compared to aggregate + pivot. I would assume pivot_table is similar internally. Is cross tab much faster? – knowads Sep 22 '20 at 05:42
@knowads I think performance wise here `crosstab` is equivalent to using `pivot_table`.. – Shubham Sharma Sep 22 '20 at 05:46
Update: Did a timing test and Crosstab was slightly faster on this particular dataset but they were close enough that both would have been fine – knowads Sep 23 '20 at 01:57
@knowads Thanks for testing..That's good to know :) – Shubham Sharma Sep 23 '20 at 04:23

Faster Alternative to Aggregating and Pivoting a Dataframe?

1 Answers1