I'm struggling to articulate my problem, so I'll demonstrate via example.
Let's say I have a DataFrame that looks like this:
>>> df = pd.DataFrame([{'person': 'bob', 'year': 2016, 'production': 30, 'efficiency': .10}, {'person': 'bob', 'year': 2017, 'production': 35, 'efficiency': .11}, {'person': 'bob', 'year': 2018, 'production': 15, 'efficiency': .05}])
>>> df
efficiency person production year
0 0.10 bob 30 2016
1 0.11 bob 35 2017
2 0.05 bob 15 2018
I need to produce a report that has all of the information for each person on one line. Thus, I'd like to transform the above into:
efficiency 2016 person production 2016 efficiency 2017 production 2017 \
0 0.1 bob 30 0.11 35
efficiency 2018 production 2018
0 0.05 15
This code is able to make that transformation, but it is horribly inefficient:
def combine_years(df):
final_df = None
for name, stats in df.groupby('person'):
agg_df = None
for year in stats['year']:
new_df = stats[stats.year == year].rename(columns=lambda colname: column_renamer(colname, year))
new_df = new_df.drop('year', axis=1)
if agg_df is None:
agg_df = new_df
else:
agg_df = agg_df.merge(new_df, how='outer', on=['person'])
if final_df is None:
final_df = agg_df
else:
final_df = pd.concat([final_df, agg_df], axis=1)
return final_df
A couple questions:
- Is there a more general name for this type of transformation?
- Is there a way to do this more efficiently, using functions provided by pandas?