Fill missing year groupy by multiple columns and display multiple columns horizontally in order in Pandas

Question

For a dataframe as follows, I want to fill missing years (from 2015 to 2017) in each group of city and district; then calculate pct by grouping by columns: city, district and year, at last step, then display value and pct columns horizontally?

  city district  value  year
0   sh        a      2  2015
1   sh        a      3  2016
2   sh        b      5  2015
3   sh        b      3  2016
4   bj        c      4  2015
5   bj        c      3  2017

What I have done by far:

1. Fill missing years, but not working yet:

rng = pd.date_range('2015', '2017', freq='YS').dt.year
df = df.apply(lambda x: x.reindex(rng, fill_value = 0))

2. Calculating pct by grouping by city and district:

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()

3. Displaying value and pct columns horizontally but the order is not I wanted:

df.pivot_table(columns='year', index=['city','district'], values=['value', 'pct'], fill_value='NaN').reset_index()

The output I get so far:

      city   district       pct            value          
year                  2015 2016  2017  2015 2016 2017
0      bj        c     NaN  NaN -0.25   4.0  NaN    3
1      sh        a     NaN  0.5   NaN   2.0    3  NaN
2      sh        b     NaN -0.4   NaN   5.0    3  NaN

How could I get the expected result will be like this?

city  district      2015         2016         2017
                value  pct    value  pct  value   pct
bj     c          4                         3        
sh     a          2             3    0.5   
sh     b          5             3   -0.4

Thank you.

With reference https://stackoverflow.com/questions/50951412/pandas-pivot-one-column-while-using-same-column-value-as-column-headers. `df.assign(val=1).pivot_table(columns='year',index=['city','district'],values='value',fill_value='NaN').reset_index()` works. May I ask, `val=1` means? — ah bon, Oct 18 '19 at 08:38
it makes additional column ['val'], in their example with "1" values. They added it to calculate values. But in this case you able to write just this: df.pivot_table(columns='year',index=['city','district'],values='value',fill_value='NaN').reset_index() — Alex, Oct 18 '19 at 09:03

jezrael · Accepted Answer · 2019-11-06T06:35:58.813

Use DataFrame.swaplevel with DataFrame.sort_index, also added another solution for reindex:

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])

df = df.set_index(['city','district','year']).reindex(mux)

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()

df = df.pivot_table(columns='year', 
                    index=['city','district'],
                    values=['value', 'pct'],
                    fill_value='NaN')

df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
print (df)
year          2015       2016        2017      
               pct value  pct value   pct value
city district                                  
bj   c         NaN   4.0  0.0   NaN -0.25     3
sh   a         NaN   2.0  0.5     3  0.00   NaN
     b         NaN   5.0 -0.4     3  0.00   NaN

EDIT: Error:

ValueError: cannot handle a non-unique multi-index!

means there are duplicates, per columns passed to groupby, so here by ['city','district','year']. Solution is create unique values - e.g. by aggregate mean:

print (df)
#  city district  value  year
#0   sh        a      2  2015
#0   sh        a     20  2015
#1   sh        a      3  2016
#2   sh        b      5  2015
#3   sh        b      3  2016
#4   bj        c      4  2015
#5   bj        c      3  2017

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])

print (df.groupby(['city','district','year'])['value'].mean())
city  district  year
bj    c         2015     4
                2017     3
sh    a         2015    11
                2016     3
      b         2015     5
                2016     3
Name: value, dtype: int64

df = df.groupby(['city','district','year'])['value'].mean().reindex(mux)

print (df)
#city  district  year
#sh    a         2015    11.0
#                2016     3.0
#                2017     NaN
#      b         2015     5.0
#                2016     3.0
#                2017     NaN
#      c         2015     NaN
#                2016     NaN
#                2017     NaN
#bj    a         2015     NaN
#                2016     NaN
#                2017     NaN
#      b         2015     NaN
#                2016     NaN
#                2017     NaN
#      c         2015     4.0
#                2016     NaN
#                2017     3.0
#Name: value, dtype: float64

Great thanks. For calculation of `pct`, with `df.sort_values('year').groupby(['city', 'district'])['value'].apply(lambda x: x.pct_change())`, I got `ValueError: cannot reindex from a duplicate axis`, but with `df.sort_values('year').groupby(['city', 'district']).value.pct_change()`, no errors, any ideas regarding this? — ah bon, Nov 05 '19 at 08:41
@ahbon - hmm, pct_change need sorting, but from error not clear... — jezrael, Nov 05 '19 at 08:42
ok, another thing i get is: `groupby(['city', 'district']).value.pct_change()` is faster than `apply-lambda` method. — ah bon, Nov 05 '19 at 08:47
@ahbon - hmmm, I think reason is some performance improvement, because if use `apply` it is most general function, so slow. — jezrael, Nov 05 '19 at 08:48
Sorry, I get `ValueError: cannot handle a non-unique multi-index!` with my real data, any ideas? — ah bon, Nov 06 '19 at 04:56
The bug come from this line: `df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()` — ah bon, Nov 06 '19 at 05:09
Hi @jezrael, could you please help me to improve my code for a multivariante time series modeling task, I've sent you by email. Great thanks. — ah bon, Nov 27 '19 at 09:21

Fill missing year groupy by multiple columns and display multiple columns horizontally in order in Pandas

1 Answers1

Linked