3

How do I sort data from a DataFrame into a DataFrame that uses a MultiIndex for both the indices and columns?

For example, convert from this:

     0    1  2     3     4
0  foo  two  A  2.30  0.01
1  foo  one  A  4.12  0.13
2  bar  two  B  9.89  3.66
3  foo  one  A  2.11  9.48
4  bar  two  A  1.07  5.55

to this:

            A           B      
            1     2     1     2
foo one  2.11  9.48   NaN   NaN
    two   2.3  0.01   NaN   NaN
bar one   NaN   NaN   NaN   NaN
    two  1.07  5.55  9.89  3.66

Currently I am iterating over each row in df1 and updating the values in df2, but I'd like a more efficient method than this:

for index, row in df1.iterrows():
    df2.loc[(row[0], row[1]), row[2]] = list(row[3:])
tog
  • 103
  • 1
  • 1
  • 6

1 Answers1

2

You can use:

def f(x):
    return pd.DataFrame({'a':x.values.ravel()}).rename(lambda x: x + 1)

df = df.groupby([0,1,2])[3,4].apply(f)['a'].unstack([2,3]).sort_index(level=0, axis=1)
df = df.rename_axis((None, None),axis=1).reindex(pd.MultiIndex.from_product(df.index.levels))
print (df)
            A                       B      
            1     2     3     4     1     2
bar one   NaN   NaN   NaN   NaN   NaN   NaN
    two  1.07  5.55   NaN   NaN  9.89  3.66
foo one  4.12  0.13  2.11  9.48   NaN   NaN
    two  2.30  0.01   NaN   NaN   NaN   NaN

Explanation:

  1. For each group by first 3 columns apply custom function with DataFrame, also increment index value for start from 1

  2. Reshape by unstack and sort Multiindex in columns by sort_index

  3. Remove columns names (2 in left corner) and add missing catagories to MultiIndex in index by reindex and MultiIndex.from_product

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Could you briefly explain how this works to speed up performance? – tog Apr 27 '18 at 07:32
  • @user2970608 - Not sure if understand, last `explanation` is not enough? – jezrael Apr 27 '18 at 07:35
  • I mean why this is faster – tog Apr 27 '18 at 07:36
  • 1
    @user2970608 - Because `iterrows` is obviously very rare used, because very slow - check [this](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316). – jezrael Apr 27 '18 at 07:38