I obtained a MultiIndex in pandas by running series.describe() on a grouped DataFrame. How can I sort these series by each modelName's mean and keep only specific fields? This

summary.sortlevel(1)['kappa']

sorts them but retains all the other fields like count. How can I only keep mean and std?

Edit

This is a textual representation of the DataFrame.

                                             kappa
modelName                                         
biasTotal                          count  5.000000
                                   mean   0.526183
                                   std    0.013429
                                   min    0.507536
                                   25%    0.519706
                                   50%    0.525565
                                   75%    0.538931
                                   max    0.539175
biasTotalWithDistanceMetricAccount count  5.000000
                                   mean   0.527275
                                   std    0.014218
                                   min    0.506428
                                   25%    0.520438
                                   50%    0.529771
                                   75%    0.538475
                                   max    0.541262
lightGBMbiasTotal                  count  5.000000
                                   mean   0.531639
                                   std    0.013819
                                   min    0.513363
Georg Heiler

1 Answer

You can do it this way:

Data:

In [77]: df
Out[77]:
                        0
level_1 level_0
a       25%      2.000000
        50%      4.000000
        75%      7.000000
        count    5.000000
        max      7.000000
        mean     4.400000
        min      2.000000
        std      2.509980
b       25%      2.000000
        50%      6.000000
        75%      8.000000
        count    5.000000
        max      8.000000
        mean     5.000000
        min      1.000000
        std      3.316625
c       25%      3.000000
        50%      4.000000
        75%      5.000000
        count    5.000000
        max      8.000000
        mean     4.000000
        min      0.000000
        std      2.915476
d       25%      4.000000
        50%      8.000000
        75%      8.000000
        count    5.000000
        max      9.000000
        mean     6.000000
        min      1.000000
        std      3.391165

Solution:

In [78]: df.loc[pd.IndexSlice[:, ['mean','std']], :]
Out[78]:
                        0
level_1 level_0
a       mean     4.400000
        std      2.509980
b       mean     5.000000
        std      3.316625
c       mean     4.000000
        std      2.915476
d       mean     6.000000
        std      3.391165

Setup:

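import numpy as np
import pandas as pd

# Random 5x4 integer frame -> describe() -> stack into a (column, statistic) MultiIndex
df = (pd.DataFrame(np.random.randint(0,10,(5,4)),columns=list('abcd'))
        .describe()
        .stack()
        .reset_index()
        .set_index(['level_1','level_0'])
        .sort_index()
)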
MaxU - stand with Ukraine
  • when I add a .sortlevel(1) to your df the whole df is sorted but what I rather would like to achieve is that only mean is used for sorting – Georg Heiler Oct 23 '16 at 09:12
  • @GeorgHeiler, can you post your DF in text form (for example output of `print(summary)`) so i could reproduce it? – MaxU - stand with Ukraine Oct 23 '16 at 09:16
  • @MaxU sure, please see the edit. As you can see, my df's means are not ordered by default like the ones in your example. I would like to order by the mean but preserve the "stackedness", e.g. the `std` that goes with the respective mean – Georg Heiler Oct 23 '16 at 09:19
  • @GeorgHeiler, i'm afraid you either have to sort your index (all levels) or use `df.reset_index()` and work as with a normal (single level indexed) DF – MaxU - stand with Ukraine Oct 23 '16 at 09:31
  • I see, but a reset index produces 2 records per model, e.g. one for mean and one for std, distinguished by a separate column called level_1. How can I sort only by the mean value but keep the relationship between these 2 rows, e.g. have largest mean, accompanying variance, next mean with next variance, ... – Georg Heiler Oct 23 '16 at 09:35
  • @GeorgHeiler, i have another idea - what about creating your `summary` DF differently - `grp = df.groupby(...).agg(...).reset_index(); summary=grp.describe(); summary.loc[['mean','std']]` – MaxU - stand with Ukraine Oct 23 '16 at 09:35
  • interesting idea. but what would you pass to agg? – Georg Heiler Oct 23 '16 at 09:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/126440/discussion-between-maxu-and-georg-heiler). – MaxU - stand with Ukraine Oct 23 '16 at 09:39
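
One possible way to get the ordering discussed in these comments (groups sorted by their mean, each mean still followed by its own std) is to go through the wide form: unstack the statistics into columns, sort the rows by the `mean` column, keep only the wanted columns, and stack back. The sketch below is not taken from the thread. It assumes a recent pandas in which `groupby(...)[col].describe()` returns one row per group, and the model names and kappa values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw scores; model names and kappa values are invented.
raw = pd.DataFrame({
    "modelName": ["m1"] * 3 + ["m2"] * 3 + ["m3"] * 3,
    "kappa": [0.50, 0.52, 0.54, 0.60, 0.61, 0.62, 0.40, 0.45, 0.50],
})

# One row per model, one column per statistic (count, mean, std, ...).
wide = raw.groupby("modelName")["kappa"].describe()

# Long form with a (modelName, statistic) index, similar to the question's layout.
stacked = wide.stack()

# Back to wide, order models by mean (largest first), keep mean and std only.
ordered = stacked.unstack().sort_values("mean", ascending=False)[["mean", "std"]]

# Stack again: each model's mean is immediately followed by its std.
summary = ordered.stack()

print(summary)
```

Sorting while the statistics are columns keeps each mean and std on the same row, so the pairing cannot be broken by the sort; stacking afterwards only changes the layout.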