4

I have the following multiindex dataframe:

from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
    A,0, 0,12,34
    A,0, 1,13,34
    A,0, 2,23,34
    A,1, 6,23,38
    A,1, 7,22,38
    B,0,17,15,35
    B,0,18,17,35
    C,0,34,23,32
    C,0,35,21,32
    """)    
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)


>> df
               runtime  value1  value2
File   no               
A      0    0     0       12      34
            1     1       13      34
            2     2       23      34
       1    3     6       23      38
            4     7       22      38
B      0    5     17      15      35
            6     18      17      35
C      0    7     34      23      32
            8     35      21      32

What I would like to get is just the first values of every entry with a new file and a different number

A 0 34
A 1 38
B 0 35
C 0 32

The most similar questions I could find where these

Resample pandas dataframe only knowing result measurement count

MultiIndex-based indexing in pandas

Select rows in pandas MultiIndex DataFrame

but I was unable to construct a solution from them. The best I got was the ix operation, but as the values technically are still there (just not on display), the result is

idx = pd.IndexSlice
df.loc[idx[:,0],:]

could, for example, filter for the 0 value but would still return the entire rest of the dataframe.

Is a multiindex even the right tool for the task at hand? How to solve this?

johnDanger
  • 1,990
  • 16
  • 22
Eulenfuchswiesel
  • 879
  • 9
  • 20

2 Answers2

7

Use GroupBy.first by first and second level of MultiIndex:

s = df.groupby(level=[0,1])['value2'].first()
print (s)
File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64

If need one column DataFrame use one element list:

df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
         value2
File no        
A    0       34
     1       38
B    0       35
C    0       32

Another idea is remove 3rd level by DataFrame.reset_index and filter by Index.get_level_values with boolean indexing:

df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

For the sake of completeness, I would like to add another method (which I would not have found without the answere by jezrael).

s = df.groupby(level=[0,1])['value2'].nth(0)

This can be generalized to finding any, not merely the first entry

t = df.groupby(level=[0,1])['value1'].nth(1)

Note that the selection was changed from value2 to value1 as for the former, the results of nth(0) and nth(1) would have been identical.

Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html

Eulenfuchswiesel
  • 879
  • 9
  • 20