2

I am trying to subset a dataframe using a multilevel index. For example:

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': range(1, 7) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df2=df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

As you can see, df2 contains multilevel index with state and office_id. For df2, I would like to subset the dataframe by using the multindex find the following:

1) only state = AZ

2) only office_id <4

3) state = CA and office_id = 5

Historically I would rest the index in the dataframe and subset by columns, but that is not efficient.

Can someone please point me in the right direction? Thank you!

Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
Trexion Kameha
  • 3,362
  • 10
  • 34
  • 60

2 Answers2

3

Use index's .get_level_values based indexing i.e an example

df2.loc[(df2.index.get_level_values(0)=='AZ')]
# Also you can specify the name i.e df2.loc[(df2.index.get_level_values('state')=='AZ')]
                 sales
state office_id        
AZ    2          469728
      4          398925
      6          704669

df2.loc[(df2.index.get_level_values(0)=='CA') & (df2.index.get_level_values(1)<4)]

                  sales
state office_id        
CA    1          105244
      3          116514
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
1

You could also use the query method:

My df2 is a little different because of the random numbers:

df2
                  sales
state office_id        
AZ    2          399569
      4          784863
      6          161690
CA    1          324148
      3          631289
      5          917734
CO    1          380714
      3          289783
      5          682802
WA    2          941091
      4          804442
      6          379563

Only Arizona offices:

df2.query('state == "AZ"')
                  sales
state office_id        
AZ    2          399569
      4          784863
      6          161690

Only office ids less than 4:

df2.query('office_id < 4')

                  sales
state office_id        
AZ    2          399569
CA    1          324148
      3          631289
CO    1          380714
      3          289783
WA    2          941091

California and office id = 5

df2.query('state == "CA" & office_id == 5')
                 sales
state office_id        
CA    5          917734
cfort
  • 2,655
  • 1
  • 19
  • 29
  • 1
    You can refer to unnamed indexes with `ilevel_0`, `ilevel_1`. [Doc ref.](https://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax) – cfort Dec 23 '17 at 18:55
  • But the `ilevel` names only work if the indices have no names. Set `df2.index.names = [None, None]` and then this `df2.query('ilevel_0 == "AZ"')` will work. – cfort Dec 23 '17 at 19:06
  • wouldn't that be nice if you update the same.in the answer.. Its always better to make the solution a generalized version. – Bharath M Shetty Dec 23 '17 at 19:07
  • Is it also possible to subset the dataframe to get e.g. two states (AZ and CA)? – Ben Jan 31 '20 at 14:10
  • `df2.query('state in ["CA", "AZ"]')` – cfort Jan 31 '20 at 15:12