2

I am trying to understand multi-indexing. I have found some very good links (here by Jake VanderPlas and here by Nelson Minar) but I am not able to grasp the concept.

I do have some specific questions. Specifically, if we talk about this data -

import pandas as pd

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)

Then:

  1. Why/how does health_data.loc[:,'Guido'] removes the top column index whereas health_data.loc[:,['Guido']] preserves it.
  2. Why do health_data.loc[:, [('Bob', 'HR')]] and health_data.loc[:, ('Bob', 'HR')] work as intended (assuming answer to question 1 is clear) but health_data.loc[:, ['Bob', 'HR']] gives an extra column.
  3. If I define idx = pd.IndexSlice then why health_data.loc[:,[idx['Bob','HR']]] returns the intended output but health_data.loc[:,list(idx['Bob','HR'])] returns the extra column

As my questions arose from my lack of understanding of multi-index, any links which explains them in detail will be help as well. I have seen some of the SO questions and answers (this one helps a bit) but most of them are very specific and I could not find one that talks about in general about double bracket concept.

Gaurav Singhal
  • 998
  • 2
  • 10
  • 25

1 Answers1

1

Q1: health_data.loc[:,'Guido'] selects a single column from the first level of the column multindex, whereas health_data.loc[:,['Guido']] selects a list of columns from the same. I suppose the pandas devs could have chosen to make the 1-element list a special case, and drop the redundant dimension from the response, but they didn't.

In [6]: health_data.loc[:,'Guido']
Out[6]: 
type          HR  Temp
year visit            
2013 1      38.0  35.4
     2      29.0  36.9
2014 1      41.0  36.6
     2      26.0  36.9

In [7]: health_data.loc[:,['Guido']]
Out[7]: 
subject    Guido      
type          HR  Temp
year visit            
2013 1      38.0  35.4
     2      29.0  36.9
2014 1      41.0  36.6
     2      26.0  36.9

Q2: cases 1 and 2 select 1 column, first as a 1-element list of columns, then as a single column value (like cases 2 and 1 respectively of your Q1). case 3 is more interesting: it selects a list of items from the first level of the multiindex. Since only 'Bob' exists in that levels values, it returns all columns which have 'Bob' in the first level

In [29]: health_data.loc[:, [('Bob', 'HR')]]
Out[29]: 
subject      Bob
type          HR
year visit      
2013 1      29.0
     2      47.0
2014 1      56.0
     2      37.0

In [30]: health_data.loc[:, ('Bob', 'HR')]
Out[30]: 
year  visit
2013  1        29.0
      2        47.0
2014  1        56.0
      2        37.0
Name: (Bob, HR), dtype: float64

In [31]: health_data.loc[:, ['Bob', 'HR']]
Out[31]: 
subject      Bob      
type          HR  Temp
year visit            
2013 1      29.0  36.5
     2      47.0  37.7
2014 1      56.0  38.5
     2      37.0  37.3

Q3 this is essentially a rerun of Q2: idx['Bob','HR'] evaluates to ('Bob', 'HR') which selects a single column from the multiindex. list(idx['Bob','HR']) evaluates to ['Bob','HR'] which selects a list of columns from the 1st level of the multiindex.

In [18]: health_data.loc[:,[idx['Bob','HR']]]
Out[18]: 
subject      Bob
type          HR
year visit      
2013 1      29.0
     2      47.0
2014 1      56.0
     2      37.0

In [19]: health_data.loc[:,list(idx['Bob','HR'])]
Out[19]: 
subject      Bob      
type          HR  Temp
year visit            
2013 1      29.0  36.5
     2      47.0  37.7
2014 1      56.0  38.5
     2      37.0  37.3

Since 'HR' does not exist in the 1st level of the multi index, the slice selects just 'Bob'. You can see the difference more clearly if you substitute 'Sue' in place of 'HR':

In [27]: health_data.loc[:,[idx['Bob','Sue']]]
Out[27]: 
subject    Bob
type       Sue
year visit    
2013 1     NaN
     2     NaN
2014 1     NaN
     2     NaN

In [28]: health_data.loc[:,list(idx['Bob','Sue'])]
Out[28]: 
subject      Bob         Sue      
type          HR  Temp    HR  Temp
year visit                        
2013 1      29.0  36.5  35.0  36.4
     2      47.0  37.7  36.0  36.4
2014 1      56.0  38.5  34.0  37.8
     2      37.0  37.3  30.0  36.9

In the first case, we are selecting a column with the multiindex value of ('Bob', 'Sue') which doesn't exist - hence the NaNs. In the 2nd case we are selecting columns which contain all values from the list ['Bob', 'Sue'] in the first level of the multiindex. I must admit, it took me by surprise to see tuples and lists give different results here. Again I suppose this is an API decision...

The docs cover most of this:

https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

https://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers

Rob Buckley
  • 708
  • 4
  • 16
  • just corrected a misreading of part 2 :-). Also, I don't know why the code block formatting has only worked for the last part of my answer - sorry about that! – Rob Buckley May 15 '18 at 13:47
  • Thanks for the detailed step by step answer, i think (not very sure) code block requires double indentation when the above paragraph is a list. I have done the some editing to make it more readable but my changes will be visible after peer review only. I need some time to go through the answer, and will get back if I have some questions. – Gaurav Singhal May 15 '18 at 16:39
  • hey, thanks for fixing the code block stuff. Let me know if something's still not clear. Multiindexes can be a confusing topic, I know from experience... – Rob Buckley May 15 '18 at 17:29