Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)

Question

I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.

I am using Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset data.frame using either hierarchical index or hierarchical column, but not both.

Here's the data.

import pandas as pd
import numpy as np
from numpy import nan as NA

#Hierarchical index for row and column
data = pd.DataFrame(np.arange(36).reshape(6,6),
                 index=[['a']*2+['b']*1+['c']*1+['d']*2,
                        [1, 2, 3, 1, 3, 1]],
                 columns = [['Title1']*3+['Title2']*3,
                            ['A']*2+['B']*2+['C']*2])

data.index.names = ['key1','key2']
data.columns.names = ['state','color']

Here are my questions:

Question:1 I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column). After a few trial and errors, I found that this version works (I really don't know why this works--my hypothesis is that data.loc['a',1] gives an indexed dataframe, which is then subset...and so on):

data.loc['a',1].loc['Title1'].loc['A']

Is there a better way to subset above?

Question:2 How do I subset the data after deleting the indices?

data_wo_index = data.reset_index()

I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.

I tried one step at a time, but even the first step (i.e. subsetting key1 = a gave me an error:

data_wo_index[data_wo_index['key1']=='a']

Exception: cannot handle a non-unique multi-index!

I don't know why Pandas is still thinking that there is multi-index. I have already reset it.

Question:3 If I run data.columns command, I get the following output:

MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
           names=['state', 'color'])

It seems to me that column names are also indexes. I am saying this because I see MultiIndex class, which is what I see if I run data.index:

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
           names=['key1', 'key2'])

I am unsure why column names are also on object of MultiIndex class. If they are indeed an object of MultiIndex class, then why do we need to set aside a few columns (e.g. key1 and key2 in our example above) as indices, meaning why can't we just use column-based indices? (As a comparison, in data.table in R, we can setkey to whatever columns we want.)

Question 4 Why are column names an object of MultiIndex class? It will be great if someone can offer a theoretical treatment for this.

As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.

I'm very sorry about this. I tagged it because I did compare `R - data.table` with `pandas`. — watchtower, Aug 06 '18 at 02:47

score 1 · Answer 1 · answered Aug 06 '18 at 02:01

1

First off, MultiIndex's can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're actually doing (in terms of speed/organisation) to make those hassles worthwhile.

To answer your question 1, you can subset a MultiIndexed dataframe by providing tuples of the keys you want for each axis. So you first example subset can be done as:

# We want to use ":" to get all the states, but can't just
#   have ":" by itself due to Python's syntax rules
# So pandas provides the IndexSlice object to wrap it in
slicer = pd.IndexSlice
data.loc[('a', 1), (slicer[:], 'A')]

Which gives:

state   color
Title1  A        0
        A        1
Name: (a, 1), dtype: int32

answered Aug 06 '18 at 02:01

Marius

58,213
16
107
105

Thank you so much for your help. I am using indexes because I read somewhere that `join` and `merge` are faster with indexes. While McKinney has also written that one can do `join` and `merge` using columns. Could you please offer alternatives to indexing? – watchtower Aug 06 '18 at 02:46
1

The alternative to indexing is basically to leave everything in plain columns, like R does (when not using `data.table`). Joins may be slower than the indexed versions, but it can be easier to work with, especially if you're used to R. – Marius Aug 06 '18 at 03:00
Thank you once again. Final q: While I haven't run benchmarks between `pandas` and `data.table`, from your experience, will `pandas` be slower than `data.table` in R? – watchtower Aug 06 '18 at 03:05

score 1 · Answer 2 · answered Aug 06 '18 at 02:04

Wow seems like a lot of questions ..

Q1 Multiple index I will recommend IndexSlice

data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']]
Out[410]: 
state   color
Title1  A        0
        A        1

Q2 when you reset the index for this complete data frame it will have some issue , I do not think in R you can do that without ftable

Here is the way doing with pandas

data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
Out[434]: 
state key1 key2 Title1       Title2        
color                A  A  B      B   C   C
0        a    1      0  1  2      3   4   5
1        a    2      6  7  8      9  10  11

Q3 I think the column and index multiple level offer 4 dimension, yes you can using one columns or index to represent all just do stack

data.stack()
Out[436]: 
state            Title1  Title2
key1 key2 color                
a    1    A           0       3
          B           1       4
          C           2       5
     2    A           6       9
          B           7      10
          C           8      11
b    3    A          12      15
          B          13      16
          C          14      17
c    1    A          18      21
          B          19      22
          C          20      23
d    3    A          24      27
          B          25      28
          C          26      29
     1    A          30      33
          B          31      34
          C          32      35

Q4 MultiIndex is one of type for index , and pandas treat index and columns to index type

For example

df.index # index but just different type of index 
Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
df.columns # index as well 
Out[442]: Index(['A', 'B'], dtype='object')

Thank you so much for your help. I sincerely appreciate it. Could you please give me some resource how I can understand and create `data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]` query that you have posted for Q2? — watchtower, Aug 06 '18 at 02:43

Subsetting Hierarchical Index and Hierarchical column names in Pandas (with and without indices)

2 Answers2