I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.
I am using the Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. Before posting this, I researched the following: https://pandas.pydata.org/pandas-docs/stable/advanced.html, the xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, the threads "Select only one index of multiindex DataFrame" and "Selecting rows from pandas by subset of multiindex", and https://pandas.pydata.org/pandas-docs/stable/indexing.html. All of them explain how to subset a DataFrame using either a hierarchical index or hierarchical columns, but not both.
Here's the data.
import pandas as pd
import numpy as np
from numpy import nan as NA
# Hierarchical index for both rows and columns
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1','key2']
data.columns.names = ['state','color']
Here are my questions:
Question 1: I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column).
After some trial and error, I found that this version works (I really don't know why it works; my hypothesis is that data.loc['a',1] gives an indexed dataframe, which is then subset, and so on):
data.loc['a',1].loc['Title1'].loc['A']
Is there a better way to do the subsetting above?
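For comparison, here is a minimal sketch of the kind of single-call selection I have in mind. It assumes that .loc accepts one tuple per axis (row key first, column key second); the variable names chained and direct are only for illustration. Note that ('Title1', 'A') occurs twice in the columns of this example, so either form returns two values rather than a single scalar.

```python
import numpy as np
import pandas as pd

# Same frame as in the question: hierarchical index on rows and columns
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

# The chained version from the question: each .loc peels off one level
chained = data.loc['a', 1].loc['Title1'].loc['A']

# Assumed alternative: one tuple for the row labels, one for the column labels
direct = data.loc[('a', 1), ('Title1', 'A')]

print(chained)
print(direct)
```

If this assumption holds, the second form states both the row key and the column key in one place, which is closer to what I was looking for.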
Question 2: How do I subset the data after removing the indices?
data_wo_index = data.reset_index()
I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.
I tried one step at a time, but even the first step (i.e. subsetting key1 = a) gave me an error:
data_wo_index[data_wo_index['key1']=='a']
Exception: cannot handle a non-unique multi-index!
I don't know why Pandas still thinks that there is a multi-index. I have already reset it.
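A small check that may explain where the multi-index comes from: reset_index() moves the row levels into columns, but the column axis itself keeps both of its levels. The sketch below also tries one possible workaround, flattening the column labels into plain strings before filtering. The flattening scheme (joining state and color with an underscore) and the names flat, mask, and subset are my own assumptions, not something taken from the docs.

```python
import numpy as np
import pandas as pd

# Same frame as in the question
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

data_wo_index = data.reset_index()

# The row index is now a plain range, but the columns are still hierarchical
still_multi = isinstance(data_wo_index.columns, pd.MultiIndex)
print(still_multi, data_wo_index.columns.nlevels)

# Hypothetical workaround: collapse the two column levels into single strings.
# key1 and key2 have an empty second level after reset_index(), so keep them as-is.
flat = data_wo_index.copy()
flat.columns = [state if color == '' else f'{state}_{color}'
                for state, color in data_wo_index.columns]

# Filter with a plain boolean array so that no label alignment is involved
mask = (flat['key1'] == 'a').to_numpy()
subset = flat.loc[mask]
print(subset)
```

The flattened labels are not unique here, because ('Title1', 'A') and ('Title2', 'C') each appear twice in the original columns, so this is only a sketch for row filtering and not a general recipe.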
Question 3: If I run the data.columns command, I get the following output:
MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
names=['state', 'color'])
It seems to me that column names are also indexes. I am saying this because I see the MultiIndex class, which is also what I see if I run data.index:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
names=['key1', 'key2'])
I am unsure why column names are also an object of the MultiIndex class. If they are indeed an object of the MultiIndex class, then why do we need to set aside a few columns (e.g. key1 and key2 in our example above) as indices? In other words, why can't we just use column-based indices? (As a comparison, in data.table in R, we can setkey to whatever columns we want.)
Question 4: Why are column names an object of the MultiIndex class? It would be great if someone could offer a theoretical treatment of this.
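To make the observation behind Questions 3 and 4 concrete, here is a minimal sketch that inspects both axes of the frame. It only relies on DataFrame.axes and the transpose attribute .T; the names row_axis, col_axis, both_multi, and swapped are for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the question
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1', 'key2']
data.columns.names = ['state', 'color']

# A DataFrame carries one label object per axis: rows first, then columns
row_axis, col_axis = data.axes
both_multi = (isinstance(row_axis, pd.MultiIndex)
              and isinstance(col_axis, pd.MultiIndex))
print(type(row_axis).__name__, type(col_axis).__name__)

# Transposing swaps the two label objects without changing their type
swapped = data.T
print(swapped.index.names, swapped.columns.names)
```

This suggests that row labels and column labels are the same kind of object attached to different axes, which is the part I would like to understand better.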
As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead end.