This answer explains how pd.IndexSlice works and why it is useful.
There is not much to say about its implementation. As you read in the source, it just does the following:
class IndexSlice(object):
    def __getitem__(self, arg):
        return arg
From this one can see that pd.IndexSlice simply returns whatever argument its __getitem__ method receives. Looks pretty pointless, doesn't it? But it actually does something.
The method obj.__getitem__(arg) is called when an object obj is accessed through its bracket operator obj[arg]. For sequence-like objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we use the colon notation : for this purpose, e.g. obj[0:5].
And here is the crucial point: The Python interpreter converts this colon notation into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will actually be a slice, an integer (if no : was used), or a tuple of these (if multiple comma-separated arguments are passed). In summary, the only purpose of IndexSlice is that we do not have to construct the slices ourselves. This behavior is particularly useful for pd.DataFrame.loc.
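The conversion described above is done by Python itself, not by pandas. A minimal sketch that shows this with a stand-in class (the name Echo is hypothetical and only mirrors the three-line class quoted earlier):

```python
class Echo:
    """Hypothetical stand-in that mirrors the three-line IndexSlice class."""

    def __getitem__(self, arg):
        # Return the argument exactly as the interpreter passed it in.
        return arg


echo = Echo()

scalar = echo[7]             # no colon: the integer is passed through
single = echo[0:5]           # one colon expression: a slice object
multi = echo[0:3, 'a':'c']   # comma-separated: a tuple of slice objects

print(scalar)                                          # 7
print(single == slice(0, 5))                           # True
print(multi == (slice(0, 3), slice('a', 'c')))         # True
```

No pandas import is needed here, which suggests that pd.IndexSlice adds nothing beyond a convenient name for this pattern.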
The following examples illustrate the behavior of pd.IndexSlice:
import pandas as pd
idx = pd.IndexSlice
print(idx[0]) # 0
print(idx[0,'a']) # (0, 'a')
print(idx[:]) # slice(None, None, None)
print(idx[0:3]) # slice(0, 3, None)
print(idx[0.1:2.3]) # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c']) # (slice(0, 3, None), slice('a', 'c', None))
We observe that all usages of colons : are converted into slice objects. If multiple arguments are passed to the index operator, the arguments are turned into n-tuples.
The next example demonstrates how this could be useful for a pandas DataFrame df with a multi-level index:
# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi), 2]),
                  index=mi, columns=['col1', 'col2'])
# Select column 'col1' for all rows.
df.loc[:,'col1'] # pd.Series
# Note in the above example that the returned value has
# type pd.Series, since only one column is returned. One
# can force the returned object to be a data frame:
df.loc[:,['col1']] # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # pd.DataFrame
# Select all rows with top-level values 0 through 3
# (label slices in .loc include both endpoints).
df.loc[0:3, 'col1']
# To create a slice for multiple index levels, we need to
# somehow pass several slices at once. The following, however,
# raises a SyntaxError because the slice notation ':'
# cannot be used inside a list literal (so it is commented out here).
# df.loc[[0:3, 'a':'c'], 'col1']
# The following is valid Python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']
# This is why pd.IndexSlice is useful. It helps to
# create slices that use two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1']
# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1']
# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series
# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does
# not add any information to the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1'] # pd.Series
# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :] # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :] # pd.DataFrame
In summary, pd.IndexSlice improves readability when specifying complicated slices.
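As a quick check that the readable form and the explicit form really select the same rows, here is a minimal sketch on a small two-level frame (the names small, with_idx and with_slices are made up for this illustration):

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Hypothetical two-level sample: 5 integers x 4 letters = 20 rows.
mi = pd.MultiIndex.from_product([range(5), list('abcd')])
small = pd.DataFrame({'col1': np.arange(len(mi))}, index=mi)

# Same selection, written with pd.IndexSlice and with explicit slice objects.
with_idx = small.loc[idx[0:3, 'a':'c'], 'col1']
with_slices = small.loc[(slice(0, 3), slice('a', 'c')), 'col1']

print(with_idx.equals(with_slices))  # True
# Label slices include both endpoints: 4 integers (0..3) x 3 letters (a..c).
print(len(with_idx))                 # 12
```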
What pandas does with these slices is another story. It selects rows/columns starting from the topmost index level and narrows the selection as it goes further down, depending on how many levels are specified. pd.DataFrame.loc is an object with its own __getitem__() method that does all this.
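The narrowing can be observed by counting rows while adding one level at a time. This is only a sketch of the visible effect, not of how pandas implements it internally; it rebuilds a frame with the same shape as the sample table above, and the variable names are chosen for this illustration:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Same shape as the sample table: 10 x 6 x 4 = 240 rows.
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']])
df = pd.DataFrame(np.random.random([len(mi), 2]),
                  index=mi, columns=['col1', 'col2'])

# Each additional level restricts the rows kept by the previous ones.
one_level = df.loc[idx[0:3], :]                        # 4 x 6 x 4
two_levels = df.loc[idx[0:3, 'a':'c'], :]              # 4 x 3 x 4
three_levels = df.loc[idx[0:3, 'a':'c', 'I':'II'], :]  # 4 x 3 x 2

print(len(df), len(one_level), len(two_levels), len(three_levels))
# 240 96 48 24
```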
As has been pointed out in a comment, pandas seemingly behaves oddly in some special cases. The two examples the OP mentioned should evaluate to the same result. However, pandas treats them differently internally.
# This will work.
reviews.loc[idx[top_reviewers, 99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers, 99], ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem affects only the second expression, since
# pandas can work around unhashable indexers in one case, but
# not in the other.)
reviews.loc[idx[tuple(top_reviewers), 99], ['beer_name', 'brewer_id']]
Admittedly, the difference is subtle.
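Whether the TypeError above reproduces likely depends on the pandas version, so the following sketch does not try to trigger it. It only shows one defensive way to write such a selection: convert the pd.Index to a plain list and spell out every level. The data here is a made-up stand-in for the OP's reviews table, and the index level names are assumptions:

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical stand-in for the OP's data, indexed by
# (reviewer, score, review_id). All names and values are invented.
mi = pd.MultiIndex.from_product(
    [['ann', 'bob', 'cat'], [98, 99], [1, 2]],
    names=['reviewer', 'score', 'review_id'],
)
reviews = pd.DataFrame(
    {'beer_name': [f'beer{i}' for i in range(len(mi))],
     'brewer_id': list(range(len(mi)))},
    index=mi,
)
top_reviewers = pd.Index(['ann', 'cat'])

# A plain list for the first level, a scalar for the second,
# and an explicit ':' for the third.
selected = reviews.loc[idx[list(top_reviewers), 99, :],
                       ['beer_name', 'brewer_id']]

print(selected.shape)  # (4, 2)
```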