
I am following this tutorial: GitHub Link

Scroll down (Ctrl+F: "Exercise: Select the most-reviewd beers") to the section with that title.

The dataframe is multi-indexed.

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

My question is about how IndexSlice is used here: how come you can skip the colon after top_beers and the code still runs? I would have expected to need this:

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']] 

There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without specifying what to do with the time level?

Cheng
  • That's what the `:` operator does. You are filtering by only one of the three columns of the hierarchical index. The other two (the ones using `:`) can take any value. You can think of `:` as a filter that matches `True` for any value. – Gustavo Bezerra May 21 '17 at 07:36
  • @GustavoBezerra the problem is that even without the third `:` the code still works. `reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]` works even without the third `:` – Cheng May 21 '17 at 12:18
  • top_beers is list-like. You're filtering the second index level, beer_id, by top_beers. The other two levels default to all values. If you want to slice by range, use slice(a, b). – Golden Lion Feb 16 '21 at 17:03

2 Answers


This answer explains how pd.IndexSlice works and why it is useful.

There is not much to say about its implementation. As you read in the source, it just does the following:

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

From this one can see that pd.IndexSlice only passes the arguments that __getitem__ has received. Looks pretty pointless, doesn't it? But it actually does something.

The function obj.__getitem__(arg) is called when an object obj is accessed through its bracket operator obj[arg]. For sequence-like objects, arg can be either an integer or a slice object. We rarely construct slices ourselves. Rather, we use the slice operator : for this purpose, e.g. obj[0:5].

And here is the crucial point: the Python interpreter converts these slice operators : into slice objects before calling the object's __getitem__(arg) method. Therefore, the return value of IndexSlice.__getitem__() will be a slice, a plain value such as an integer (if no : was used), or a tuple of these (if multiple arguments are passed). In summary, the only purpose of IndexSlice is to spare us from constructing the slices ourselves. This behavior is particularly useful for pd.DataFrame.loc.

The following examples illustrate the behavior of pd.IndexSlice:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0.1:2.3])         # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

We observe that all usages of colons : are converted into slice objects. If multiple arguments are passed to the index operator, the arguments are turned into n-tuples.
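Applied to the expressions from the question, the same mechanism shows that the two forms are simply tuples of different length. The following is a minimal sketch that does not need pandas: `PassThrough` is a hypothetical stand-in that mimics the `IndexSlice` class shown above, and the beer ids are made up for illustration.

```python
class PassThrough:
    """Hypothetical stand-in for pd.IndexSlice: returns its index argument unchanged."""

    def __getitem__(self, arg):
        return arg


idx = PassThrough()
top_beers = [101, 202]  # made-up beer ids, for illustration only

two_levels = idx[:, top_beers]        # what the tutorial passes to .loc
three_levels = idx[:, top_beers, :]   # the form with the trailing colon

print(two_levels)    # (slice(None, None, None), [101, 202])
print(three_levels)  # (slice(None, None, None), [101, 202], slice(None, None, None))
print(len(two_levels), len(three_levels))  # 2 3
```

So the trailing `:` only appends one more `slice(None)` to the tuple. Whether a shorter tuple is accepted is decided by `.loc`, not by `IndexSlice`.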

The next example demonstrates how this could be useful for a pandas data-frame df with a multi-level index:

# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Select column 'col1' for all rows.
df.loc[:,'col1']            # pd.Series

# Note in the above example that the returned value has
# type pd.Series, since only one column is returned. One 
# can force the returned object to be a data frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # pd.DataFrame

# Select all rows with top-level values 0 to 3
# (label-based slices in .loc include both endpoints).
df.loc[0:3, 'col1']

# To create a slice for multiple index levels, we need to
# somehow pass a list of slices. The following, however,
# raises a SyntaxError because the slice operator ':'
# cannot be used inside a list literal, so it is commented out:
# df.loc[[0:3, 'a':'c'], 'col1']

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# This is why pd.IndexSlice is useful. It helps to
# create slices that use two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information to the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

In summary, pd.IndexSlice improves readability when specifying complicated slices.

What pandas does with these slices is another story. It selects rows/columns, starting from the topmost index level and reduces the selection as it goes further down, depending on how many levels are specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all this.
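The narrowing can be made visible by counting rows as more levels are specified. This is a sketch that rebuilds the sample table from above (the variable names `n_all`, `n_one`, `n_two` and `n_three` are introduced here only for illustration); the counts follow from the fact that label slices in `.loc` include both endpoints.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Same three-level sample index as in the example above.
mi = pd.MultiIndex.from_product([range(10), list('abcdef'), ['I', 'II', 'III', 'IV']])
df = pd.DataFrame(np.random.random([len(mi), 2]), index=mi, columns=['col1', 'col2'])

n_all = len(df)                                         # 10 * 6 * 4 = 240
n_one = len(df.loc[idx[0:3], :])                        #  4 * 6 * 4 = 96
n_two = len(df.loc[idx[0:3, 'a':'c'], :])               #  4 * 3 * 4 = 48
n_three = len(df.loc[idx[0:3, 'a':'c', 'I':'III'], :])  #  4 * 3 * 3 = 36

print(n_all, n_one, n_two, n_three)
```

Each additional level that is specified filters the rows further, while levels that are left out do not restrict the selection.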

As has been pointed out in a comment, pandas seemingly behaves weirdly in some special cases. One would expect the two examples the OP has mentioned to evaluate to the same result. However, they are treated differently by pandas internally.

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem affects only the second expression, since
# pandas can work around unhashable indexers in one case, but
# not in the other.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

Admittedly, the difference is subtle.

normanius
  • What if the indices were float numbers? How would it work then? – arash Jan 11 '20 at 13:04
  • @arash: The same. `slice()` is agnostic of datatypes. It just bundles information about `start`, `stop` and `step`. How a particular slice (e.g. `slice(0.1, 2.3, 4.5)`) is interpreted depends on the object receiving the slice. For a `df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[0.1,2.3,4.5])` you can access all columns by `idx[0.1:4.5]`, which is consistent with the behavior for other index types. And it's not too surprising that `pandas` raises an error for `idx[0.1:4.5:2.3]` because it cannot make sense of a float-type step. – normanius Jan 11 '20 at 13:58
  • @arash See maybe also [this answer](https://stackoverflow.com/a/3912107/3388962) – normanius Jan 11 '20 at 13:58

Pandas only requires you to specify enough levels of the MultiIndex to remove any ambiguity. Since you're slicing on the 2nd level, you need the first : to say "I'm not filtering on this level".

Any additional levels not specified are returned in their entirety, so equivalent to a : on each of those levels.
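A small check of this rule, as a sketch under stated assumptions: the frame below is made up for illustration and only borrows the level names `profile_name`, `beer_id` and `time` from the question, while the `rating` column, the reviewer names and the beer ids are hypothetical.

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical three-level index modelled on the question's data.
index = pd.MultiIndex.from_product(
    [['ann', 'bob'], [1, 2, 3], ['2017-01', '2017-02']],
    names=['profile_name', 'beer_id', 'time'],
)
reviews = pd.DataFrame({'rating': range(len(index))}, index=index)
top_beers = [1, 3]  # made-up beer ids

short = reviews.loc[idx[:, top_beers], :]     # 'time' level left unspecified
full = reviews.loc[idx[:, top_beers, :], :]   # 'time' level spelled out as ':'

print(short.equals(full))  # True
print(len(short))          # 2 profiles * 2 beers * 2 times = 8
```

Both selections contain every `time` value for the chosen beers, which is why the trailing colon is optional here.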

TomAugspurger
  • If that is the case, then why can't I remove the colon from this line within the same tutorial: `reviews.loc[pd.IndexSlice[top_reviewers, 99,:], ['beer_name', 'brewer_id']]`? If I remove the colon and comma after `99`, I get an `unhashable type: 'Index'` error – Cheng May 22 '17 at 00:15
  • Not sure off the top of my head. Based on the error message, about `Index` being unhashable, it's possible it's taking a different indexing path. You could open an issue on github with a simpler example and we'll take a look. – TomAugspurger May 24 '17 at 13:11
  • @Cheng: The problem is that `top_reviewers` is of type `pd.Index`, which apparently is not hashable out of the box. To fix this, you could transform it into a list first (which can be further transformed into a hashable object). So the following will work: `reviews.loc[pd.IndexSlice[top_reviewers.tolist(), 99], ['beer_name', 'brewer_id']]` – normanius Oct 30 '18 at 15:12
  • @Cheng But it's true that you discovered a small inconsistency in the way pandas processes slices: `top_reviewers` in `pd.IndexSlice[top_reviewers, 99, :]` and `pd.IndexSlice[top_reviewers, 99]` is not treated in exactly the same way, the latter leading to an error, while the former does not. – normanius Oct 30 '18 at 15:17