
Questions are at the end, in bold. But first, let's set up some data:

import numpy as np
import pandas as pd
from itertools import product

np.random.seed(1)

team_names = ['Yankees', 'Mets', 'Dodgers']
jersey_numbers = [35, 71, 84]
game_numbers = [1, 2]
observer_names = ['Bill', 'John', 'Ralph']
observation_types = ['Speed', 'Strength']

row_indices = list(product(team_names, jersey_numbers, game_numbers, observer_names, observation_types))
observation_values = np.random.randn(len(row_indices))

tns, jns, gns, ons, ots = zip(*row_indices)

data = pd.DataFrame({'team': tns, 'jersey': jns, 'game': gns, 'observer': ons, 'obstype': ots, 'value': observation_values})

data = data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
data = data.unstack(['observer', 'obstype'])
data.columns = data.columns.droplevel(0)

This gives `data`, a frame with a three-level row index (team, jersey, game) and a two-level column index (observer, obstype).
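For reference, here is a self-contained version of the setup above with a quick check of the resulting shape; the construction mirrors the code shown, with minor shorthand:

```python
import numpy as np
import pandas as pd
from itertools import product

np.random.seed(1)

# Rebuild the example frame: 3 teams x 3 jerseys x 2 games x 3 observers x 2 obstypes.
rows = list(product(['Yankees', 'Mets', 'Dodgers'], [35, 71, 84], [1, 2],
                    ['Bill', 'John', 'Ralph'], ['Speed', 'Strength']))
data = pd.DataFrame(rows, columns=['team', 'jersey', 'game', 'observer', 'obstype'])
data['value'] = np.random.randn(len(rows))
data = (data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
            .unstack(['observer', 'obstype']))
data.columns = data.columns.droplevel(0)

# 18 rows (team, jersey, game) by 6 columns (observer, obstype).
print(data.shape)
```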

I want to pluck out a subset of this DataFrame for subsequent analysis. Say I wanted to slice out the rows where the jersey number is 71. I don't really like the idea of using xs to do this. When you do a cross section via xs, you lose the index level you selected on. If I run:

data.xs(71, axis=0, level='jersey')

then I get back the right rows, but I lose the jersey level.
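(As it turns out, and as the comments below note, `xs` can keep that level if you pass `drop_level=False`. A sketch, rebuilding the frame from the setup above:)

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild the example frame from the setup above.
np.random.seed(1)
rows = list(product(['Yankees', 'Mets', 'Dodgers'], [35, 71, 84], [1, 2],
                    ['Bill', 'John', 'Ralph'], ['Speed', 'Strength']))
data = pd.DataFrame(rows, columns=['team', 'jersey', 'game', 'observer', 'obstype'])
data['value'] = np.random.randn(len(rows))
data = (data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
            .unstack(['observer', 'obstype']))
data.columns = data.columns.droplevel(0)

# drop_level=False keeps the 'jersey' level in the result's index.
kept = data.xs(71, axis=0, level='jersey', drop_level=False)
print(kept.index.names)  # the jersey level survives
```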


Also, xs doesn't seem like a great solution for the case where I want a few different values from the jersey column. I think a much nicer solution is the one found here:

data[[j in [71, 84] for t, j, g in data.index]]


You could even filter on a combination of jerseys and teams:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]


Nice!

So the question: how can I do something similar for selecting a subset of columns. For example, say I want only the columns representing data from Ralph. How can I do that without using xs? Or what if I wanted only the columns with observer in ['John', 'Ralph']? Again, I'd really prefer a solution that keeps all the levels of the row and column indices in the result...just like the boolean indexing examples above.

I can do what I want, and even combine selections from both the row and column indices. But the only solution I've found involves some real gymnastics:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T


And thus the second question: is there a more compact way to do what I just did above?

8one6
  • great methods, but what is your question? – MattDMo Dec 24 '13 at 04:15
  • @MattDMo I've **bolded** the specific questions above. More generally: I think I've shown some powerful but syntactically ugly recipes above. I was hopeful that there would be a more direct way to accomplish what I did up there. Specifically, I am looking for a method that will restrict the rows based on the values in one or more of the row indices and simultaneously restrict the columns based on the values in one or more of the column indices. Very much hoping someone can suggest a more natural approach. – 8one6 Dec 24 '13 at 04:24
  • 2
    Interesting question. For the one-element filter case, you can pass `drop_level=False` to avoid losing the `Jersey` column. And note that instead of the transpositions, you could write `data.loc[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index], [obs in ['John', 'Ralph'] for obs, obstype in data.columns]]`. – DSM Dec 24 '13 at 05:48
  • @DSM That's also fastest solution so far (just), though loc seems *slow* here... Seems like this could be good feature request. – Andy Hayden Dec 24 '13 at 05:54
  • 1
    @AndyHayden: I wouldn't mind something like `df.fx(rows={"jersey": [71], "team": ["Dodgers", "Mets"]}, columns={"observer": ["John", "Ralph"]})` which basically did what's desired here. – DSM Dec 24 '13 at 06:01
  • Since the column level you want to select is the first level, you can just do `data[['John', 'Ralph']]` for that part. – joris Dec 24 '13 at 08:09
  • @joris, that's the case here, but not the case generally, though of course you're right for the case shown. – 8one6 Dec 24 '13 at 11:45
  • @AndyHayden: `df.fx` is a new feature suggestion, not a current feature? If so, it would be nice to be able to specify admissible index values as: a single value, a list of values, a boolean object whose length matches the length of the index level in question, or a function that returns such a boolean when passed the index level as an argument. So for example: `df.fx(rows={"jersey": lambda x: x<75, "team": ["Dodgers", "Mets"]}, columns={"observer": [False, True, True]})` to get jerseys < 75, teams in ['Dodgers', 'Mets'] and skipping the first observer. – 8one6 Dec 24 '13 at 11:58

4 Answers


As of Pandas 0.18 (possibly earlier) you can easily slice multi-indexed DataFrames using pd.IndexSlice.

For your specific question, you can slice the rows on the (team, jersey, game) index; here, selecting jersey numbers 71 and 84:

data.loc[pd.IndexSlice[:, [71, 84], :], :]  # IndexSlice on the rows

IndexSlice needs just enough level information to be unambiguous, so you can drop the trailing colon:

data.loc[pd.IndexSlice[:, [71, 84]], :]

Likewise, you can use IndexSlice on the columns:

data.loc[pd.IndexSlice[:, [71, 84]], pd.IndexSlice[['John', 'Ralph']]]

Which gives you the final DataFrame in your question.
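The same slicer also reaches non-leading column levels, which was the asker's broader concern. A sketch (rebuilding the question's frame) that keeps jerseys 71 and 84 in the rows and only the 'Speed' columns, with every index level preserved:

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild the question's frame.
np.random.seed(1)
rows = list(product(['Yankees', 'Mets', 'Dodgers'], [35, 71, 84], [1, 2],
                    ['Bill', 'John', 'Ralph'], ['Speed', 'Strength']))
data = pd.DataFrame(rows, columns=['team', 'jersey', 'game', 'observer', 'obstype'])
data['value'] = np.random.randn(len(rows))
data = (data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
            .unstack(['observer', 'obstype']))
data.columns = data.columns.droplevel(0)

idx = pd.IndexSlice
# Slice rows on the second level (jersey) and columns on the second level (obstype).
# If pandas raises an UnsortedIndexError here, call sort_index() on both axes first.
subset = data.loc[idx[:, [71, 84]], idx[:, ['Speed']]]
print(subset.shape)  # 12 rows, 3 columns; both MultiIndexes keep all their levels
```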

Cory Jog

Here is one approach that uses slightly more built-in-feeling syntax. But it's still clunky as hell:

data.loc[
    (data.index.get_level_values('jersey').isin([71, 84])
     & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
    data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]

So comparing:

def hackedsyntax():
    return data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

def uglybuiltinsyntax():
    return data.loc[
        (data.index.get_level_values('jersey').isin([71, 84])
         & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
        data.columns.get_level_values('observer').isin(['John', 'Ralph'])
    ]

%timeit hackedsyntax()
%timeit uglybuiltinsyntax()

hackedsyntax() - uglybuiltinsyntax()

results:

1000 loops, best of 3: 395 µs per loop
1000 loops, best of 3: 409 µs per loop

comparison_of_methods

Still hopeful there's a cleaner or more canonical way to do this.

8one6

Note: Since pandas v0.20, the `ix` accessor has been deprecated; use `loc` or `iloc` instead, as appropriate.

If I've understood the question correctly, it's pretty simple:

To get the column for Ralph:

data.ix[:,"Ralph"]

to get it for two of them, pass in a list:

data.ix[:,["Ralph","John"]]

The ix indexer is the general-purpose indexing operator. Remember that the first argument is rows, and then columns (as opposed to data[..][..], which is the other way around). The colon acts as a wildcard, so it returns all the rows along axis=0.

In general, to do a lookup in a MultiIndex, you should pass in a tuple, e.g.

data.ix[:, ("Ralph", "Speed")]

But if you just pass in a single element, it will treat this as if you're passing in the first element of the tuple and then a wildcard.

Where it gets tricky is if you want to access columns that are not level-0 indices, for example all the columns for "Speed". Then you need to get a bit more creative: use the `get_level_values` method of the index/columns in combination with boolean indexing.

For example, this gets jersey 71 in the rows, and strength in the columns:

data.ix[data.index.get_level_values("jersey") == 71 , \
        data.columns.get_level_values("obstype") == "Strength"]
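Since `ix` was removed entirely in pandas 1.0, here is a sketch of the same selections translated to `loc`, against a rebuild of the question's frame:

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild the question's frame.
np.random.seed(1)
rows = list(product(['Yankees', 'Mets', 'Dodgers'], [35, 71, 84], [1, 2],
                    ['Bill', 'John', 'Ralph'], ['Speed', 'Strength']))
data = pd.DataFrame(rows, columns=['team', 'jersey', 'game', 'observer', 'obstype'])
data['value'] = np.random.randn(len(rows))
data = (data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
            .unstack(['observer', 'obstype']))
data.columns = data.columns.droplevel(0)

ralph = data.loc[:, 'Ralph']              # all of Ralph's columns (obstype level remains)
pair = data.loc[:, ['Ralph', 'John']]     # two observers, both column levels kept
single = data.loc[:, ('Ralph', 'Speed')]  # one (observer, obstype) column -> a Series
filt = data.loc[data.index.get_level_values('jersey') == 71,
                data.columns.get_level_values('obstype') == 'Strength']
```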
Luciano
  • Yes. I am explicitly asking about how to handle filtering based on the non-leading entry in a multi-index. – 8one6 Dec 24 '13 at 22:20
  • I updated the answer in response to your comment. But note the examples you gave in your question were all for level-0 indices. i.e. you can have multi-indices both in the column axis as well as the row axis. In your example, `Jack` is level 0 of the column multi index. Just noticed, it is in fact almost the same as in the answer below. – Luciano Dec 25 '13 at 11:41

Note that from what I understand, select is slow. But another approach here would be:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1)

you can also chain this with a selection against the rows:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1) \
    .select(lambda row: row[1] in [71, 84] and row[2] > 1, axis=0)

The big drawback here is that you have to know the index level number.
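Note also that `DataFrame.select` was deprecated in pandas 0.21 and later removed; the same filters can be written with `loc` plus `get_level_values`, which additionally lets you use level names instead of positions. A sketch against a rebuild of the question's frame:

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild the question's frame.
np.random.seed(1)
rows = list(product(['Yankees', 'Mets', 'Dodgers'], [35, 71, 84], [1, 2],
                    ['Bill', 'John', 'Ralph'], ['Speed', 'Strength']))
data = pd.DataFrame(rows, columns=['team', 'jersey', 'game', 'observer', 'obstype'])
data['value'] = np.random.randn(len(rows))
data = (data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
            .unstack(['observer', 'obstype']))
data.columns = data.columns.droplevel(0)

# Columns where the 'observer' level is John or Ralph.
cols = data.loc[:, data.columns.get_level_values('observer').isin(['John', 'Ralph'])]

# Chain a row filter by level name rather than position: jersey in {71, 84} and game > 1.
both = cols.loc[data.index.get_level_values('jersey').isin([71, 84])
                & (data.index.get_level_values('game') > 1)]
```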

Paul H