0

I would like to build a function that makes working with dynamic multi-index dataframe filtering more user friendly.

For example, the function takes a dictionary of index labels and filter values, and tuple of metrics.

In order for this to work, the function needs to make no assumptions about the existence or ordering of index labels. The closest thing I've found is the df.xs().

sample code:

df = pd.DataFrame({'lab1': np.random.choice(['A','B','C'],100,replace=True), 'lab2': np.random.choice(['one','two','three','four'],100,replace=True), 'val': np.random.rand(100)})
df = df.groupby(['lab1','lab2']).sum()

                 val
lab1 lab2           
A    four   3.296221
     one    5.057798
     three  3.443166
     two    3.913044
B    four   3.815448
     one    3.892152
     three  2.995777
     two    9.715343
C    four   6.118737
     one    3.735783
     three  2.461903
     two    5.252095

here's a static example using .xs():

 df.xs(('A', slice('one','three')), level=['lab1','lab2'])
                 val
lab1 lab2           
A    one    5.057798
     three  3.443166

The issue seems to be that you can't easily pass a list argument into slice(). I've tried using pd.IndexSlice, map, lambda, etc but can't get it to work.

What id like to get is this:

filters = {
'lab1': 'A',
'lab2' : ('one','three'),
metrics = ('val')
}
def metric_ts(filters, metrics):
    levels = list(filters.keys()) + ['metric_name']
    keys = map(slice, list(filters.values()))
    return df_norm.xs(keys, levels)

Note: I understand that there are several ways to do this using .loc[], etc. I am looking for a very general solution that does not rely on positional syntax. Thanks!

Jonathan
  • 781
  • 8
  • 20

2 Answers2

1

Not sure how you'd do this with xs, but you can use DataFrame.query, provided you dynamically build your query string.

filters = {
'lab1': 'A',
'lab2' : ('one','three'),
}
metrics = 'val'

globals().update(filters) 

querystr = ' and '.join([
    f"{k} {'==' if isinstance(v, (str, np.number)) else 'in'} @{k}" 
    for k, v in filters.items()])

df.query(querystr)[metrics]  

lab1  lab2 
A     one      4.041335
      three    4.923771
Name: val, dtype: float64

Similar examples can be seen here.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • thanks for the response, that indeed works, however it seems really opaque considering all the structure built around multi-index slicing. `df.xs()` would work well with a dictionary without the need to parse in to query string.. just need to figure out a way to wrap the dict items in `slice` in the correct format, then should work. – Jonathan Dec 04 '18 at 23:45
  • @Jonathan It is hard to argue readability for `xs` too. I would actually think this makes more sense. All it does is enumerate the conditions, making sure the operators "==" and "in" are used appropriately. The only thing I dislike about this answer is that you are forced to inject variables into the global namespace for this to work. – cs95 Dec 04 '18 at 23:50
  • i was originally worried also about performance, but tested it out and seems to work well. My only reservations are potential unanticipated behavior since it goes through so many layers (and my own gap in understanding since i really expected a more elegant solution from pandas). Why does it require global injection? are there any workarounds? – Jonathan Dec 05 '18 at 00:57
  • @Jonathan I can't deny there could be a better solution, but this is one of those esoteric cases I haven't seen around often, so I wouldn't know. Your best bet would be to let this sit for a couple of days, and then if you still don't have an answer, you can ping me and I'll instate a bounty on the question for more attention. – cs95 Dec 05 '18 at 01:00
0

I figured out how to do this using the .xs() method. The trick is to wrap multiple labels in slice() in the filter dict before passing in to the function. IMO I think this is a little cleaner than parsing the dict and using .query().

The only issue now is that slice() is returning the continuous slice based on index order, (I want it only to return the values specified). Hopefully someone can expand on this.

df = pd.DataFrame({'lab1': np.random.choice(['A','B','C'],100,replace=True), 'lab2': np.random.choice(['one','two','three','four'],100,replace=True), 'val': np.random.rand(100)})
df = df.groupby(['lab1','lab2']).sum()

filters = {
'lab1': slice('A','C'),
'lab2' : slice('one','two')
}

def return_slice(filters):
    slices = pd.IndexSlice[tuple(filters.values())]
    levels = list(filters.keys())
    return df.xs(key=slices, level=levels,drop_level=False)

return_slice(filters)

                 val
lab1 lab2           
A    one    3.094135
     three  4.458957
     two    6.896360
B    one    2.917692
     three  6.754484
     two    4.023079
C    one    4.464885
     three  5.982234
     two    4.421695
Jonathan
  • 781
  • 8
  • 20