12

I'm using a MultiIndexed pandas DataFrame and would like to multiply a subset of the DataFrame by a certain number.

It's the same as this but with a MultiIndex.

>>> d = pd.DataFrame({'year':[2008,2008,2008,2008,2009,2009,2009,2009], 
                      'flavour':['strawberry','strawberry','banana','banana',
                      'strawberry','strawberry','banana','banana'],
                      'day':['sat','sun','sat','sun','sat','sun','sat','sun'],
                      'sales':[10,12,22,23,11,13,23,24]})

>>> d = d.set_index(['year','flavour','day'])                  

>>> d
                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23
2009 strawberry sat     11
                sun     13
     banana     sat     23
                sun     24

So far, so good. But let's say I spot that all the Saturday figures are only half what they should be! I'd like to multiply all sat sales by 2.

My first attempt at this was:

sat = d.xs('sat', level='day')
sat = sat * 2
d.update(sat)

but this doesn't work because the variable sat has lost the day level of the index:

>>> sat
                 sales
year flavour          
2008 strawberry     20
     banana         44
2009 strawberry     22
     banana         46

so pandas doesn't know how to join the new sales figures back onto the old dataframe.

I had a quick stab at:

>>> sat = d.xs('sat', level='day', copy=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs
    raise ValueError('Cannot retrieve view (copy=False)')
ValueError: Cannot retrieve view (copy=False)

I have no idea what that error means, but I feel like I'm making a mountain out of a molehill. Does anyone know the right way to do this?

Thanks in advance, Rob

Community
  • 1
  • 1
LondonRob
  • 73,083
  • 37
  • 144
  • 201
  • 1
    For those users looking for a more detailed solution on MultiIndexing, [check this answer below](https://stackoverflow.com/a/47381551/3707607) – Ted Petrou Nov 19 '17 at 20:16
  • If you are interested in learning more about slicing and filtering multiindex DataFrames, please take a look at my post: [How do I slice or filter MultiIndex DataFrame levels?](https://stackoverflow.com/questions/53927460/how-do-i-slice-or-filter-multiindex-dataframe-levels). – cs95 Jan 05 '19 at 07:09

2 Answers2

12

Note: In soon to be released 0.13 a drop_level argument has been added to xs (thanks to this question!):

In [42]: df.xs('sat', level='day', drop_level=False)
Out[42]:
                     sales
year flavour    day
2008 strawberry sat     10

Another option is to use select (which extracts a sub-DataFrame (copy) of the same data, i.e. it has the same index and so can be updated correctly):

In [11]: d.select(lambda x: x[2] == 'sat') * 2
Out[11]:
                     sales
year flavour    day
2008 strawberry sat     20
     banana     sat     44
2009 strawberry sat     22
     banana     sat     46

In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)

Another option is to use an apply:

In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)

Another option is to use get_level_values (this is probably the most efficient way of these):

In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2

Another option is promote the 'day' level to a column and then use an apply.

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Yes, it's a working solution but using `lambda`s for such a (seemingly) simple task feels wrong. The SQL equivalent is `UPDATE table SET col = col * 2 WHERE day = 'sat'`. I wonder whether `xs` should include an option to keep the index level selected on. – LondonRob Jul 09 '13 at 16:40
  • @LondonRob yeah (update is the thing I don't really like, lambdas aren't too bad), it does I was wondering the same thing, perhaps worth adding as an issue (if we're quick it could be in 0.12 out next week). Definitely some scope for improvement. – Andy Hayden Jul 09 '13 at 16:43
  • 1
    As a side comment, it seems like using a MultiIndex makes everything harder. I can't really work out what it makes easier!! – LondonRob Jul 09 '13 at 16:47
  • ha ha! Yes. Well, that would be amazing, but I imagine the developers will be scrambling to get stuff finished, rather than wanting to add new functionality. – LondonRob Jul 09 '13 at 16:48
  • I think the thing it makes faster is indexing, but you're right it makes calculations harder (perhaps using the index in the calculation isn't best practice?_) – Andy Hayden Jul 09 '13 at 17:26
  • 1
    @LondonRob `d[d.index.get_level_values('day') == 'sat'] *= 2` is I think faster... – Andy Hayden Jul 10 '13 at 00:09
  • Not means to bother,but how the `d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)` work? The param x is the `d.index.names`? – ileadall42 Oct 13 '17 at 07:13
  • @Tangfeifan kinda, it's a row/Series and the name attribute (for a MultiIndex) is a tuple. – Andy Hayden Oct 13 '17 at 16:02
  • @AndyHaydenThanks!Got it. – ileadall42 Oct 14 '17 at 03:25
9

Detailed MultiIndexing Explanation

You can use the .loc indexer to select subsets of data from a DataFrame with a MultiIndex. Assuming we have the data from the original question:

                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23
2009 strawberry sat     11
                sun     13
     banana     sat     23
                sun     24

This DataFrame has 3 levels in its index and each level has a name (year, flavour and day). The levels are also implicitly given integer locations beginning with 0 from the outside. So, the year level can be referenced as 0, flavour with 1, and day as 2.

Selecting from level 0 - the outermost level

Level 0 is the easiest level to make a selection with. For instance, if we wanted to select just the year 2008, we could do the following:

df.loc[2008]

                sales
flavour    day       
strawberry sat     10
           sun     12
banana     sat     22
           sun     23

This drops the outer index level. If you wanted to keep the outer level, you could pass your selection as a list (or a slice):

df.loc[[2008]]  # df.loc[2008:2008] gets the same result

                     sales
year flavour    day       
2008 strawberry sat     10
                sun     12
     banana     sat     22
                sun     23

Making selections from the other levels

Making selections from any level other than level 0 is more complicated. Let's begin by selecting a specific combination like the year 2008, banana and sat. To do this, you pass the combination as a tuple to .loc:

df.loc[(2008, 'banana', 'sat')]

sales    22
Name: (2008, banana, sat), dtype: int64

I always use parentheses like the above but Python will automatically interpret any comma-separated set of values as tuple so the following will get the same result:

df.loc[2008, 'banana', 'sat']

All levels were dropped and a Series returned. We can keep the levels by passing the tuple inside of a list:

df.loc[[(2008, 'banana', 'sat')]]

                  sales
year flavour day       
2008 banana  sat     22

Selecting multiple values from a particular level

The previous example made a single selection from each level. It's possible to use a list to contain all the values of the level you desire. For instance, if we wanted to select all rows with year 2008 and 2009, with banana flavour and on saturday and sunday, we could do the following:

df.loc[([2008, 2009], 'banana', ('sat','sun'))]

                  sales
year flavour day       
2008 banana  sat     22
             sun     23
2009 banana  sat     23
             sun     24

Again, you don't have to wrap the whole selection in paraentheses to denote a tuple and can simply do:

df.loc[[2008, 2009], 'banana', ('sat','sun')]

Selecting all values from a particular level.

You may instead want to select all values from a particular level. For instance, let's try to select all the years, all the flavours and just saturday. You might think the following would work:

df.loc[:, :, 'sat']

But, this is met with a 'too many indexer's IndexError. There are three different ways to select all values from a particular level.

  • df.loc[(slice(None), slice(None), 'sat'), :]
  • df.loc(axis=0)[:, :, 'sat']
  • df.loc[pd.IndexSlice[:, :, 'sat'], :]

All three yield the following:

                     sales
year flavour    day       
2008 strawberry sat     10
     banana     sat     22
2009 strawberry sat     11
     banana     sat     23
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • `df.loc(axis=0)[:, :, 'sat']` Woah.... Okay. Can you explain or point me to docs on using axis parameter with `loc`? +1 – Scott Boston Nov 21 '17 at 13:55
  • @ScottBoston Yea, I didn't know it either until I [read the advanced indexing docs again](http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers). You'll have to go down a page or so from there. Looks like it was added [way back in 0.14](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-14-0-may-31-2014) from may, 2014 – Ted Petrou Nov 21 '17 at 14:35
  • @TedPetrou please can you add selectors which allow setting all values at a particular level? – Steve Lorimer Jun 29 '18 at 13:32
  • I'm trying `df.loc[('2008', 'banana', 'sat'), 'sales']` and getting this error: KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['00'], dtype='object', name='master_part_no'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike" – Jonathan Biemond Sep 14 '20 at 18:43