6

I have a 'df' which have a multilevel index (STK_ID,RPT_Date)

                       sales         cogs     net_pft
STK_ID RPT_Date                                      
000876 20060331          NaN          NaN         NaN
       20060630    857483000    729541000    67157200
       20060930   1063590000    925140000    50807000
       20061231    853960000    737660000    51574000
       20070331  -2695245000  -2305078000  -167642500
       20070630   1146245000   1050808000   113468500
       20070930   1327970000   1204800000    84337000
       20071231   1439140000   1331870000    53398000
       20080331  -3135240000  -2798090000  -248054300
       20080630   1932470000   1777010000   133756300
       20080930   1873240000   1733660000    92099000
002254 20061231 -16169620000 -15332705000  -508333200
       20070331   -763844000   -703460000    -1538000
       20070630    501221000    289167000   118012200
       20070930    460483000    274026000    95967000

How to write a command to filter the rows whose 'RPT_Date' contains '0630' (which is the Q2 report) ? the result should be :

                       sales         cogs     net_pft
STK_ID RPT_Date                                      
000876 20060630    857483000    729541000    67157200
       20070630   1146245000   1050808000   113468500
       20080630   1932470000   1777010000   133756300
002254 20070630    501221000    289167000   118012200

I am trying to use df[df['RPT_Date'].str.contains('0630')], but Pandas refuses to work as 'RPT_Date' is not a column but a sub_level index.

Thanks for your tips ...

piRSquared
  • 285,575
  • 57
  • 475
  • 624
bigbug
  • 55,954
  • 42
  • 77
  • 96

1 Answers1

17

To use the "str.*" methods on a column, you could reset the index, filter rows with a column "str.*" method call, and re-create the index.

In [72]: x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
Out[72]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200

However, this approach is not particularly fast.

In [73]: timeit x = df.reset_index(); x[x.RPT_Date.str.endswith("0630")].set_index(['STK_ID', 'RPT_Date'])
1000 loops, best of 3: 1.78 ms per loop

Another approach builds on the fact that a MultiIndex object behaves much like a list of tuples.

In [75]: df.index
Out[75]: 
MultiIndex
[('000876', '20060331') ('000876', '20060630') ('000876', '20060930')
 ('000876', '20061231') ('000876', '20070331') ('000876', '20070630')
 ('000876', '20070930') ('000876', '20071231') ('000876', '20080331')
 ('000876', '20080630') ('000876', '20080930') ('002254', '20061231')
 ('002254', '20070331') ('002254', '20070630') ('002254', '20070930')]

Building on that, you can create a boolean array from a MultiIndex with df.index.map() and use the result to filter the frame.

In [76]: df[df.index.map(lambda x: x[1].endswith("0630"))]
Out[76]: 
                      sales        cogs    net_pft
STK_ID RPT_Date                                   
000876 20060630   857483000   729541000   67157200
       20070630  1146245000  1050808000  113468500
       20080630  1932470000  1777010000  133756300
002254 20070630   501221000   289167000  118012200

This is also quite a bit faster.

In [77]: timeit df[df.index.map(lambda x: x[1].endswith("0630"))]
1000 loops, best of 3: 240 us per loop
Garrett
  • 47,045
  • 6
  • 61
  • 50
  • Thank you for the tips. It is simple and elegant. And if I want to filter the Q1,Q3,Q4 together, which is "NOT endswith('0630')", how to add the 'NOT' to the command of "df[df.index.map(lambda x: x[1].endswith("0630"))] " ? – bigbug Sep 02 '12 at 14:05
  • 1
    Couple thoughts: 1) use ``not`` inside the lambda (e.g., ``df[df.index.map(lambda x: not x[1].endswith("0630"))]``). This works, but iterates over the index again. 2) if you save the result of df.index.map() into a variable, that variable will be a numpy array, which can be inverted with ``~`` (e.g. ``q2_mask = df.index.map(lambda x: x[1].endswith("0630")); df[~q2_mask]``). Hope that helps. – Garrett Sep 03 '12 at 19:07
  • Cool! Thanks man! Finding it for a while, it allows to filter by no trivial multi-index query like `df[df.index.map(lambda i: i[1] < another_df.xs(i[0])['DATE'])]`, here we take a value from first index component and select all rows less that corresponding value from another data frame for second index. – Alexander Nov 21 '18 at 00:05
  • 3
    You can also access the index level by name: `df[df.index.get_level_values('RPT_Date').str.endswith('0630')]` – Gregor Müllegger May 07 '19 at 13:24