94

I have a DataFrame df which looks like this; date and Time are the two levels of a MultiIndex:

                           observation1   observation2
date          Time                             
2012-11-02    9:15:00      79.373668      224
              9:16:00      130.841316     477
2012-11-03    9:15:00      45.312814      835
              9:16:00      123.776946     623
              9:17:00      153.76646      624
              9:18:00      463.276946     626
              9:19:00      663.176934     622
              9:20:00      763.77333      621
2012-11-04    9:15:00      115.449437     122
              9:16:00      123.776946     555
              9:17:00      153.76646      344
              9:18:00      463.276946     212

I want to run some complex processing over each daily block of data.

The pseudocode would look like this:

 for date in (unique level-0 values of df's index):
     new_df = chunk of df for that date
     complex_process(new_df)

First of all, I could not find a way to access only the block for a single date, e.g.

2012-11-03    9:15:00      45.312814      835
              9:16:00      123.776946     623
              9:17:00      153.76646      624
              9:18:00      463.276946     626
              9:19:00      663.176934     622
              9:20:00      763.77333      621

and then send it for processing. I am doing this in a for loop, as I am not sure there is a way to do it without specifying the exact value of the level-0 index. I did some basic searching and found df.index.get_level_values(0), but it returns all the values, which causes the loop to run multiple times for a given day. I want to create a DataFrame per day and send it for processing.

Yantraguru

5 Answers

152

One easy way would be to groupby the first level of the index: iterating over the groupby object returns the group key and a subframe containing each group.

In [136]: for date, new_df in df.groupby(level=0):
     ...:     print(new_df)
     ...:     
                    observation1  observation2
date       Time                               
2012-11-02 9:15:00     79.373668           224
           9:16:00    130.841316           477

                    observation1  observation2
date       Time                               
2012-11-03 9:15:00     45.312814           835
           9:16:00    123.776946           623
           9:17:00    153.766460           624
           9:18:00    463.276946           626
           9:19:00    663.176934           622
           9:20:00    763.773330           621

                    observation1  observation2
date       Time                               
2012-11-04 9:15:00    115.449437           122
           9:16:00    123.776946           555
           9:17:00    153.766460           344
           9:18:00    463.276946           212

You can also use droplevel to remove the first index level (the date level, which is constant within each group):

In [136]: for date, new_df in df.groupby(level=0):
     ...:     print(new_df.droplevel(0))
     ...:
         observation1  observation2
Time
9:15:00     79.373668           224
9:16:00    130.841316           477
...
chrisb
  • That helps! I was following a rather roundabout way - first finding the index label and then slicing it using iloc. – Yantraguru Sep 21 '14 at 14:08
  • Where has this been for the last 3 years of my life? Such a clean method, thank you! – rgk Sep 05 '20 at 18:04
  • This method is neat, but how do you apply changes from these subframes back to the main dataframe? Looking them up through loc or iloc makes the computation extremely slow. – Manaslu Sep 23 '20 at 06:56
  • @Manaslu if you can wrap the changes into a function you can use `df.groupby('key').apply(function)` (see the sketch just after these comments) – bendl Oct 23 '20 at 14:23
  • Great! Fast and clean way to solve the problem. This is much faster than iterating over dates and getting `df.loc[date]`. – Hunaphu Aug 20 '21 at 12:17
  • I am facing a similar problem. Can this multiindex dataframe be grouped by observation instead of date? – Soumya C Feb 18 '22 at 21:45
  • @SoumyaC You can group by "time" with `df.groupby(level=1)` (or `df.groupby(level="Time")` ... or iterate column by column with `df.items()` – Jean-Francois T. May 20 '22 at 06:40
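Regarding the comment about applying changes back to the main frame: below is a minimal sketch of the groupby(...).apply(...) pattern @bendl mentions. The frame construction only mirrors the question's data for illustration, and normalize_block is a hypothetical per-day transformation, not part of the original answer.

import pandas as pd

# Illustrative frame mirroring the question's layout (first four rows).
idx = pd.MultiIndex.from_tuples(
    [("2012-11-02", "9:15:00"), ("2012-11-02", "9:16:00"),
     ("2012-11-03", "9:15:00"), ("2012-11-03", "9:16:00")],
    names=["date", "Time"],
)
df = pd.DataFrame(
    {"observation1": [79.373668, 130.841316, 45.312814, 123.776946],
     "observation2": [224, 477, 835, 623]},
    index=idx,
)

# Hypothetical per-day transformation: scale observation1 within each day.
def normalize_block(block):
    block = block.copy()
    block["observation1"] /= block["observation1"].max()
    return block

# group_keys=False keeps the original two-level index instead of adding
# the group key as an extra index level.
result = df.groupby(level=0, group_keys=False).apply(normalize_block)
print(result)

Because the function receives a regular DataFrame for each day, the complex_process from the question could be dropped in here in place of normalize_block.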
8

What about this?

for idate in df.index.get_level_values('date'):
    complex_process(df.loc[idate], idate)
psorenson
  • Careful with this solution: each value of idate can be hit multiple times. You should be doing for idate in np.unique(df.index.get_level_values('date')): - note the additional np.unique. – H. Brandsmeier Jun 02 '16 at 09:02
  • I think `df.index.get_level_values('date').unique()` may be better, as @melbay pointed out. – Nate Stemen Jan 11 '18 at 19:53
5

Building on @psorenson's answer, we can get the unique level values and their related DataFrame slices without numpy as follows:

for date in df.index.get_level_values('date').unique():
    print(df.loc[date])
melbay
3

Late to the party, I found that the following works, too:

for date in df.index.unique("date"):
    print(df.loc[date])

It uses the optional level parameter of the Index.unique method, introduced in pandas 0.23.0.

You can specify either the level number or label.
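For instance (a small illustration added here, not part of the original answer), both forms return the same unique dates:

df.index.unique("date")  # refer to the level by its name
df.index.unique(0)       # or by its position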

sanzoghenzo
2

Another alternative:

for date in df.index.levels[0]:
    print(df.loc[date])

The difference from the df.index.unique("date") approach proposed by @sanzoghenzo is that it refers to the index level by its position rather than by its name.
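One caveat worth adding (not from the original answer): after the frame has been filtered, df.index.levels[0] can still list dates that no longer appear in any row, because a MultiIndex keeps its original level categories. A small sketch of the difference, using a hypothetical filter:

subset = df[df["observation2"] > 500]                  # hypothetical filter for illustration
print(subset.index.levels[0])                          # may still list every original date
print(subset.index.remove_unused_levels().levels[0])   # only the dates still present
print(subset.index.unique("date"))                     # likewise, only the dates still present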

Roger Vadim