Get index of where group starts and ends pandas

Question

I grouped my data by month. Now I need to know at which observation/index each group starts and ends. What I have is the following output, where the second column is the number of observations in each month:

date
01       145
02      2232
03     12785
04     16720
Name: date, dtype: int64

with this code:

leave.groupby([leave['date'].dt.strftime('%m')])['date'].count()

What I want, though, is an index range I can access later. Something like this (the format doesn't really matter, and I don't mind whether it returns a list or a data frame):

date
01       0 - 145
02      146 - 2378
03     2378 - 15163
04     15164 - 31884

yes exactly, (so that I can access it later for indexing in a loop) — LN_P, Jul 11 '18 at 14:27
This question would benefit from a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — C8H10N4O2, Jul 13 '18 at 13:47

gyx-hh · Answer 1 · 2018-07-11T17:07:01.750


Try the following, using `shift`:

df['data'] = df['data'].shift(1).add(1).fillna(0).apply(int).apply(str) + ' - ' + df['data'].apply(str)

OUTPUT:

     data
date    
1    0 - 145
2    146 - 2232
3    2233 - 12785
4    12786 - 16720
5    16721 - 30386
6    30387 - 120157
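
If the column holds per-month counts rather than running totals, the ranges can be built from a cumulative sum instead. This is a minimal sketch, not the answerer's method. It assumes the rows of the original frame are already sorted by month (so each group is one contiguous block) and uses zero-based, inclusive positions. The `counts` series is hypothetical, copied from the first four rows of the question's output.

```python
import pandas as pd

# Hypothetical per-month counts, taken from the question's output.
counts = pd.Series([145, 2232, 12785, 16720],
                   index=['01', '02', '03', '04'], name='date')

# Running total of the counts = exclusive end position of each group.
end_exclusive = counts.cumsum()

# Each group starts where the previous one ended (0 for the first group).
start = end_exclusive.shift(1, fill_value=0)

# Zero-based, inclusive start/end positions per month.
ranges = pd.DataFrame({'start': start, 'end': end_exclusive - 1})
print(ranges)
```

With positions like these, `leave.iloc[start:end + 1]` would select one month's rows, provided the sorting assumption holds.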

edited Jul 11 '18 at 17:07

answered Jul 11 '18 at 14:36


You forgot to add `1` – jezrael Jul 11 '18 at 14:41
like `df['data'].shift(1).fillna(0).add(1).apply(int).apply(str)` – jezrael Jul 11 '18 at 14:42
You can avoid the `int` to `str` using string formatting and use `.cat` instead of `+` - eg: `df.date.shift(1).fillna(0).add(1).apply('{:.0f}'.format).str.cat(df.date.astype(str), sep=' - ')` – Jon Clements Jul 11 '18 at 14:52

C8H10N4O2 · Answer 2 · 2018-07-13T13:50:22.027

I think you are asking for a data frame containing the indices of first and last occurrences of each value.

How about something like this.

Example data (note -- it's better to include reproducible data in your question so I don't have to guess):

import pandas as pd
import numpy as np

np.random.seed(123)
n = 500
df = pd.DataFrame({
    'date': pd.to_datetime(pd.DataFrame({
        'year':  np.random.choice(range(2017, 2019), size=n),
        'month': np.random.choice(range(1, 13), size=n),
        'day':   np.random.choice(range(1, 28), size=n),
    }))
})

Approach:

pd.DataFrame(
    {'_month_': x, 'firstIndex': y[0], 'lastIndex': y[-1]}
    for x, y in df.index.groupby(df['date'].dt.month).items()
)

Result:

    _month_  firstIndex  lastIndex
0         1           0        495
1         2          21        499
2         3           1        488
3         4           5        498
4         5          14        492
5         6          12        470
6         7          15        489
7         8           2        494
8         9          18        475
9        10           3        491
10       11          10        473
11       12           7        497
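
Note that on unsorted data the first and last index of a month do not bound a contiguous block, since rows of other months fall in between. If a contiguous range per month is what's needed, one option is to sort first and then take the minimum and maximum row position per group. A minimal sketch, assuming a small hypothetical frame named `leave` standing in for the asker's data, and sorting by month only:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the asker's `leave`.
leave = pd.DataFrame({'date': pd.to_datetime(
    ['2018-03-05', '2018-01-10', '2018-02-01',
     '2018-01-20', '2018-03-15', '2018-02-11'])})

# Sort by month so each group is one contiguous block, then renumber the rows.
leave = (leave.assign(month=leave['date'].dt.month)
              .sort_values('month', kind='stable')
              .reset_index(drop=True))

# Row positions 0..n-1; their min/max per month are the block boundaries.
pos = pd.Series(np.arange(len(leave)), index=leave.index)
bounds = pos.groupby(leave['month']).agg(['min', 'max'])
bounds.columns = ['firstIndex', 'lastIndex']
print(bounds)
```

After `reset_index(drop=True)` the labels and positions coincide, so the boundaries work with either `.loc` (inclusive) or `.iloc` (end exclusive).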

If you are only going to use it for indexing in a loop, you wouldn't have to wrap it in pd.DataFrame(); you could just leave it as a generator.
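
For the loop case, a rough sketch of what that could look like, iterating over the dict that `Index.groupby` returns. The five-row `df` here is hypothetical and already sorted by date, so each month's labels form a contiguous range. The `sizes` dict is only there to show that each slice picks up the expected rows.

```python
import pandas as pd

# Hypothetical data, already sorted by date.
df = pd.DataFrame({'date': pd.to_datetime(
    ['2018-01-10', '2018-01-20', '2018-02-01', '2018-02-11', '2018-03-05'])})

# Index.groupby returns a dict mapping each month to the Index of its row labels.
groups = df.index.groupby(df['date'].dt.month)

sizes = {}
for month, labels in groups.items():
    first, last = labels[0], labels[-1]
    # .loc slicing is label-based and includes both endpoints.
    block = df.loc[first:last]
    sizes[int(month)] = len(block)

print(sizes)
```

If the frame is not sorted, `df.loc[labels]` selects exactly the group's rows without relying on a contiguous range.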

That's exactly what I want. But unfortunately it doesn't work. I am working with dates, so my column I group by is a date column. I get the following error: TypeError: 'Series' objects are mutable, thus they cannot be hashed. `leave.sort_values(by='date', inplace=True) pd.DataFrame( ( { 'group':x,'firstIndex':y[0],'lastIndex':y[-1]} for x, y in leave.index.groupby([leave['date'].dt.strftime('%m')])['date'].items() ) )` — LN_P, Jul 12 '18 at 08:10
@LN_P instead of `.groupby([leave['date'].dt.strftime('%m')])['date'].items()`, try `.groupby(leave['date'].dt.month).items()`. There is no need to use `strftime`, and I'm not sure what you're doing with the `date` column there. I will update my answer. – C8H10N4O2, Jul 13 '18 at 13:46
