
I am trying to find a nice clean Python/Pandas way to take the following dataframe:

ID  Date_Field Category 
12345 01/01/12 AAAA
12345 01/02/12 AAAA
12345 01/03/12 AAAA
12345 01/04/12 BBBB
12345 01/05/12 BBBB
12345 01/06/12 BBBB
2468  01/01/12 AAAA
2468  01/02/12 AAAA
2468  01/03/12 AAAA
2468  01/04/12 BBBB
2468  01/05/12 BBBB
2468  01/06/12 BBBB
2468  01/07/12 BBBB
2468  01/08/12 CCCC
2468  01/09/12 CCCC
2468  01/10/12 AAAA
2468  01/11/12 AAAA

And convert it to a dataframe that looks similar to this, i.e., where I have the begin/end date per ID/Category, with the caveat that an ID/Category combo can occur multiple times if there was a date gap between occurrences:

ID  Start_Date End_Date Category
12345 01/01/12 01/03/12 AAAA
12345 01/04/12 01/06/12 BBBB
2468  01/01/12 01/03/12 AAAA
2468  01/10/12 01/11/12 AAAA
2468  01/04/12 01/07/12 BBBB
2468  01/08/12 01/09/12 CCCC

I know I can do it via iteration and comparison to previous row values, but I have a feeling there's a far cleaner approach.
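
For reference, the frame can be reproduced with something along these lines (the exact loading step doesn't matter, this is just one way to paste the sample in):

import pandas as pd
from io import StringIO

data = """\
ID Date_Field Category
12345 01/01/12 AAAA
12345 01/02/12 AAAA
12345 01/03/12 AAAA
12345 01/04/12 BBBB
12345 01/05/12 BBBB
12345 01/06/12 BBBB
2468 01/01/12 AAAA
2468 01/02/12 AAAA
2468 01/03/12 AAAA
2468 01/04/12 BBBB
2468 01/05/12 BBBB
2468 01/06/12 BBBB
2468 01/07/12 BBBB
2468 01/08/12 CCCC
2468 01/09/12 CCCC
2468 01/10/12 AAAA
2468 01/11/12 AAAA
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')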


1 Answer

You can do this with a groupby. To get just the start date per ID/Category, take the minimum:

>>> df.groupby(['ID','Category']).min()

               Date_Field
ID    Category           
2468  AAAA       01/01/12
      BBBB       01/04/12
      CCCC       01/08/12
12345 AAAA       01/01/12
      BBBB       01/04/12

To get both the minimum and the maximum in one pass, aggregate with both:

>>> df.groupby(['ID','Category']).agg({'Date_Field' : [min, max]})

               Date_Field          
                      min       max
ID    Category                     
2468  AAAA       01/01/12  01/11/12
      BBBB       01/04/12  01/07/12
      CCCC       01/08/12  01/09/12
12345 AAAA       01/01/12  01/03/12
      BBBB       01/04/12  01/06/12
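
One caveat: with Date_Field stored as strings, min and max compare lexicographically. That happens to give the right answer for this sample, but converting to real datetimes first is safer (the format string below assumes month/day/two-digit-year):

df['Date_Field'] = pd.to_datetime(df['Date_Field'], format='%m/%d/%y')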


Edit:

The groupby returns a DataFrame with the grouped columns as the index; to keep them as regular columns instead, add as_index=False:

>>> df.groupby(['ID','Category'], as_index=False).agg({'Date_Field' : [min, max]})

      ID Category Date_Field          
                         min       max
0   2468     AAAA   01/01/12  01/11/12
1   2468     BBBB   01/04/12  01/07/12
2   2468     CCCC   01/08/12  01/09/12
3  12345     AAAA   01/01/12  01/03/12
4  12345     BBBB   01/04/12  01/06/12


Edit 2:

To separate rows that share the same ID and Category but belong to different time intervals, we have to add a new grouping column:

df['Cat2'] = 0
for i in range(1, len(df)):
    # carry the previous group number forward while Category is unchanged
    if df['Category'].iloc[i] == df['Category'].iloc[i - 1]:
        df.loc[df.index[i], 'Cat2'] = df['Cat2'].iloc[i - 1]
    else:
        # Category changed: start a new group
        df.loc[df.index[i], 'Cat2'] = df['Cat2'].iloc[i - 1] + 1

This loop builds a new Cat2 column that keeps the same value as long as the Category matches the previous row, and increments by one whenever it changes. Note that this method depends entirely on the order of the rows! Now we can add Cat2 to the groupby keys:

>>> df.groupby(['ID','Category','Cat2'], as_index=False).agg({'Date_Field' : [min, max]})

      ID Category Cat2 Date_Field          
                              min       max
0   2468     AAAA    2   01/01/12  01/03/12
1   2468     AAAA    5   01/10/12  01/11/12
2   2468     BBBB    3   01/04/12  01/07/12
3   2468     CCCC    4   01/08/12  01/09/12
4  12345     AAAA    0   01/01/12  01/03/12
5  12345     BBBB    1   01/04/12  01/06/12
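
If you prefer to avoid the explicit loop, the same run counter can be built with a shift/cumsum comparison. This is just a sketch of an alternative; like the loop, it assumes the rows are already in the order you want to group them in, and the Start_Date/End_Date names are only chosen to match the desired output:

# new group whenever Category differs from the previous row
df['Cat2'] = (df['Category'] != df['Category'].shift()).cumsum()

out = (df.groupby(['ID', 'Category', 'Cat2'], as_index=False)
         .agg({'Date_Field': [min, max]}))
# flatten the MultiIndex columns into plain names
out.columns = ['ID', 'Category', 'Cat2', 'Start_Date', 'End_Date']

From there, Cat2 can be dropped if it is no longer needed.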