3

I have a dataframe that has intervals and a label associated with each. I need to group and aggregate rows separated by a given distance from others.

For example, groups rows whose start/end are within 3 units of the the start/end of other rows have their label fields concatenated:

In [16]: df = pd.DataFrame([
    ...:     [ 1, 3,'a'], [ 4,10,'b'],
    ...:     [15,17,'c'], [18,20,'d'],
    ...:     [27,30,'e'], [31,40,'f'], [41,42,'g'],
    ...:     [50,54,'h']],
    ...:     columns=['start', 'end', 'label'])
    ...:

In [17]: df
Out[17]:
   start  end label
0      1    3     a
1      4   10     b
2     15   17     c
3     18   20     d
4     27   30     e
5     31   40     f
6     41   42     g
7     50   54     h

Desired output:

In [18]: df_desired = group_by_interval(df)
In [19]: df_desired
Out[19]:
   start  end  label
0      1   10    a b
1     15   20    c d
2     27   30  e f g
3     50   54      h

How can I execute this sort of grouping by interval with a dataframe?

I have found one similar SO here, but it's a little different since I don't know where to cut a priori.

Vaishali
  • 37,545
  • 5
  • 58
  • 86
saladi
  • 3,103
  • 6
  • 36
  • 61
  • 1
    Is it guaranteed that `start: end` for one row never overlaps `start: end` for another row? It can become complicated if you can have something like `z 33:34` thrown in with your above `DataFrame` – ALollz Jan 25 '19 at 21:09
  • It's not ... I think I may want to open up another question for this though... – saladi Jan 30 '19 at 22:42
  • Yes, that situation is very complicated. – ALollz Jan 30 '19 at 22:50

1 Answers1

4

You can create a grouper based on the condition and aggregate

grouper = ((df['start'] - df['end'].shift()) > 3).cumsum()

df.groupby( grouper).agg({'start' : 'first', 'end' : 'last', 'label': lambda x: ' '.join(x)})

    start   end     label
0   1       10      a b
1   15      20      c d
2   27      42      e f g
3   50      54      h
Vaishali
  • 37,545
  • 5
  • 58
  • 86