3

I have a really simple problem and data structure but the number is so large that I need to find an efficient way.

Suppose I have an object that has an attribute which is an interval. For example:

        `start      stop`
obj1      5          10
obj2      8          12
obj3      11         14
obj4      13         20
obj5      22         25
obj6      24         30
obj7      33         37
obj8      36         40

I want to merge it so that overlapping interval will become one object. So, the result of the example will become

         start        stop
objA        5          20
objB       22          30
objC       33          40

I am using python for this. Please notice that I have thousand of this type of data.

Bharata
  • 685
  • 3
  • 11
  • 23
  • 1
    What have you tried so far? Have you written any code? – Yakov Dan Nov 19 '18 at 09:35
  • Please specify whether your ranges are always sorted (as in your example) by starts and stops. – Aristide Nov 19 '18 at 09:49
  • 2
    Duplicate of [python union of multiple ranges](https://stackoverflow.com/questions/15273693/python-union-of-multiple-ranges). Please look at the answers of that question, they are exactly what you need, you just have to pass it to pandas. – silgon Nov 19 '18 at 09:56

3 Answers3

1
df['Startpoint'] = df['stop`'].shift() < df['`start'] # Begin of interval
df['Endpoint'] = df['Startpoint'].shift(-1) # End of interval
df.loc['obj1', 'Startpoint'] = True # First line is startpoint
df['Endpoint'].fillna(True, inplace=True) # Last line is endpoint

df2 = df[df[['Startpoint', 'Endpoint']].any(axis=1)]
df2['`start'] = df2['`start'].shift() 
df2.loc[df2['Endpoint'], ['`start', 'stop`']]

  #            `start  stop`
  #  obj4     5.0     20
  #  obj6    22.0     30
  #  obj8    33.0     40

Find all begins and ends of interval, keep only those rows, and then shift the start value one row so that the values per interval are in the same row.

This is all pandas, so I believe it should be reasonably quick.

Jondiedoop
  • 3,303
  • 9
  • 24
0

When the intervals are sorted by starts, this simple function should work in linear time:

def merge_intervals(intervals):
    result = []
    (start_candidate, stop_candidate) = intervals[0]
    for (start, stop) in intervals[1:]:
        if start <= stop_candidate:
            stop_candidate = max(stop, stop_candidate)
        else:
            result.append((start_candidate, stop_candidate))
            (start_candidate, stop_candidate) = (start, stop)
    result.append((start_candidate, stop_candidate))
    return result

intervals = [
    ( 5, 10),
    ( 8, 12),
    (11, 14),
    (13, 20),
    (22, 25),
    (24, 30),
    (33, 37),
    (36, 40),
]

assert merge_intervals(intervals) == [(5, 20), (22, 30), (33, 40)]
Aristide
  • 3,606
  • 2
  • 30
  • 50
-1

The quickest way to deal with this kind of data is to use Union find data structure or disjoint data structure which tracks a set of elements partitioned into a number of disjoint subsets. I am leaving the implementation and design of the data structure on you as there are efficient ways of implementing Disjoint data structures which operates in linear time.

mohor chatt
  • 356
  • 1
  • 8