8

I use Python with numpy.

I have a numpy array of indexes a:

>>> a
array([[5, 7],
       [12, 18],
       [20, 29]])
>>> type(a)
<type 'numpy.ndarray'>

I have a numpy array of indexes b:

>>> b
array([[2, 4],
       [8, 11],
       [33, 35]])
>>> type(b)
<type 'numpy.ndarray'>

I need to join an array a with an array b:

a + b => [2, 4] [5, 7] [8, 11] [12, 18] [20, 29] [33, 35]

=> a and b there are arrays of indexes => [2, 18] [20, 29] [33, 35]

( indexes ([2, 4][5, 7][8, 11][12, 18]) go sequentially

=> 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 => [2, 18] )

For this example:

>>> out_c
array([[2, 18],
       [20, 29],
       [33, 35]])

Can someone please suggest, how do I get out_c?

Update: @Geoff suggested solution python union of multiple ranges. Whether this solution the fastest and best in large data arrays?

Community
  • 1
  • 1
Olga
  • 1,395
  • 2
  • 25
  • 34

3 Answers3

5

(New Answer) Using Numpy

ranges = np.vstack((a,b))
ranges.sort(0)

# List of non-overlapping ranges
nonoverlapping = (ranges[1:,0] - ranges[:-1,1] > 1).nonzero()[0]

# Starts are 0, and all the starts not overlapped by their predecessor
starts = np.hstack(([0], nonoverlapping + 1))

# Ends are -1 and all the ends who aren't overlapped by their successor
ends = np.hstack(( nonoverlapping, [-1]))

# Result
result = np.vstack((ranges[starts, 0], ranges[ends, 1])).T

(Old answer) Using lists and sets

import numpy as np
import itertools

def ranges(s):
    """ Converts a list of integers into start, end pairs """
    for a, b in itertools.groupby(enumerate(s), lambda(x, y): y - x):
        b = list(b)
        yield b[0][1], b[-1][1]

def intersect(*args):
    """ Converts any number of numpy arrays containing start, end pairs 
        into a set of indexes """
    s = set()
    for start, end in np.vstack(args):
        s = s | set(range(start,end+1))
    return s

a = np.array([[5,7],[12, 18],[20,29]])
b = np.array([[2,4],[8,11],[33,35]])

result = np.array(list(ranges(intersect(a,b))))

References

Community
  • 1
  • 1
Geoff
  • 7,935
  • 3
  • 35
  • 43
  • What did you find to not work with your new solution? I was poking at it right now... – Jaime Apr 04 '13 at 22:49
  • With the "non overlapping" caveat, I'd say it works fine, doesn't it? – Jaime Apr 04 '13 at 23:02
  • To sort it keeping every start with its end, you can do `ranges.view(dtype=[('',ranges.dtype),]*2).sort(axis=0)`. – Jaime Apr 04 '13 at 23:20
  • It has trouble, i.e. it doesn't work, with inputs like `ranges = np.array([[2, 7], [3, 12], [4, 11]])`, not sure if there is a better way than what I came up with to get the `max` of the ends... Or maybe the OP can guarantee this type of overlapping will never happen. – Jaime Apr 04 '13 at 23:26
  • Are you sure? The call to sort is needed, as is. – Geoff Apr 05 '13 at 01:25
  • Be sure to test it well. I'm skeptical about its simplicity. Also, how many ranges do you have? Any solution that has `sort` in it is at least `O(n log n)`, but I believe it can be done in `O(n)`. – Geoff Apr 05 '13 at 13:02
  • At this point I'm pretty confident that my method works. I also think that the best you can get is `O(n log* n)` but you'll need a fancy data-structure. Are you satisfied with the answer? – Geoff Apr 05 '13 at 17:49
3

Not pretty, but it works. I don't like the final loop, buy couldn't think of a way of doing without it:

ab = np.vstack((a,b))
ab.sort(axis=0)

join_with_next = ab[1:, 0] - ab[:-1, 1] <= 1
endpoints = np.concatenate(([0],
                            np.where(np.diff(join_with_next) == True)[0]  + 2,
                            [len(ab,)]))
lengths = np.diff(endpoints)
new_lengths = lengths.copy()
if join_with_next[0] == True:
    new_lengths[::2] = 1
else:
    new_lengths[1::2] = 1
new_endpoints = np.concatenate(([0], np.cumsum(new_lengths)))
print endpoints, lengths
print new_endpoints, new_lengths

starts = endpoints[:-1]
ends = endpoints[1:]
new_starts = new_endpoints[:-1]
new_ends = new_endpoints[1:]
c = np.empty((new_endpoints[-1], 2), dtype=ab.dtype)

for j, (s,e,ns,ne) in enumerate(zip(starts, ends, new_starts, new_ends)):
    if e-s != ne-ns:
        c[ns:ne] = np.array([np.min(ab[s:e, 0]), np.max(ab[s:e, 1])])
    else:
        c[ns:ne] = ab[s:e]

>>> c
array([[ 2, 18],
       [20, 29],
       [33, 35]])
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • You inspired me to keep trying. If you're still interested in this problem, will you check my answer? It seems too simple to be right. – Geoff Apr 04 '13 at 22:41
  • `ab.sort(0)` ruins the association between start and end pairs. A real sort should be `ab = ab[ab[:,0].argsort(), :]` – Geoff Apr 04 '13 at 23:01
  • @Geoff Take a view of the array as a structured array, then it will sort properly (methinks). – Jaime Apr 04 '13 at 23:04
  • Well, I'm getting confused. I revealed my new answer. Maybe we can break it. It's definitely a short answer... – Geoff Apr 04 '13 at 23:08
  • @Jaime for `a = np.array([[0,1],[5, 7]])` and `b = np.array([[3,4]])` resul: `out_c = [[0 1] [3 4] [5 7]]`, and it would be `out_c = [[0 1] [3 7]]` – Olga Apr 05 '13 at 08:04
1

perhaps you could try to use numpy.concatenate() to join the arrays together, and then find the mininum and maximum of each row...then create c as a matrix of the min and max of each row.

alternatively, np.minimum and np.maximum compares two arrays and finds the minimum and maximum, so you could find the minimum and maximum for each row then assign it to the matrix c.

mtigger
  • 2,397
  • 3
  • 15
  • 17
  • I think it's a coincidence that the two arrays `a` and `b` both have three rows. This is a good point though, if I'm wrong. – Geoff Apr 03 '13 at 13:39