20

Given two sorted arrays like the following:

a = array([1,2,4,5,6,8,9])

b = array([3,4,7,10])

I would like the output to be:

c = array([1,2,3,4,5,6,7,8,9,10])

or:

c = array([1,2,3,4,4,5,6,7,8,9,10])

I'm aware that I can do the following:

c = unique(concatenate((a,b)))

I'm just wondering if there is a faster way to do it as the arrays I'm dealing with have millions of elements.

Any ideas are welcome. Thanks

hobs
Jun
  • I really doubt that you'll do any better without writing a compiled extension to combine `concatenate`, `unique`, and `sort`. – mgilson Sep 14 '12 at 15:12
  • You can at least drop the `sort` as the output from `unique` is guaranteed to already be sorted. – Fred Foo Sep 14 '12 at 15:36

8 Answers

36

Since you use numpy, I doubt that bisect helps you at all... So instead I would suggest a few smaller things:

  1. Do not use np.sort; use the c.sort() method instead, which sorts the array in place and avoids a copy.
  2. np.unique uses np.sort, which is not in place. So instead of calling np.unique, do its logic by hand (check its Python code too): first sort in place, then build the uniqueness mask with flag = np.concatenate(([True], ar[1:] != ar[:-1])), so that unique = ar[flag] (with ar already sorted). To do a bit better, build the mask in place as well: flag = np.ones(len(ar), dtype=bool) followed by np.not_equal(ar[1:], ar[:-1], out=flag[1:]), which avoids essentially one full copy of flag.
  3. I am not sure about this one, but .sort supports three different algorithms; since your arrays may already be almost sorted, changing the sorting method might make a speed difference (a small timing sketch follows the function below).

This would make the full thing close to what you got (without doing a unique beforehand):

import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b)) # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
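
To try point 3, here is a minimal timing sketch using the insort function above (the array sizes and random data are only an illustration, not part of the original answer):

import numpy as np
import timeit

# two large, already-sorted arrays, roughly in the spirit of the question
a = np.sort(np.random.randint(0, 10**7, 5 * 10**6))
b = np.sort(np.random.randint(0, 10**7, 5 * 10**6))

for kind in ('quicksort', 'mergesort', 'heapsort'):
    # time each sort kind on the same presorted inputs
    t = timeit.timeit(lambda: insort(a, b, kind=kind), number=3)
    print(kind, t / 3)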
seberg
  • If you roll these suggestions into a function, I'd be happy to add it to the timeit framework I posted in my answer. – mgilson Sep 14 '12 at 15:39
  • @mgilson added it, but if you want to compare it, compare it with large arrays... small are likely off the point for this purpose, too much smaller overhead that does not matter for this application. – seberg Sep 14 '12 at 15:53
  • Tested -- currently you're in the lead for an array size 10000. – mgilson Sep 14 '12 at 16:03
  • Yes, yours is by far the fastest – Jun Sep 14 '12 at 16:29
  • 1
    And I have to say -- This answer is pretty remarkable. (As documented in my comment on the original question) I didn't think you could improve on the code posted by the OP very much. Well done. (hopefully others will see this and upvote gratuitously). – mgilson Sep 14 '12 at 17:11
  • To be honest, I didn't expect much myself, but I guess for the given data the mergesort apparently makes quite some difference (unless swapping becomes a problem for the OP). Actually, the OP's code is equivalent to `np.union1d`... – seberg Sep 14 '12 at 17:30
12

Inserting elements into the middle of an array is a very inefficient operation, as arrays are flat in memory, so everything has to be shifted along whenever you insert another element. As a result, you probably don't want to use bisect. The complexity of doing so would be around O(N^2).

Your current approach is O(n*log(n)), so that's a lot better, but it's not perfect.

Inserting all the elements into a hash table (such as a set) is one option. That's going to take O(N) time to uniquify, but then you need to sort, which will take O(n*log(n)). Still not great.
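
For illustration only (this is not what the answer recommends), a minimal sketch of that set-based approach could look like:

import numpy as np

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])

# O(N) de-duplication via a hash set, then an O(n*log(n)) sort
c = np.array(sorted(set(a) | set(b)))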

The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smaller head of your input lists, i.e. a merge. Unfortunately neither numpy nor Python seems to have such a thing. The solution may be to write one in Cython.

It would look vaguely like the following:

import numpy
cimport numpy  # needed for the typed numpy.ndarray arguments

def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    while (i != len(in1)) or (j != len(in2)):
        # set out[k] to the smaller remaining head of in1 or in2,
        # then increment k and whichever of i or j was consumed
        if j == len(in2) or (i != len(in1) and in1[i] <= in2[j]):
            out[k] = in1[i]
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
jleahy
  • Agreed, writing a small Cython extension like this is a nice option. – seberg Sep 14 '12 at 16:07
  • 1
    Great efficiency improvement, as long as your assumption of presorted inputs is correct. It seems the OP intended that. – hobs Sep 08 '15 at 19:57
  • 1
    You can always check that your inputs are sorted, that only takes `O(n)` time, so doesn't increase the complexity. – jleahy Sep 10 '15 at 22:16
9

When curious about timings, it's always best to just use timeit. Below, I've listed a subset of the various methods and their timings:

import numpy as np
import timeit
import heapq



def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, np.insert(a, lo, [x])

size=10000
a = np.array(range(size))
b = np.array(range(size))

def op(a,b):
    return np.unique(np.concatenate((a,b)))

def martijn(a,b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a,b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v
    return c

def larsmans(a,b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a,b):
    return np.array(set.union(set(a),b))


def sebastian(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b)) # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

Results:

martijn2     25.1079499722
OP       1.44831800461
larsmans     9.91507601738
larsmans_mod     5.87612199783
sebastian    3.50475311279e-05

My specific contribution here is larsmans_mod, which avoids creating 2 sets -- it only creates 1, and in doing so cuts execution time nearly in half.

EDIT: removed martijn as it was too slow to compete. Also tested with slightly bigger (sorted) input arrays. I also have not tested the outputs for correctness ...

mgilson
4

In addition to the other answer on using bisect.insort, if you are not content with the performance, you may try using the blist module with bisect. It should improve the performance.

Traditional list insertion complexity is O(n), while blist's complexity on insertion is O(log(n)).
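
A minimal sketch of that combination (assuming the third-party blist package is installed; the exact snippet is mine, not from the answer):

import bisect
from blist import blist  # third-party drop-in replacement for list

a = [1, 2, 4, 5, 6, 8, 9]
b = [3, 4, 7, 10]

c = blist(a)              # insertion into a blist costs O(log(n)) rather than O(n)
for x in b:
    bisect.insort(c, x)   # binary search for the position, then a cheap insert

print(list(c))            # [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]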

Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to take advantage of the fact that both arrays are presorted. This approach incurs some overhead because it creates a new array in memory, but it may be an option to consider, as this solution's time complexity is O(n+m), while the solutions with insort have O(n*m) complexity (n elements * m insertions).

import heapq

a = [1,2,4,5,6,8,9]
b = [3,4,7,10]


it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
L = list(it) #list made of it
print(L)

Output:

[1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]

If you want to delete repeating values, you can use groupby:

import heapq
import itertools

a = [1,2,4,5,6,8,9]
b = [3,4,7,10]


it = heapq.merge(a,b) #iterator consisting of merged elements of a and b
it = (k for k,v in itertools.groupby(it))
L = list(it) #list made of it
print(L)

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ovgolovin
  • looks great, but I'm worried whether it will be as memory efficient as a numpy array – Jun Sep 14 '12 at 15:24
  • @Jun You should do some tests to determine the memory consumption. The problem with traditional arrays and lists is that an insertion requires moving all the elements after the insertion point by 1, which takes `O(n)`. `blist` uses another data structure, so that complexity is `O(log(n))`. It will change things dramatically for large arrays and lists. – ovgolovin Sep 14 '12 at 15:28
2

You could use the bisect module for such merges, merging the second array into the first.

The bisect* functions work for numpy arrays but the insort* functions don't. It's easy enough to use the module source code to adapt the algorithm; it's quite basic:

from numpy import array, copy, insert

def insort(a, x, lo=0, hi=None):
    if hi is None: hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo, insert(a, lo, [x])

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)

Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:

import bisect

c = copy(a)
lo = 0
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)

Using this adapted insort is much more efficient than a combine and sort. Because b is sorted as well, we can track the lo insertion point and search for the next point starting there instead of considering the whole array each loop.

If you don't need to preserve a, just operate directly on that array and save yourself the copy.

More efficient still: because both lists are sorted, we can use heapq.merge:

from numpy import zeros
import heapq

c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
    c[i] = v
Martijn Pieters
  • That's just the code I came up with. But I realize that if the array 'b' is millions of elements long, then the machine will reassign the array 'c' in your code millions of times. This doesn't look very efficient to me – Jun Sep 14 '12 at 15:37
  • @Jun: Yeah, I don't like insert that much, so I've found a better method still using a pre-allocated array and heapq.merge. – Martijn Pieters Sep 14 '12 at 15:40
1

Use the bisect module for this:

import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])

for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)  # np.insert returns a new array, so reassign

I can't test this right now, but it should work

Charlie G
1

The sortednp package implements an efficient merge of sorted numpy-arrays, just sorting the values, not making them unique:

import numpy as np
import sortednp
a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)

I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
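
If, as in the question, you also want to drop the duplicate values, one option is to reuse the flag trick from the accepted answer on the already-sorted merge result (a sketch, not part of sortednp itself):

import numpy as np
import sortednp

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])

c = sortednp.merge(a, b)                   # sorted, duplicates kept
flag = np.ones(len(c), dtype=bool)         # keep the first occurrence of each value
np.not_equal(c[1:], c[:-1], out=flag[1:])
print(c[flag])                             # [ 1  2  3  4  5  6  7  8  9 10]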

0

Seems like no one mentioned np.union1d. Currently, it is a shortcut for unique(concatenate((ar1, ar2))), but it's a short name to remember and it has the potential to be optimized by the numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:

import numpy as np

def insort(a, b, kind='mergesort'):
    # took mergesort as it seemed a tiny bit faster for my sorted large array try.
    c = np.concatenate((a, b))  # we still need to do this unfortunately.
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

size = int(1e7)
a = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
b = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)

np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))

import timeit
repetitions = 20
print("insort: %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions)/repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions)/repetitions,))

Output on my machine:

insort: 1.69962s
union1d: 1.66338s
otognan