I have two sorted np.arrays, say
1, 2, 3, 5
and
2, 4, 6, 7
I want
1 2 2 3 4 5 6 7
Don't want python for loops.
Is there some numpy function for this?
Bonus: do this for matrices by some axis (other axes have the same shape)
I have two sorted np.arrays, say
1, 2, 3, 5
and
2, 4, 6, 7
I want
1 2 2 3 4 5 6 7
Don't want python for loops.
Is there some numpy function for this?
Bonus: do this for matrices by some axis (other axes have the same shape)
Import sortednp and you're good to go:
import numpy as np
import sortednp as s
a = np.array([1,2,3,5])
b = np.array([2,4,6,7])
m = s.merge(a, b)
print(m)
Abusing the sorted nature of the input arrays, we can use np.searchsorted
, like so -
def merge_sorted_arrays(a, b):
m,n = len(a), len(b)
# Get searchsorted indices
idx = np.searchsorted(a,b)
# Offset each searchsorted indices with ranged array to get new positions
# of b in output array
b_pos = np.arange(n) + idx
l = m+n
mask = np.ones(l,dtype=bool)
out = np.empty(l,dtype=np.result_type(a,b))
mask[b_pos] = False
out[b_pos] = b
out[mask] = a
return out
Sample run (using a generic case of duplicates) -
In [52]: a
Out[52]: array([1, 2, 3, 3, 5, 9, 9, 9])
In [53]: b
Out[53]: array([ 2, 4, 6, 6, 6, 7, 10])
In [54]: merge_sorted_arrays(a, b)
Out[54]: array([ 1, 2, 2, 3, 3, 4, 5, 6, 6, 6, 7, 9, 9, 9, 10])
Timimgs on random-sorted 1000000
sized arrays -
Benchmarking against the popular concatenate+sort method.
# Setup
In [141]: np.random.seed(0)
...: a = np.sort(np.random.randint(0,1000000,(1000000)))
...: b = np.sort(np.random.randint(0,1000000,(1000000)))
# @chmod777's soln
In [142]: %%timeit
...: c = np.concatenate((a,b), axis=0)
...: c.sort()
141 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [143]: %timeit merge_sorted_arrays(a, b)
55.1 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The numpy devs recently implemented tim sort. As tim sort checks for presorted sub-sequences it will sort the concatenation of two sorted arrays in O(n).
For some reason it is currently not possible to directly select tim sort, but at least in some cases kind="stable"
results in tim sort being used. Integer types use radix sort which is also O(n).
See official docs for more detail https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
At size 10^6 this results in a 10-fold speedup compared to the default sorting method (qsort, I believe).
tim sort/radix sort is also only a tiny bit slower than sortednp.merge
.
# default sort
def so():
c = np.concatenate((a,b))
c.sort()
return c
# tim sort / radix sort
def ts():
c = np.concatenate((a,b))
c.sort(kind="stable")
return c
from sortednp import merge
# extra library
def mg():
return merge(a,b)
# @Divakar's example (range enlarged)
a = np.sort(np.random.randint(0,100000000,(1000000)))
b = np.sort(np.random.randint(0,100000000,(1000000)))
timeit(so,number=10)
# 1.5669178580283187
timeit(ts,number=10)
# 0.12706473504658788
timeit(mg,number=10)
# 0.12382328097010031
# for comparison @Divakar's solution
timeit(lambda:merge_sorted_arrays(a,b),number=10)
# 0.5367169310338795
# non integer example
a = np.sort(np.random.random(1000000))
b = np.sort(np.random.random(1000000))
timeit(so,number=10)
# 1.7868053679703735
timeit(ts,number=10)
# 0.17676723399199545
timeit(mg,number=10)
# 0.1376464170170948
# and @Divakar
timeit(lambda:merge_sorted_arrays(a,b),number=10)
# 0.5656043770140968