
I'm trying to implement a specific binary search algorithm. `results` should start out as an empty set, and during the search the `results` variable should be updated with the union of each batch of new results.

Basically:

results = set()
for result in search():
  results = results.union(result)

But such code won't work with NumPy arrays, so we use np.union1d for this purpose:

results = np.array([])
for result in search():
    results = np.union1d(results, result)

The code above doesn't really work either: given, for example, two vectors a = [1, 2, 3] and b = [3, 4, 5], np.union1d(a, b) returns:

[1, 2, 3, 4, 5]

But I want it to return:

[[1, 2, 3], [3, 4, 5]]

And since there should be no duplicate vectors, a call like union([[1, 2, 3], [3, 4, 5]], [1, 2, 3]) should still return:

[[1, 2, 3], [3, 4, 5]]


So what I need, I would say, is a NumPy-based union that operates on whole vectors.

I also considered using np.append(a, b) followed by np.unique(x), but by default both functions flatten their inputs to 1-D, so the vectors lose their shape. np.append does take an axis=0 argument, which preserves the dimension of the arrays being joined, but I couldn't implement that efficiently without getting a dimension error, as shown in the sketch below.
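To illustrate (the (0, 3) shape of the empty accumulator is my assumption for 3-dimensional vectors; it is what sidesteps the dimension error):

import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 4, 5])

print(np.append(a, b))     # [1 2 3 3 4 5] -- flattened to 1-D
print(np.unique([a, b]))   # [1 2 3 4 5]   -- also flattened by default

# With axis=0, all inputs must have the same number of dimensions,
# so a 1-D empty accumulator raises a dimension error:
# np.append(np.array([]), [a], axis=0)   # ValueError

# Starting from an empty (0, 3) array avoids that:
results = np.empty((0, 3), dtype=a.dtype)
results = np.append(results, [a], axis=0)
print(results)             # [[1 2 3]]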

Question:

How can I efficiently implement a vector-based set, so that the points in the union are treated as vectors rather than scalars, and retain their vector form and dimension?
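To make the requirement concrete, here is a minimal sketch of the behavior I'm after. vector_union is just an illustrative name, and stacking the rows and calling np.unique(..., axis=0) is one candidate, not necessarily the efficient implementation I'm looking for:

import numpy as np

def vector_union(a, b):
    # Hypothetical row-wise union: each row is treated as one point.
    stacked = np.vstack([np.atleast_2d(a), np.atleast_2d(b)])
    return np.unique(stacked, axis=0)   # deduplicates whole rows

u = vector_union([1, 2, 3], [3, 4, 5])
# [[1 2 3]
#  [3 4 5]]

v = vector_union(u, [1, 2, 3])
# still [[1 2 3], [3 4 5]] -- the duplicate row is dropped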

ShellRox
  • Possible duplicate of https://stackoverflow.com/questions/16970982/find-unique-rows-in-numpy-array – Mazdak Dec 27 '18 at 19:28
  • How about converting the arrays to tuples? `tuple(arr.tolist())`. Python `set` wants hashable objects such as tuples. – hpaulj Dec 27 '18 at 19:32
  • @hpaulj Doesn't the `tolist()` method make the algorithm less efficient? I've tried appending such tuples to an array, and they greatly increased the time. I couldn't try it with sets, since I was getting an "unhashable type" error. – ShellRox Dec 27 '18 at 19:48
  • @Kasrâmvd I did try `np.unique` as mentioned (with the axis parameter as well), though I'm not certain how it can be implemented efficiently for high-dimensional arrays (i.e. how the initial vector should be defined without getting a dimension error). – ShellRox Dec 27 '18 at 19:49
  • `set` is quite efficient if you can give it hashable objects like tuples. The `numpy` set functions generally use `np.unique`, which is based on sorting the elements. `unique` originally worked with 1d arrays as `np.union1d` still does. It's been extended to take an `axis` parameter, but at its core it is still a 1d sort. – hpaulj Dec 27 '18 at 19:57
  • Stay away from `np.append`, especially in an iterative context. List append in a loop is fine, but joining arrays in a loop is slow. – hpaulj Dec 27 '18 at 20:00
  • @hpaulj I don't have much experience with numpy, but what I wonder is whether calling the `tolist()` method and converting the array to a tuple takes a lot of time. Say I have 500-dimensional vectors: what I currently do is define an empty set, `results = set()`, and then, iterating through all vectors as `result`, run `results = results.union(tuple(result.ravel().tolist()))`, and it takes a lot of time, as expected. I have to use `ravel()`, because if I don't, the points are added to the tuple as 1d vectors. – ShellRox Dec 27 '18 at 20:06 (see the note on `set.union` vs `set.add` just after these comments)
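A note on the snippet in the last comment: `set.union` iterates its argument, so passing a single tuple unions its scalar elements rather than adding the tuple as one point; `set.add` keeps the vector intact as a single hashable element. A minimal sketch:

import numpy as np

result = np.array([1, 2, 3])
row = tuple(result.tolist())    # hashable version of the vector

s = set().union(row)            # union iterates the tuple...
print(s)                        # {1, 2, 3} -- three scalars, not one point

s = set()
s.add(row)                      # ...add stores it as a single element
print(s)                        # {(1, 2, 3)}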

1 Answer


Here are some basic set operations.

Define a pair of lists (they could be arrays like np.array([1,2,3]), but that's not what you show).

In [261]: a = [1,2,3]; b=[3,4,5]

A list of several of those:

In [263]: alist = [a, b, a]
In [264]: alist
Out[264]: [[1, 2, 3], [3, 4, 5], [1, 2, 3]]

I can get the unique values by converting to tuples and putting them in a set.

In [265]: set([tuple(i) for i in alist])
Out[265]: {(1, 2, 3), (3, 4, 5)}

I can also convert that list into a 2d array:

In [266]: arr = np.array(alist)
In [267]: arr
Out[267]: 
array([[1, 2, 3],
       [3, 4, 5],
       [1, 2, 3]])

and get the unique rows with unique and an axis parameter:

In [269]: np.unique(arr, axis=0)
Out[269]: 
array([[1, 2, 3],
       [3, 4, 5]])

Compare the times:

In [270]: timeit np.unique(arr, axis=0)
46.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [271]: timeit set([tuple(i) for i in alist])
1.01 µs ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Converting the array to a list, or the list to an array, adds some time, but the basic pattern remains:

In [272]: timeit set([tuple(i) for i in arr.tolist()])
1.53 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [273]: timeit np.unique(alist, axis=0)
53.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For larger, more realistic inputs the relative timings may change a bit, but I expect the set of tuples to remain the best. Set operations are not a numpy strong point. unique does a sort, followed by an elimination of duplicates. set uses a hashing method, similar to what Python uses for dictionaries.

If you must collect values iteratively from a source, I'd suggest building a list, and doing the set/unique once.

alist = []
for x in source():
    alist.append(x)

or one of:

alist = [x for x in source()]
alist = list(source())
alist = [tuple(x) for x in source()]
alist = [tuple(x.tolist()) for x in source()]
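
Putting it together, a minimal sketch of the whole pattern, with a hypothetical source() standing in for the real search: collect first, deduplicate once at the end, and convert back to a 2-D array only if you need one.

import numpy as np

def source():
    # Stand-in for the real search; yields 1-D vectors.
    yield np.array([1, 2, 3])
    yield np.array([3, 4, 5])
    yield np.array([1, 2, 3])

rows = [tuple(x.tolist()) for x in source()]   # cheap list building
unique_rows = set(rows)                        # hash-based dedup, done once
print(unique_rows)                             # {(1, 2, 3), (3, 4, 5)}

arr = np.array(sorted(unique_rows))            # back to a 2-D array if needed
print(arr)
# [[1 2 3]
#  [3 4 5]]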
hpaulj
  • Yes, that's what I tried, and it was much quicker. Transforming a numpy array into a tuple takes some time, though, so I will make `source()` a list full of tuples, so that they are predefined. Thank you! – ShellRox Dec 27 '18 at 20:40