I have a 2D NumPy array that could be of any type, but for this example, we can assume it is integers. I am looking to find the fastest way to find all the unique rows in the array.
My initial strategy was to convert each row into a tuple and add this to a set. If the length of the set increased, this would mean a unique row was found.
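In pure Python/NumPy, that strategy looks like the sketch below (the function name `unique_rows_mask` is mine, not from the Cython code; this is just the reference behavior the Cython versions try to speed up):

```python
import numpy as np

def unique_rows_mask(a):
    """Boolean mask marking the first occurrence of each unique row."""
    seen = set()
    mask = np.zeros(a.shape[0], dtype=bool)
    for i, row in enumerate(a):
        t = tuple(row)          # hashable key for the set
        if t not in seen:
            seen.add(t)
            mask[i] = True
    return mask

a = np.array([[1, 2], [3, 4], [1, 2]])
print(unique_rows_mask(a).tolist())  # [True, True, False]
```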
What I don't know how to do is quickly hash each row as bytes. There is a question linked here where an entire array is hashed at once, but not each row individually.
What I tried - tuple creation
There are many ways to create a tuple, and each one impacts performance. Here is my function, along with 4 different variations of the line that builds the tuple:
Version 1:
import numpy as np
cimport numpy as np
from numpy cimport ndarray

def unique_int_tuple1(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast=True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(tuple(a[i]))  # THIS LINE IS CHANGED FOR ALL VERSIONS
        if len(s) > len_before:
            idx[i] = True
    return idx
Version 2:
s.add(tuple([a[i, j] for j in range(nc)]))
Version 3:
# vals is a pre-allocated list with length equal to the number of columns
for j in range(nc):
    vals[j] = a[i, j]
s.add(tuple(vals))
Version 4:
s.add((a[i, 0], a[i, 1], a[i, 2], a[i, 3]))
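A quick pure-Python sanity check (my own, not part of the Cython code) confirms that all four construction strategies produce equal, interchangeably hashable set keys, so the versions differ only in speed, not results:

```python
import numpy as np

a = np.random.randint(0, 8, (10, 4))
i = 0

t1 = tuple(a[i])                                      # version 1
t2 = tuple([a[i, j] for j in range(a.shape[1])])      # version 2
vals = [0] * a.shape[1]                               # version 3
for j in range(a.shape[1]):
    vals[j] = a[i, j]
t3 = tuple(vals)
t4 = (a[i, 0], a[i, 1], a[i, 2], a[i, 3])             # version 4

assert t1 == t2 == t3 == t4
assert len({t1, t2, t3, t4}) == 1  # all hash to the same set entry
```

Note that version 4 hard-codes 4 columns, so it trades generality for speed.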
Performance
a = np.random.randint(0, 8, (10**5, 4))
%timeit unique_int_tuple1(a)
125 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit unique_int_tuple2(a)
14.5 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit unique_int_tuple3(a)
11.7 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit unique_int_tuple4(a)
9.59 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoiding the tuple constructor (version 4) results in a nice performance gain.
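For reference, NumPy itself can find unique rows with `np.unique(..., axis=0)` (available since NumPy 1.13). It sorts rather than hashes, so it returns the unique rows in lexicographic order rather than a first-occurrence mask, but it makes a useful correctness and speed baseline:

```python
import numpy as np

a = np.random.randint(0, 8, (10**5, 4))

# Unique rows, sorted lexicographically (at most 8**4 == 4096 distinct here).
uniq = np.unique(a, axis=0)

# return_index=True also gives the index of the first occurrence of each
# unique row in the original array.
uniq2, first_idx = np.unique(a, axis=0, return_index=True)
print(uniq.shape, first_idx.shape)
```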
Using tostring
From the linked SO question above, I can use the tostring method on each row and then hash this.
def unique_int_tostring(ndarray[np.int64_t, ndim=2] a):
    cdef int i, j
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast=True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(a[i].tostring())
        if len(s) > len_before:
            idx[i] = True
    return idx
This works but is much slower than the tuple versions:
%timeit unique_int_tostring(a)
40 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
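The per-row bytes idea itself is sound: equal rows of a C-contiguous array serialize to equal byte strings, so they collapse to one set entry. A minimal pure-Python check (using `tobytes`, the current name for the deprecated `tostring`):

```python
import numpy as np

a = np.array([[1, 2], [1, 2], [3, 4]], dtype=np.int64)
keys = [row.tobytes() for row in a]   # raw bytes of each contiguous row

assert keys[0] == keys[1]   # identical rows -> identical bytes
assert keys[0] != keys[2]   # different rows -> different bytes
print(len(set(keys)))       # 2 unique rows
```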
Using typed memoryview
A huge part of the slowdown, I believe, is the access of each row via a[i]. We can use typed memoryviews to increase performance, but I don't know how to turn elements of a typed memoryview into strings (bytes) so they can be hashed.
def unique_int_memoryview(long[:, :] a):
    cdef int i, j
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    for i in range(nr):
        s.add(<SOMETHING>)  # NO IDEA HERE
    return s