
I have a 2D NumPy array that could be of any type, but for this example, we can assume it is integers. I am looking to find the fastest way to find all the unique rows in the array.

My initial strategy was to convert each row into a tuple and add this to a set. If the length of the set increased, this would mean a unique row was found.
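
In plain Python, the idea looks roughly like this (a reference sketch; unique_rows_mask is my own name, not one of the functions timed below):

import numpy as np

def unique_rows_mask(a):
    # True for the first occurrence of each row, False for later duplicates
    seen = set()
    idx = np.zeros(a.shape[0], dtype=bool)
    for i in range(a.shape[0]):
        t = tuple(a[i])
        if t not in seen:
            seen.add(t)
            idx[i] = True
    return idx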

What I don't know how to do is quickly hash each row as bytes. There is a linked SO question where an entire array is hashed this way.

What I tried - tuple creation

There are many ways to create a tuple, and each one impacts performance. Here is my function; the four versions below differ only in the single marked line:

Version 1:

import numpy as np
cimport numpy as np
from numpy cimport ndarray

def unique_int_tuple1(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(tuple(a[i]))        # THIS LINE IS CHANGED FOR ALL VERSIONS
        if len(s) > len_before:
            idx[i] = True
    return idx

Version 2:

s.add(tuple([a[i, j] for j in range(nc)]))

Version 3:

Here vals is a pre-allocated list whose length equals the number of columns:

for j in range(nc):
    vals[j] = a[i, j]
s.add(tuple(vals))

Version 4 (hardcodes exactly four columns):

s.add((a[i, 0], a[i, 1], a[i, 2], a[i, 3]))

Performance

a = np.random.randint(0, 8, (10**5, 4))
%timeit unique_int_tuple1(a)
125 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit unique_int_tuple2(a)
14.5 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit unique_int_tuple3(a)
11.7 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit unique_int_tuple4(a)
9.59 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoiding the tuple constructor entirely (version 4) results in a nice performance gain, presumably because Cython can build the tuple directly from the indexed values rather than going through the Python-level tuple constructor.

Using tostring

From the linked SO question above, I can use the tostring method on each row and then hash the resulting bytes.

def unique_int_tostring(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')

    for i in range(nr):
        len_before = len(s)
        s.add(a[i].tostring())
        if len(s) > len_before:
            idx[i] = True
    return idx

This works but is very slow:

%timeit unique_int_tostring(a)
40 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using typed memoryview

A huge part of the slowdown, I believe, is the access of each row with a[i], which creates a new NumPy array object on every iteration. We can use typed memoryviews to increase performance, but I don't know how to turn the elements of a typed memoryview into a string so they can be hashed.

def unique_int_memoryview(long[:, :] a):
    cdef int i, j
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    for i in range(nr):
        s.add(<SOMETHING>)   # NO IDEA HERE
    return s
Ted Petrou
  • You might get some improvement by doing `a[i,:]` instead of `a[i]` (for both `ndarray` and for memoryviews) - I doubt if it'll be much though – DavidW Feb 19 '18 at 20:58
  • @DavidW No, unfortunately that doesn't help. Other ideas include turning the entire array into a string before the loop. Also, I'm not sure turning a row into a string will guarantee uniqueness. – Ted Petrou Feb 19 '18 at 21:04
  • `np.unique(a, axis=0)` is coming out at `20.6 ms ± 77.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)` for me. Maybe that's a starting basis though? I'm not sure if that method can be used in Cython. – roganjosh Feb 19 '18 at 21:06
  • @roganjosh np.unique is very slow (unless the data has few duplicates) and sorts the data first. I'm looking for a hash based solution. – Ted Petrou Feb 19 '18 at 21:19
  • Why not use your own hash function? I've had success implementing the FNV hash in pure cython: https://github.com/yt-project/yt/blob/c1569367c6e3d8d0a02e10d0f3d0bd701d2e2114/yt/utilities/lib/fnv_hash.pyx – ngoldbaum Feb 19 '18 at 22:57
  • How many columns do you have? Always 4? or maybe as much as 10? Or can there be let's say 1000 columns, which is a completely different scenario? – ead Feb 20 '18 at 04:43
  • @ead There can be any number of columns in theory, but typically it's less than 100. They can also be any type, and in reality I will have several 2D arrays, each of a different type. Essentially, I have a table of data from a database and want to find the unique rows. – Ted Petrou Feb 20 '18 at 16:07

2 Answers


You can use ndarray.view() to reinterpret each row as a single fixed-width byte string, and then use pandas.Series.duplicated() to find the duplicated rows:

import numpy as np
import pandas as pd

a = np.random.randint(0, 5, size=(200, 3))
s = pd.Series(a.view(("S", a[0].nbytes))[:, 0])
s.duplicated()
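
duplicated() returns True for every row that repeats an earlier one; inverting it gives the same kind of uniqueness mask as the idx arrays in the question:

idx = ~s.duplicated().values   # True for the first occurrence of each row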

The core algorithm of duplicated() is implemented in Cython. However, it needs to convert the original array to an object array, which may be slow.

To skip the object array, you can use the khash library that pandas uses directly; here is the C code:

#include "khash.h"

/* a row viewed as n raw bytes */
typedef struct _Buf{
    unsigned short n;
    char * pdata;
} Buf;

/* Bob Jenkins' one-at-a-time hash over the row's bytes */
khint32_t kh_buf_hash_func(Buf key)
{
    int i;
    char * s;
    khint32_t hash = 0;
    s = key.pdata;
    for(i=0;i<key.n;i++)
    {
        hash += *s++;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);    
    return hash;
}

/* rows are equal when their byte buffers match exactly */
khint32_t kh_buf_hash_equal(Buf a, Buf b)
{
    int i;
    if(a.n != b.n) return 0;
    for(i=0;i<a.n;i++){
        if(a.pdata[i] != b.pdata[i]) return 0;
    }
    return 1;
}

KHASH_INIT(buf, Buf, char, 0, kh_buf_hash_func, kh_buf_hash_equal)


/* res[i] is set to 0 for the first occurrence of a row, 1 for a duplicate */
void duplicated(char * arr, int row_size, int count, char * res)
{
    kh_buf_t * khbuf;
    Buf row;
    int i, absent;
    khint_t k;
    row.n = row_size;

    khbuf = kh_init_buf();
    kh_resize_buf(khbuf, 4 * count);

    for(i=0;i<count;i++){
        row.pdata = &arr[i * row_size];
        k = kh_put_buf(khbuf, row, &absent);
        if (absent){
            res[i] = 0;
        }
        else{
            res[i] = 1;
        }
    }    
    kh_destroy_buf(khbuf);
}

Then wrap the duplicated() function with Cython, ctypes, or cffi.
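
For illustration, a minimal ctypes wrapper sketch - assuming the C code above is compiled into a shared library named libdup.so (the library name and the duplicated_rows helper are illustrative, not part of pandas or khash):

import ctypes
import numpy as np

lib = ctypes.CDLL('./libdup.so')              # hypothetical library name
lib.duplicated.restype = None
lib.duplicated.argtypes = [ctypes.c_void_p,   # arr: raw row bytes
                           ctypes.c_int,      # row_size: bytes per row
                           ctypes.c_int,      # count: number of rows
                           ctypes.c_void_p]   # res: one flag byte per row

def duplicated_rows(a):
    a = np.ascontiguousarray(a)               # rows must be contiguous bytes
    res = np.zeros(a.shape[0], dtype=np.uint8)
    lib.duplicated(a.ctypes.data, a.strides[0], a.shape[0], res.ctypes.data)
    return res.astype(bool)                   # True where a row repeats an earlier one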

HYRY
  • @TedPetrou, I modified the answer to include the c code that can be wrapped by Cython. – HYRY Feb 20 '18 at 03:24

Surprisingly to me, this is slower, but for whatever it's worth, here is a C++ solution that does what you were pointing at - hashing each row as a set of bytes. The 'trick' is taking the address of an element with <char*>&a[i, 0]; most everything else is book-keeping.

I may be doing something obviously sub-optimal, and/or performance may be better with a different hash table implementation.

Edit:

Re: how to create a string from a row - I think the best you can do is the following: construct a bytes object from the pointer. This necessarily involves a copy of the row; see the C API docs.

%%cython
from numpy cimport *
cimport numpy as np
import numpy as np
from cpython.bytes cimport PyBytes_FromStringAndSize

def unique_int_string(ndarray[np.int64_t, ndim=2] a):
    cdef int i, len_before
    cdef int nr = a.shape[0]
    cdef int nc = a.shape[1]
    cdef set s = set()
    cdef ndarray[np.uint8_t, cast = True] idx = np.zeros(nr, dtype='bool')
    cdef bytes string

    for i in range(nr):
        len_before = len(s)
        string = PyBytes_FromStringAndSize(<char*>&a[i, 0], sizeof(np.int64_t) * nc)
        s.add(string)
        if len(s) > len_before:
            idx[i] = True
    return idx

// timing

In [9]: from unique import unique_ints

In [10]: %timeit unique_int_tuple4(a)
100 loops, best of 3: 10.1 ms per loop

In [11]: %timeit unique_ints(a)
100 loops, best of 3: 11.9 ms per loop

In [12]: (unique_ints(a) == unique_int_tuple4(a)).all()
Out[12]: True

// helper.h

#include <unordered_set>
#include <cstring>

struct Hasher {
    size_t size;
    size_t operator()(char* buf) const {
        // https://github.com/yt-project/yt/blob/c1569367c6e3d8d0a02e10d0f3d0bd701d2e2114/yt/utilities/lib/fnv_hash.pyx
        size_t hash_val = 2166136261;
        for (int i = 0; i < size; ++i) {
                hash_val ^= buf[i];
                hash_val *= 16777619;
        }
        return hash_val;
    }
};
struct Comparer {
    size_t size;
    bool operator()(char* lhs, char* rhs) const {
        return std::memcmp(lhs, rhs, size) == 0;
    }
};

struct ArraySet {
    std::unordered_set<char*, Hasher, Comparer> set;

    ArraySet (size_t size) : set(0, Hasher{size}, Comparer{size}) {}
    ArraySet () {}

    bool add(char* buf) {
        auto p = set.insert(buf);
        return p.second;
    }
};

// unique.pyx

from numpy cimport int64_t, uint8_t
import numpy as np

cdef extern from 'helper.h' nogil:
    cdef cppclass ArraySet:
        ArraySet()
        ArraySet(size_t)
        bint add(char*)


def unique_ints(int64_t[:, :] a):
    cdef:
        Py_ssize_t i, nr = a.shape[0], nc = a.shape[1]
        ArraySet s = ArraySet(sizeof(int64_t) * nc)
        uint8_t[:] idx = np.zeros(nr, dtype='uint8')

        bint found

    for i in range(nr):
        found = s.add(<char*>&a[i, 0])
        if found:
            idx[i] = True

    return idx

// setup.py

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

exts = [
  Extension('unique', ['unique.pyx'], language='c++', include_dirs=[np.get_include()])
]

setup(name='test', ext_modules=cythonize(exts))
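
// building

With helper.h, unique.pyx, and setup.py in one directory, the usual in-place build makes the module importable as in the timing session above:

python setup.py build_ext --inplace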
chrisb
  • Thanks for this very detailed answer. It seems you created a custom hasher to hash multiple integers. Could you have used a `vector` instead? Also, I've seen that the default `unordered_set` has worse performance than Python's `set`. – Ted Petrou Feb 20 '18 at 16:28
  • More importantly, I have a question on the 'trick' `&a[i, 0]` (thanks very much for this). How do I convert an entire row directly to a string? Attempting: `&a[i]` produces a compile error: `Cannot take address of memoryview slice` – Ted Petrou Feb 20 '18 at 16:31
  • 1) `std::hash` isn't defined by default for vectors either, so would have to define a custom hasher in that case too – chrisb Feb 20 '18 at 17:15
  • Wow, you are some kind of wizard. Very cool. I tested the new solution and it's 50% slower than the tuple of ints. Shouldn't there be a faster way to convert a row to a string? @HYRY solution above uses `a.view(("S", a[0].nbytes))` – Ted Petrou Feb 20 '18 at 17:42
  • I think that's as fast as it gets (could be wrong!) - Python strings own their memory, so they must take a copy of the buffer, whereas @HYRY is creating a numpy bytes type that views back into the original data, similar to what I was attempting with the C++ version. – chrisb Feb 20 '18 at 17:45
  • Ok, this is fantastic. I was wrong about the timing. In my actual dataset I am getting much better performance with your version, so the number of uniques and row length must make a difference. The performance is also much better for floats (got 4x better on a mostly unique array of 10**5 by 10) – Ted Petrou Feb 20 '18 at 18:54
  • By the way, I am attempting to build a simpler and more performant version of pandas called [dexplo](https://github.com/dexplo/dexplo) and have made all operations faster except for grouping by floats and large range ints. Grouping by strings is already faster just by iterating through sets and now this addition should pull it further ahead. If you are interested in helping out, let me know. – Ted Petrou Feb 20 '18 at 18:58