
I need to count the number of distinct columns in relatively large arrays.

import numpy as np

def nodistinctcols(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])

print "nodistinctcols(X.T)", nodistinctcols(X.T)

The last line takes 20 seconds on my computer, which seems excessively slow. By contrast, building X itself with X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)]) takes 216 ms. Can nodistinctcols be sped up?

Simd

3 Answers


You can use view to change the dtype of M so that an entire row (or column) will be viewed as an array of bytes. Then np.unique can be applied to find the unique values:

import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).

    This views the last axis of ND-arrays as np.void (bytes) so 
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
        - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
        dtypes, since float zeros may compare UNEQUALLY:
        >>> asvoid([-0.]) == asvoid([0.])
        array([False], dtype=bool)

        - `asvoid` works best on contiguous arrays. If the input is not contiguous,
        `asvoid` will copy the array to make it contiguous, which will slow down the
        performance.

    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])

print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))

Benchmark:

In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop

In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop

where nodistinctcols_orig is defined by:

def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

Sanity check passes:

In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)

By the way, it might make more sense to define

def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))

and simply pass M.T to the function when you wish to count the number of distinct columns. That way, the function would not be slowed down by an unnecessary transpose if you wish to use it to count the number of distinct rows.
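A usage sketch under that convention (added for illustration, reusing the X defined above):

print(num_distinct_rows(X))    # number of distinct rows of X
print(num_distinct_rows(X.T))  # number of distinct columns of X; the
                               # non-contiguous transpose is copied inside asvoid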

unutbu
  • That is a truly amazing speed up! The great thing about python is that there is always another way to do anything that is 50 times faster :) – Simd Mar 31 '14 at 16:23
  • @felix: regarding your other question about circular plots: [here is a python-based solution](http://stackoverflow.com/q/5089030/190597) -- not nearly as fancy, but if you are generous, could be called somewhat similar :). – unutbu Apr 02 '14 at 19:44

If you have fewer rows than columns, you can also do multiple stable sorts along the rows and then count the uniques:

def count(x):
    # counts distinct rows of x (pass M.T to count distinct columns of M)
    x = x.copy()
    x = x[x[:,0].argsort()]  # first sort can be unstable
    for i in range(1, x.shape[1]):
        x = x[x[:,i].argsort(kind='mergesort')]  # subsequent sorts must be stable
    # x is now sorted so that equal rows are adjacent
    # -> compare each row with its neighbor and count the all-equal ones
    return x.shape[0] - np.count_nonzero((x[1:, :] == x[:-1,:]).all(axis=1))

With numpy 1.9.dev this is faster than the void comparison; with older numpys the fancy indexing kills the performance (about 4 times slower than the void approach):

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
Xt = X.T.copy()

In [6]: %timeit count(X)
10 loops, best of 3: 144 ms per loop

In [8]: %timeit unutbu_void(Xt)
10 loops, best of 3: 161 ms per loop
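As a quick correctness check (added for illustration, not from the original answer), the sort-based count should agree with a plain Python set of row tuples:

assert count(X) == len(set(map(tuple, X)))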
jtaylor

Just for future reference, don't sleep on old-fashioned approaches like using set. Will it be as fast and memory-efficient as a clever numpy approach? No. But often it's good enough for now, which is nothing to sneeze at when you're on the clock.

In [25]: %time slow = nodistinctcols(X.T)
CPU times: user 28.2 s, sys: 12 ms, total: 28.2 s
Wall time: 28.2 s

In [26]: %time medium = len(set(map(tuple, X)))
CPU times: user 324 ms, sys: 0 ns, total: 324 ms
Wall time: 322 ms

In [27]: slow == medium
Out[27]: True

What was slow wasn't the set part; it was the string conversion.
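To see that for yourself (a sketch mimicking the question's loop, not part of the original answer), swap repr for tuple and compare:

row_strings = set(repr(row) for row in X)  # slow: formats every row as text
row_tuples  = set(map(tuple, X))           # much cheaper: plain tuple hashing
assert len(row_strings) == len(row_tuples)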

DSM