You can use `view` to change the dtype of `M` so that an entire row (or column) is viewed as a single array of bytes. Then `np.unique` can be applied to find the unique values:
```python
import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).

    This views the last axis of ND-arrays as np.void (bytes) so
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
    - `asvoid` will work for integer dtypes, but be careful if using
      `asvoid` on float dtypes, since float zeros may compare UNEQUALLY:

          >>> asvoid([-0.]) == asvoid([0.])
          array([False])

    - `asvoid` works best on contiguous arrays. If the input is not
      contiguous, `asvoid` copies the array to make it contiguous,
      which slows down performance.
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size=16) for i in range(2**16)])
print("nodistinctcols(X.T): {}".format(nodistinctcols(X.T)))
```
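To see why the void view enables whole-row comparison, here is a small illustration (the toy array is my own; `asvoid` is the helper defined above). Each row collapses to a single `np.void` scalar, so `==` and `np.unique` operate on rows as opaque byte blocks:

```python
import numpy as np

def asvoid(arr):
    # View each row as one opaque block of bytes (np.void),
    # so comparisons act on entire rows at once.
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [1, 2, 3]])
v = asvoid(a)
print(v.shape)            # (3, 1): one void scalar per row
print(v[0] == v[2])       # rows 0 and 2 are byte-identical
print(len(np.unique(v)))  # 2 distinct rows
```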
Benchmark:

```
In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop

In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop
```
where `nodistinctcols_orig` is defined by:

```python
def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)
```
Sanity check passes:

```
In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)
```
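A known-answer check on a tiny matrix can make the sanity check more concrete (the toy matrix is my own; the helpers are the ones defined above). Columns 0 and 2 below are identical, so there are 3 distinct columns:

```python
import numpy as np

def asvoid(arr):
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    # Count distinct columns of M by viewing each column
    # (a row of M.T) as a single void scalar.
    return len(np.unique(asvoid(M.T)))

M = np.array([[1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]])
print(nodistinctcols(M))  # 3: columns 0 and 2 coincide
```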
By the way, it might make more sense to define

```python
def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))
```

and simply pass `M.T` to the function when you wish to count the number of distinct columns. That way, the function is not slowed down by an unnecessary transpose when you use it to count distinct rows.
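A minimal sketch of that usage (toy data is my own; `asvoid` is the helper defined above):

```python
import numpy as np

def asvoid(arr):
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def num_distinct_rows(M):
    # Count distinct rows by viewing each row as one void scalar.
    return len(np.unique(asvoid(M)))

M = np.array([[1, 2],
              [1, 2],
              [3, 4]])
print(num_distinct_rows(M))    # 2: rows 0 and 1 coincide
print(num_distinct_rows(M.T))  # 2: both columns of M.T are distinct... wait,
                               # M.T's rows are [1, 1, 3] and [2, 2, 4] -> 2
```

Note that `M.T` is not contiguous, so `asvoid` pays for one copy via `np.ascontiguousarray`; that cost is incurred only when counting columns, not rows.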