
For example, for

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I want to get

[2, 2, 3]

Is there a way to do this without for loops or using np.vectorize?

Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
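
For reference, a minimal sketch of that ultimate goal (percentage of rows containing at least one duplicate). The array shape, variable names, and the reuse of the sort-based counting shown in the answers below are my own assumptions, not part of the question:

import numpy as np

np.random.seed(0)
a = np.random.randint(1, 366, size=(1000, 100))    # stand-in for the described data

# A row contains a duplicate iff its number of unique elements is less than the row length.
b = np.sort(a, axis=1)
nunique_per_row = (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1
pct_with_dupes = 100.0 * np.mean(nunique_per_row < a.shape[1])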

DrApe

4 Answers


Approach #1

One vectorized approach is to sort each row; in a sorted row, the number of unique elements is the number of adjacent pairs that differ, plus one -

In [8]: b = np.sort(a,axis=1)

In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])

Approach #2

Another method, for non-negative ints that aren't very large, is to add a per-row offset so that each row's elements fall into their own disjoint range of bins, then do a binned summation with np.bincount and count the number of non-zero bins per row. Note that this allocates roughly a.shape[0]*(a.max()+1) bins, so memory can become the limiting factor for large values -

n = a.max()+1                                   # number of bins needed per row (assumes non-negative ints)
a_off = a+(np.arange(a.shape[0])[:,None])*n     # shift each row into its own disjoint range of bins
M = a.shape[0]*n                                # total number of bins across all rows
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)   # non-empty bins per row

Runtime test

Approaches as funcs -

import numpy as np
import pandas as pd
from toolz import compose

def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post   
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a) 

Case #1 : Square shaped one

In [164]: np.random.seed(0)

In [165]: a = np.random.randint(0,5,(10000,10000))

In [166]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop

Case #2 : Large number of rows

In [167]: np.random.seed(0)

In [168]: a = np.random.randint(0,5,(1000000,10))

In [169]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop

Extending to number of unique elements per column

To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -

def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
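
A quick sanity check on the question's array (my own addition, not part of the original answer):

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
nunique_percol_sort(a)       # -> array([2, 2, 2])
nunique_percol_bincount(a)   # -> array([2, 2, 2])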

Generic ndarray with generic axis

Let's see how we can extend this to an ndarray of generic dimensions and get the number of unique elements along a generic axis. We will make use of np.diff with its axis param to get the consecutive differences, like so -

def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1

Sample runs -

In [77]: a
Out[77]: 
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])

In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])

In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])

If you are working with floating point numbers and want to decide uniqueness based on some tolerance value rather than an absolute match, we can use np.isclose. Two such options would be -

(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)

For a custom tolerance, pass rtol/atol into np.isclose.
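
Wrapping the first option into a function, as an illustrative sketch (the name nunique_tol and the default tolerances are my assumptions, not part of the original answer):

def nunique_tol(a, axis, rtol=1e-05, atol=1e-08):
    # Sort along the axis, then count consecutive differences that are not
    # within tolerance of zero; add 1 to account for the first element.
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    return (~np.isclose(d, 0, rtol=rtol, atol=atol)).sum(axis) + 1

nunique_tol(np.array([[1.0, 1.0 + 1e-9, 2.0]]), axis=1)   # -> array([2])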

Divakar
  • Surely sorting is needlessly inefficient? Intuitively I thought this should be O(n) not O(n log n) – wim Jan 27 '18 at 06:09
  • @wim There's a better way without sorting? Well that sorting isn't done at Python level, so that O notation would be different. – Divakar Jan 27 '18 at 06:12
  • Not sure :) I haven't found one yet, but I'm confident there should be. – wim Jan 27 '18 at 06:13
  • Brilliant idea! – ChoF Jan 27 '18 at 06:21
  • What do you mean "that sorting isn't done at Python level, so that O notation would be different"? Why would the asymptotic complexity be different? – wim Jan 27 '18 at 06:51
  • @wim From what I know, this NumPy sorting would be at C level. So, if you are talking about perf. complexity, that O would be different than an O at Python level. Does that make sense? – Divakar Jan 27 '18 at 06:54
  • Not really, because complexity is a property of the algorithm, that's independent of the implementation language. For sure sorting in C will be faster than sorting in Python, but not a different [O](https://en.wikipedia.org/wiki/Big_O_notation). – wim Jan 27 '18 at 07:03
  • @wim Well we are talking about perf. here, so it matters which language it's implemented in. So, when talking about O notation, we need to include that into discussion :) – Divakar Jan 27 '18 at 07:10
  • *So, it's more likely that they are working with an array with a large number of rows. Hence, let's setup the input array keeping that in mind* <-- This is a somewhat gratuitous assumption, I've edited in some timings for a large square matrix so as not to unfairly bias against pandas here. – wim Jan 27 '18 at 18:03
  • @wim `bincount` one was feeling left out. Added timings for that one, to be *fair* on that one too. – Divakar Jan 27 '18 at 18:17
  • Thanks. The bincount is very clever, but you have to take care to make sure you can afford the memory, and you don't overflow the dtype! – wim Jan 27 '18 at 18:19
  • @wim As noted in the post. – Divakar Jan 27 '18 at 18:20
  • You mentioned about the memory. But you're only using integers from 1-5 so it's much harder to overflow, and you didn't mention that. – wim Jan 27 '18 at 18:21
  • Hmm, bincount has bugs. I tried it with `np.random.randint(0, 5000, (100,100))` and it crashed with a `ValueError`. – wim Jan 27 '18 at 18:29
  • Please include jp_data_analysis answer in timings too :) – wim Jan 27 '18 at 18:31
  • @wim Thanks! Was a bug indeed. Should be fixed now. Added timings for the other solution too. – Divakar Jan 27 '18 at 18:53
  • `bincount` is really good. Not only can we use the method in this answer to compute the number of uniques, but we can also know the most frequent value of each row by doing `np.argmax` on the result of `bincount`. – Fanchen Bao May 12 '23 at 21:12
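
As an aside, a minimal sketch of the most-frequent-value idea from the last comment (my own addition; the function name is hypothetical):

def most_frequent_per_row(a):
    # Reuse the per-row offset trick, then take argmax over each row's bins.
    n = a.max() + 1
    a_off = a + np.arange(a.shape[0])[:, None] * n
    counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
    return counts.argmax(axis=1)

most_frequent_per_row(np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]]))   # -> array([0, 0, 2])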

This solution via np.apply_along_axis isn't vectorised and involves a Python-level loop, but it is relatively intuitive, combining len with np.unique.

import numpy as np
from toolz import compose

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

np.apply_along_axis(compose(len, np.unique), 1, a)    # [2, 2, 3]
jpp
  • If you don't want to use the `toolz` package, you can use the following alternative code via the `lambda` function: `np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr=a)`. – aysljc Jun 19 '23 at 01:57

A one-liner using sort:

In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
kuppern87

Are you open to considering pandas? DataFrames have a dedicated method for this:

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3
wim