Pandas nunique equivalent with NumPy

Question

Is there a NumPy equivalent of pandas' row-wise nunique? I tried np.unique with return_counts, but it doesn't return what I want. For example:

a = np.array([[120.52971, 75.02052, 128.12627], [119.82573, 73.86636, 125.792],
       [119.16805, 73.89428, 125.38216],  [118.38071, 73.35443, 125.30198],
       [118.02871, 73.689514, 124.82088]])
uniqueColumns, occurCount = np.unique(a, axis=0, return_counts=True) ## axis=0 row-wise

The results:

>>> occurCount
array([1, 1, 1, 1, 1], dtype=int64)

I was expecting all 3s rather than all 1s.

The workaround, of course, is to convert to pandas and call nunique, but that is slow, and I want to explore a pure NumPy implementation to speed things up. I am working with large dataframes, so I am hoping to find speedups wherever I can. I am open to other solutions for speeding this up too.

Just to verify/confirm what you are expecting, what would be the pandas solution? Would it be `pd.DataFrame(a).nunique()`? — Divakar, Feb 05 '20 at 14:25
Sorry about the dupe. I had completely misunderstood your question. — Mad Physicist, Feb 05 '20 at 14:30
@MadPhysicist so I am doing row-wise unique count. There are 5 rows so I am expecting a length 5 array where each element computes the number of unique values of the nth row. Let me know if the question can be worded better. thanks — user1234440, Feb 05 '20 at 14:35
`np.unique(a, axis=0)` gives you the unique rows not the unique elements per row — scleronomic, Feb 05 '20 at 14:37
Oh, correct, my bad. Thanks for the correction, I totally misunderstood. I'll change that — user1234440, Feb 05 '20 at 14:38
Does this answer your question? [Number of unique elements per row in a NumPy array](https://stackoverflow.com/questions/48473056/number-of-unique-elements-per-row-in-a-numpy-array) — anishtain4, Feb 05 '20 at 15:16
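
The distinction raised in the comments can be seen on a small made-up array (`demo` is illustrative, not the asker's data). This is only a sketch: `np.unique(..., axis=0)` collapses identical rows, while a per-row count needs `np.unique` applied to each row separately.

```python
import numpy as np

# Hypothetical 3x3 array: the first two rows are identical.
demo = np.array([[1, 1, 2],
                 [1, 1, 2],
                 [3, 4, 5]])

# axis=0 treats each row as one item: it returns the distinct rows
# and how many times each distinct row occurs.
unique_rows, row_counts = np.unique(demo, axis=0, return_counts=True)

# What the question actually wants: the number of distinct values in each row.
# A plain loop is used here only for clarity, not for speed.
per_row = np.array([np.unique(row).size for row in demo])

print(unique_rows)
print(row_counts)
print(per_row)
```

So the all-ones result in the question just says that every row of `a` occurs exactly once.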

Divakar · Answer 1 · 2020-02-05T14:52:33.350

We can use some sorting and consecutive differences -

a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
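
To see why this works, here is a small sketch on a made-up array (the helper name `nunique_rows` and the array `demo` are assumptions for illustration). After sorting each row, equal values sit next to each other, so every zero in the consecutive differences marks one repeated value.

```python
import numpy as np

def nunique_rows(arr):
    """Count distinct values in each row of a 2D array (hypothetical helper)."""
    sorted_rows = np.sort(arr, axis=1)            # duplicates become adjacent
    repeats = np.diff(sorted_rows, axis=1) == 0   # True where a value equals its left neighbour
    return arr.shape[1] - repeats.sum(axis=1)     # columns minus repeats = distinct values

demo = np.array([[3.0, 1.0, 3.0, 2.0],
                 [5.0, 5.0, 5.0, 5.0]])
counts = nunique_rows(demo)
print(counts)
```

The first row sorts to 1, 2, 3, 3 with one zero difference, giving 4 - 1 = 3; the second row has three zero differences, giving 4 - 3 = 1.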

For some perf. boost, we can use slicing to replace np.diff -

a_s = np.sort(a,axis=1)
out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)

If you want to introduce a tolerance when checking uniqueness, we can use np.isclose -

a.shape[1]-(np.isclose(np.diff(np.sort(a,axis=1),axis=1),0)).sum(1)
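
As a rough illustration of the difference between the exact and the tolerant check, the sketch below uses a made-up array in which two values differ by about 1e-12. The helper name `nunique_rows_tol` and its `atol` parameter are assumptions for this example; the default mirrors `np.isclose`'s own default absolute tolerance of 1e-8.

```python
import numpy as np

def nunique_rows_tol(arr, atol=1e-8):
    """Per-row distinct count, treating neighbours closer than atol as equal (hypothetical helper)."""
    gaps = np.diff(np.sort(arr, axis=1), axis=1)    # gaps between sorted neighbours
    near_repeats = np.isclose(gaps, 0, atol=atol)   # comparing against 0, so only atol matters
    return arr.shape[1] - near_repeats.sum(axis=1)

demo = np.array([[1.0, 1.0 + 1e-12, 2.0],
                 [1.0, 1.1, 2.0]])

# Exact comparison: the tiny gap in the first row is nonzero, so all three values count.
exact = demo.shape[1] - (np.diff(np.sort(demo, axis=1), axis=1) == 0).sum(1)
tolerant = nunique_rows_tol(demo)

print(exact)
print(tolerant)
```

Note that only sorted neighbours are compared, so a run of values that are each within tolerance of the next collapses into a single value even if the two ends of the run are further apart than the tolerance.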

Sample run -

In [51]: import pandas as pd

In [48]: a
Out[48]: 
array([[120.52971 , 120.52971 , 128.12627 ],
       [119.82573 ,  73.86636 , 125.792   ],
       [119.16805 ,  73.89428 , 125.38216 ],
       [118.38071 , 118.38071 , 118.38071 ],
       [118.02871 ,  73.689514, 124.82088 ]])

In [49]: pd.DataFrame(a).nunique(axis=1).values
Out[49]: array([2, 3, 3, 1, 3])

In [50]: a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
Out[50]: array([2, 3, 3, 1, 3])

Timings on a simplistic case with random numbers and at least 2 unique numbers per row -

In [41]: np.random.seed(0)
    ...: a = np.random.rand(10000,5)
    ...: a[:,-1] = a[:,0]

In [42]: %timeit pd.DataFrame(a).nunique(axis=1).values
    ...: %timeit a.shape[1]-(np.diff(np.sort(a,axis=1),axis=1)==0).sum(1)
1.31 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
758 µs ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [43]: %%timeit
    ...: a_s = np.sort(a,axis=1)
    ...: out = a.shape[1]-(a_s[:,:-1] == a_s[:,1:]).sum(1)
694 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
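
Since the pandas call is slow to use as a reference, one way to sanity-check the vectorized result is against a plain per-row `np.unique` loop. The sketch below reuses the random setup from the timings; the names `fast` and `slow` are illustrative, and the loop is meant only for checking, not for speed.

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(10000, 5)
a[:, -1] = a[:, 0]  # force one duplicated value in every row

# Vectorized version: sort each row, then count equal neighbours.
a_s = np.sort(a, axis=1)
fast = a.shape[1] - (a_s[:, :-1] == a_s[:, 1:]).sum(1)

# Slow reference: distinct values per row via np.unique.
slow = np.array([np.unique(row).size for row in a])

print(np.array_equal(fast, slow))
```

This check assumes the data contains no NaN values. In the sorted-comparison approach, NaN never compares equal to NaN, so each NaN is counted as its own value, whereas pandas' nunique drops NaN by default.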
