10

I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...

import numpy


uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])

test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])

# this gives me the indices of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])

This prints...

[0, 1, 0] [ 1  4 10]
[1, 1, 0] [ 3  8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]

Is there a better or more numpythonic (not sure if that word exists) way to do this? I searched for a numpy group function but could not find one. Basically, for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always contain every unique row, nor the same number of rows.

EDIT: This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. uniq_rows could have between 4 and 128 rows, and test_rows could have hundreds of thousands of rows.

b10hazard

7 Answers

2

Numpy

From version 1.13 of numpy you can use numpy.unique on whole rows, e.g. np.unique(test_rows, return_counts=True, return_index=True, axis=0) (note axis=0, so rows are the units of comparison; return_inverse is the option that maps each row back to its group).
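A minimal sketch of how the axis-aware np.unique recovers the groups via return_inverse (requires numpy 1.13+; the dict comprehension at the end is just one way to collect the positions):

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# unique rows plus the inverse map: inv[i] is the index (into uniq)
# of the unique row that test_rows[i] equals
uniq, inv = np.unique(test_rows, axis=0, return_inverse=True)

# collect the original positions belonging to each unique row
groups = {tuple(uniq[k]): np.flatnonzero(inv == k) for k in range(len(uniq))}
```

Note this groups every distinct row of test_rows; filtering down to a given uniq_rows list is then a plain dict lookup.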

Pandas

df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)

uniq

    0   1   2
0   0   1   0
1   1   1   0
2   1   1   1
3   0   1   1

Or you could generate the unique rows automatically from the incoming DataFrame

uniq_generated = df.drop_duplicates().reset_index(drop=True)

yields

    0   1   2
0   0   1   1
1   0   1   0
2   0   0   0
3   1   1   0
4   1   1   1

and then look up each unique row's indices

d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values

This is about the same as your where method

d

{0: array([ 1,  4, 10], dtype=int64),
 1: array([ 3,  8, 12], dtype=int64),
 2: array([7, 9], dtype=int64),
 3: array([0, 5, 6], dtype=int64)}
Maarten Fabré
  • Didn't know that they had changed np.unique in 1.13. I'll look into it. I cannot use the Pandas solution, pandas does a poor job cleaning up after itself so I cannot use it – b10hazard Jun 28 '17 at 15:12
  • What do you mean with `pandas does a poor job cleaning up after itself`? – Maarten Fabré Jun 28 '17 at 15:19
  • 1
    I ran into an issue where I was using pandas for a project and I could not get it to release all its memory. I posted my question here: https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe The posted solutions never worked for me so I don't use Pandas for production code any longer. – b10hazard Jun 28 '17 at 16:00
  • No reason to use pandas when numpy does the job just fine. `numpy.unique` seems to be the better solution here. – user2699 Jun 28 '17 at 18:45
2

There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.

np.where((uniq_rows[:, None, :] == test_rows).all(2))

Wonderfully simple, eh? This returns a tuple of two arrays: the unique-row indices and the corresponding test-row indices.

 (array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
  array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))

How it works:

(uniq_rows[:, None, :] == test_rows)

Uses array broadcasting to compare every row of uniq_rows against every row of test_rows elementwise, producing a 4x13x3 boolean array. all(2) then collapses the last axis, flagging the (unique row, test row) pairs that match in every column, and where returns the indices of those pairs.
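If per-row index lists are wanted rather than the paired arrays, one way (a sketch, not part of the original answer) is to split the second array wherever the first one's label changes; this works because where returns the unique-row indices in sorted order:

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# u_idx[k] is a uniq_rows index, t_idx[k] the matching test_rows index;
# u_idx comes out sorted, so matches for one unique row are contiguous
u_idx, t_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))

# split t_idx wherever the unique-row label changes: one group per row
splits = np.flatnonzero(np.diff(u_idx)) + 1
groups = {tuple(uniq_rows[u]): g
          for u, g in zip(np.unique(u_idx), np.split(t_idx, splits))}
```

The memory caveat from the answer still applies: the broadcast builds the full len(uniq_rows) x len(test_rows) x ncols comparison array.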

user2699
  • this is the best answer, thanks, simple enough, also gets both index systems, very impressive!!! – Jiadong Jun 03 '20 at 02:39
1

With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)

In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]: 
(array([[0, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [1, 1, 0],
        [1, 1, 1]]),
 array([2, 1, 0, 3, 7], dtype=int32),
 array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))

In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
     ...:     
In [167]: dd
Out[167]: 
defaultdict(list,
            {0: [2, 11],
             1: [1, 4, 10],
             2: [0, 5, 6],
             3: [3, 8, 12],
             4: [7, 9]})

or indexing the dictionary with the unique rows (as hashable tuple):

In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:     
In [172]: dd
Out[172]: 
defaultdict(list,
            {(0, 0, 0): [2, 11],
             (0, 1, 0): [1, 4, 10],
             (0, 1, 1): [0, 5, 6],
             (1, 1, 0): [3, 8, 12],
             (1, 1, 1): [7, 9]})
hpaulj
0

Approach #1

Here's one approach, though I'm not sure how "NumPythonic" a solution to such a tricky problem can be -

def get1Ds(a, b): # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype

    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)

    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]    

    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)

    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])    
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]

    return uniq_rows[sidx_A[validA_mask]], locations    

Possible performance improvements:

  1. np.split could be replaced by a for-loop for splitting using slicing.
  2. np.r_ could be replaced by np.concatenate.

Sample run -

In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)

In [332]: unq_rows
Out[332]: 
array([[0, 1, 0],
       [0, 1, 1],
       [1, 1, 0],
       [1, 1, 1]])

In [333]: idx
Out[333]: [array([ 1,  4, 10]), array([0, 5, 6]), array([ 3,  8, 12]), array([7, 9])]

Approach #2

Another approach, which avoids the setup overhead of the previous one while reusing get1Ds from it, would be -

A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
Divakar
  • 1
    v1.13 has added an `axis` parameter to `unique`. Have you tried it yet? – hpaulj Jun 26 '17 at 23:26
  • @hpaulj I still need to get hold on that latest version. So, no I haven't yet. – Divakar Jun 26 '17 at 23:27
  • 1
    Approach 2 is about 6-fold faster than my solution. I'm still playing around with it but so far it looks solid. – b10hazard Jun 28 '17 at 15:58
  • @b10hazard Awesome! Removed that `print` from approach #2 and collecting the indices from each group in a list and should be easier to time now. – Divakar Jun 28 '17 at 16:15
0

This will do the job:

import numpy as np

uniq_rows = np.array([[0, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])

test_rows = np.array([[0, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0],
                      [0, 1, 0],
                      [0, 1, 1],
                      [0, 1, 1],
                      [1, 1, 1],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0],
                      [0, 0, 0],
                      [1, 1, 0]])

diffs = np.abs(np.repeat(uniq_rows, len(test_rows), axis=0)
               - np.tile(test_rows, (len(uniq_rows), 1)))
indices = np.where(diffs.sum(axis=1) == 0)[0]
loc = indices // len(test_rows)
indices = indices - loc * len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]

This works in all cases, including those where not every row of uniq_rows is present in test_rows. However, if you somehow know ahead of time that all of them are present, you could replace the part

res = [[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])

with just the line:

res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)

Thus avoiding loops entirely.

Miriam Farber
0

Not very 'numpythonic', but for a bit of upfront cost we can build a dict mapping each row (as a tuple) to its list of indices:

test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)

test_rowsdict
{(0, 0, 0): [2, 11],
 (0, 1, 0): [1, 4, 10],
 (0, 1, 1): [0, 5, 6],
 (1, 1, 0): [3, 8, 12],
 (1, 1, 1): [7, 9]}

Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:

out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))

For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.

jeremycg
0

The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:

import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)

If you don't care about the indices of all rows, but only those matching uniq_rows, npi has a bunch of simple ways of tackling that problem too; for instance:

subset_indices = npi.indices(unique_test_rows, uniq_rows)

As a sidenote, it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, not the end goal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Would you care to give some more background on your problem?

EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them in the following encoding might boost efficiency a lot further still:

def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
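One way to use this encoding for the grouping itself (my sketch, not part of the answer; it assumes 0/1 entries and at most 8 columns, since the accumulator is np.uint8): each row collapses to a single integer, so matching becomes cheap 1-D comparisons:

```python
import numpy as np

def encode(rows):
    # bit i of the code is column i; only correct for 0/1 entries
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)

uniq_rows = np.array([[0, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

uniq_codes = encode(uniq_rows)
test_codes = encode(test_rows)

# 1-D equality per unique code instead of full row-wise comparisons
groups = {tuple(row): np.flatnonzero(test_codes == code)
          for row, code in zip(uniq_rows, uniq_codes)}
```

For the OP's edited constraints (values up to 32000), this binary packing no longer applies; the void-view trick or np.unique(axis=0) would be needed instead.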
Eelco Hoogendoorn