Get unique intersection values of two sets

Question

I'd like to get the indices of the unique rows that two matrices have in common, using hashes (hashing is efficient for matrices). However, np.intersect1d returns values rather than indices, while np.in1d can be used for indexing but its result is not unique. I built a dict with zip to make it work, but that doesn't seem like the most efficient approach. I am new to Python, so I'd like to know whether there is a better way to do this. Thanks for the help!

code:

import numpy as np
import hashlib
x=np.array([[1, 2, 3],[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y=np.array([[4, 5, 6], [7, 8, 9],[1, 2, 3]])

xhash=[hashlib.sha1(row).digest() for row in x]
yhash=[hashlib.sha1(row).digest() for row in y]
z=np.intersect1d(xhash,yhash)

idx=list(range(len(xhash)))

d=dict(zip(xhash,idx))
unique_idx=[d[i] for i in z] #is there a better way to get this or boolean array
print(unique_idx)
uniques=np.array([x[i] for i in unique_idx])
print(uniques)

output:

>>> [2, 3, 1]
[[4 5 6]
 [7 8 9]
 [1 2 3]]

I'm having a similar issue for np.unique() where it doesn't give me any indexes.

Get the row indices from the [answers posted to this question](http://stackoverflow.com/questions/38674027/find-the-row-indexes-of-several-values-in-a-numpy-array) and just index into the first array with those indices for your desired output. – Divakar, Sep 20 '16 at 05:07

Eelco Hoogendoorn · Answer 1 · 2016-09-20T11:27:49.210

1

The numpy_indexed package (disclaimer: I am its author) has efficient functionality for doing things like this (and related functionality):

import numpy_indexed as npi
uniques = npi.intersection(x, y)

Note that this solution does not use hashing, but bitwise equality of the elements of the sequence; so no risk of hash collisions, and likely a lot faster in practice.
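For comparison, here is a minimal plain-NumPy sketch of the same idea of comparing rows directly instead of hashing them. It is not how numpy_indexed is implemented; it assumes NumPy 1.13 or later (for the axis argument of np.unique) and inputs small enough that a full pairwise comparison fits in memory. The names unique_rows, first_idx and in_y are only illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([[4, 5, 6], [7, 8, 9], [1, 2, 3]])

# Unique rows of x, plus the index of each row's first occurrence in x.
unique_rows, first_idx = np.unique(x, axis=0, return_index=True)

# Compare every unique row of x with every row of y via broadcasting:
# (n_unique, 1, n_cols) == (1, n_y, n_cols) -> (n_unique, n_y, n_cols).
# A row of x is "in y" if all of its columns match some row of y.
in_y = (unique_rows[:, None, :] == y[None, :, :]).all(axis=2).any(axis=1)

# Indices into x of the rows that also appear in y, in order of appearance.
unique_idx = np.sort(first_idx[in_y])

print(unique_idx)     # [0 2 3]
print(x[unique_idx])
```

The pairwise comparison costs memory proportional to the number of unique rows of x times the number of rows of y, so this is a sketch for modest sizes rather than a drop-in replacement for large matrices.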

edited Sep 20 '16 at 11:27

answered Sep 20 '16 at 06:40


I downloaded via conda install numpy-indexed -c conda-forge but I still get ImportError: No module named 'numpy_indexed' – Rik Sep 20 '16 at 11:18
strange... is it listed when you call conda list on your env? – Eelco Hoogendoorn Sep 20 '16 at 11:23
just created a fresh conda env myself using 'conda create -n testnpi python numpy-indexed', and this is what i get when i call conda list: 'numpy-indexed 0.3.4 py35_0 conda-forge'; importing numpy_indexed works fine. Have you run 'conda update conda' lately? – Eelco Hoogendoorn Sep 20 '16 at 11:27
Hmm, yea I tried that but I get this: rik@rik-MS-7971:~$ conda create -n testnpi python numpy-indexed Fetching package metadata ....... Solving package specifications: . PackageNotFoundError: Package not found: '' Package missing in current linux-64 channels: - numpy-indexed You can search for packages on anaconda.org with anaconda search -t conda numpy-indexed – Rik Sep 20 '16 at 23:06
Sorry, I'm a bit of a newbie and not sure how to do this stuff – Rik Sep 20 '16 at 23:07
pip install should also work if you pip install pyyaml first – Eelco Hoogendoorn Sep 21 '16 at 06:25
wrt PackageNotFoundError; I have conda-forge in my .condarc file, but I suppose conda create has a -c flag as well. – Eelco Hoogendoorn Sep 21 '16 at 06:26

score 1 · Answer 2 · answered Sep 20 '16 at 11:19

Use np.unique's return_index argument to get the first-occurrence indices of the unique values among the rows selected by in1d.

code:

import numpy as np
import hashlib
x=np.array([[1, 2, 3],[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y=np.array([[1, 2, 3], [7, 8, 9]])
xhash=[hashlib.sha1(row).digest() for row in x]
yhash=[hashlib.sha1(row).digest() for row in y]
z=np.in1d(xhash,yhash)

## Use unique to get the first-occurrence positions within the in1d results
_,unique=np.unique(np.array(xhash)[z],return_index=True)

## Compute indices into x by indexing an array of indices
idx=np.arange(len(xhash))
unique_idx=idx[z][unique]

print('x=',x)
print('unique_idx=',unique_idx)
print('x[unique_idx]=',x[unique_idx])

Output:

x= [[1 2 3]
 [1 2 3]
 [4 5 6]
 [7 8 9]]
unique_idx= [3 0]
x[unique_idx]= [[7 8 9]
 [1 2 3]]
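Since the question also asks for a boolean array, here is a hedged sketch of a dictionary-based variant that yields both the first-occurrence indices and a mask. It swaps the SHA-1 digests for the raw row bytes as keys, which assumes x and y share the same dtype and row length; y_keys, first_seen and mask are illustrative names, not part of the answer above.

```python
import numpy as np

x = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([[1, 2, 3], [7, 8, 9]])

# Raw bytes of each row of y act as set members for fast lookup.
y_keys = {row.tobytes() for row in y}

# Record the first index in x at which each shared row appears.
first_seen = {}
for i, row in enumerate(x):
    key = row.tobytes()
    if key in y_keys:
        first_seen.setdefault(key, i)  # keeps the earliest index only

unique_idx = np.array(sorted(first_seen.values()), dtype=int)

# Boolean mask over the rows of x: True at the first occurrence of each shared row.
mask = np.zeros(len(x), dtype=bool)
mask[unique_idx] = True

print(unique_idx)  # [0 3]
print(mask)        # [ True False False  True]
```

Unlike the dict(zip(xhash, idx)) approach in the question, which keeps the last index of a duplicated row, setdefault keeps the first one.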
