
I have two huge 2D NumPy integer arrays X and U, where U is assumed to have only unique rows. For each row in X I would like to get the index of the matching row in U (if there is one, otherwise -1). E.g., if the following arrays are passed as inputs:

U = array([[1, 4],
       [2, 5],
       [3, 6]])

X = array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

the output should be:

array([0,2,-1,0])

Is there an efficient way of doing this (or something similar) with Numpy?

@Divakar: Your approach fails for me:

print(type(rows), rows.dtype, rows.shape)
print(rows[:10])
print(search2D_indices(rows[:10], rows[:10]))

<class 'numpy.ndarray'> int32 (47398019, 5)
[[65536     1     1     1    17]
 [65536     1     1     1   153]
 [65536     1     1     2   137]
 [65536     1     1     3   153]
 [65536     1     1     9   124]
 [65536     1     1    13   377]
 [65536     1     1    13   134]
 [65536     1     1    13   137]
 [65536     1     1    13   153]
 [65536     1     1    13   439]]
[ 0  1  2  3  4 -1 -1 -1 -1  9]
user2224350

3 Answers


Approach #1

Inspired by this solution to Find the row indexes of several values in a numpy array, here's a vectorized solution using searchsorted -

def search2D_indices(X, searched_values, fillval=-1):
    # Collapse each row to a single scalar key via a mixed-radix encoding
    dims = np.maximum(X.max(0), searched_values.max(0))+1
    X1D = np.ravel_multi_index(X.T,dims)
    searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
    # Binary-search each searched key within the sorted keys of X
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0
    idx_out = sidx[idx]
    # Keep only exact matches; everything else gets fillval
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

Sample run -

In [121]: U
Out[121]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

In [122]: X
Out[122]: 
array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

In [123]: search2D_indices(U, X, fillval=-1)
Out[123]: array([ 0,  2, -1,  0])
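
To make the encoding step concrete, here is a rough hand trace of the intermediate 1D keys for this sample (my own illustration, hand-computed rather than a verified run; the variable names are just for exposition) -

import numpy as np

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

dims = np.maximum(U.max(0), X.max(0)) + 1   # array([8, 9])
U_keys = np.ravel_multi_index(U.T, dims)    # array([13, 23, 33])
X_keys = np.ravel_multi_index(X.T, dims)    # array([13, 33, 71, 13])
# searchsorted then locates each X key among the (sorted) U keys;
# the key 71 has no exact match, so that position gets fillval (-1).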

Approach #2

Extending to cases with negative ints, we need to offset the values before the conversion to 1D, like so -

def search2D_indices_v2(X, searched_values, fillval=-1):
    # Shift both arrays so every column starts at 0, then build a proper
    # mixed-radix encoding of the offset rows (this handles negative ints
    # and keeps distinct rows mapped to distinct scalar keys)
    offsets = np.minimum(X.min(0), searched_values.min(0))
    X_off = X - offsets
    searched_off = searched_values - offsets

    dims = np.maximum(X_off.max(0), searched_off.max(0)) + 1
    s = np.r_[1, dims[:-1]].cumprod()

    X1D = X_off.dot(s)
    searched_valuesID = searched_off.dot(s)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D, searched_valuesID, sorter=sidx)
    idx[idx==len(sidx)] = 0
    idx_out = sidx[idx]

    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

Sample run -

In [142]: U
Out[142]: 
array([[-1, -4],
       [ 2,  5],
       [ 3,  6]])

In [143]: X
Out[143]: 
array([[-1, -4],
       [ 3,  6],
       [ 7,  8],
       [-1, -4]])

In [144]: search2D_indices_v2(U, X, fillval=-1)
Out[144]: array([ 0,  2, -1,  0])
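
And a similar rough hand trace of the offset-and-encode step on this sample (again my own illustration, worked out by hand) -

import numpy as np

U = np.array([[-1, -4], [2, 5], [3, 6]])
X = np.array([[-1, -4], [3, 6], [7, 8], [-1, -4]])

offsets = np.minimum(U.min(0), X.min(0))            # array([-1, -4])
U_off, X_off = U - offsets, X - offsets             # every column now starts at 0
dims = np.maximum(U_off.max(0), X_off.max(0)) + 1   # array([ 9, 13])
s = np.r_[1, dims[:-1]].cumprod()                   # array([1, 9])
print(U_off.dot(s))                                 # [ 0 84 94]
print(X_off.dot(s))                                 # [  0  94 116   0]
# 116 appears nowhere among U's keys, so that query maps to -1.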

Approach #3

Another based on views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    # View each row as one opaque void-dtype element, so whole rows can be
    # sorted and compared at once (requires C-contiguous arrays of the same dtype)
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def search2D_indices_views(X, searched_values, fillval=-1):
    X1D,searched_valuesID = view1D(X, searched_values)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0    
    idx_out = sidx[idx]
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)
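
There was no sample run for this one in the original post; with view1D and search2D_indices_views defined as above (and NumPy imported as np), a quick check on the question's arrays should give the same result as before (output shown as expected, not a verified run) -

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

print(search2D_indices_views(U, X, fillval=-1))
# expected: [ 0  2 -1  0]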
Divakar

This is a dictionary-based method:

import numpy as np

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])

X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 1]])

d = {v: k for k, v in enumerate(map(tuple, U))}

res = np.array([d.get(tuple(a), -1) for a in X])

# [ 0  2 -1 -1]
jpp

You can use broadcasting to compare the rows in a vectorized manner. Afterwards you can simply use the all function over the proper axis to get the truth values corresponding to the expected indices. Finally, use np.where to get the indices where equality occurs and assign them into a previously created array filled with -1.

In [47]: result = np.full(X.shape[0], -1)

In [48]: x, y = np.where((X[:,None] == U).all(-1))

In [49]: result[x] = y

In [50]: result
Out[50]: array([ 0,  2, -1,  0])

Note that, as also mentioned in the documentation regarding broadcasting:

while this is very efficient in terms of lines of code, it may or may not be computationally efficient. The issue is the three-dimensional diff array calculated in an intermediate step of the algorithm. For small data sets, creating and operating on the array is likely to be very fast. However, large data sets will generate a large intermediate array that is computationally inefficient.
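
To put a number on that caveat (my own back-of-envelope addition, not part of the original answer): the intermediate boolean array produced by the comparison has shape (rows of X, rows of U, columns), so its size grows with the product of the two row counts -

import numpy as np

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

mask = (X[:, None] == U)    # boolean array of shape (4, 3, 2) here
print(mask.shape)           # (4, 3, 2); in general (len(X), len(U), n_cols)
# With millions of rows on both sides this intermediate array would not
# fit in memory, which is exactly the caveat quoted above.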

Mazdak
  • Why downvote? Is there anything wrong with this answer that I can improve? – Mazdak Apr 22 '18 at 10:42
  • Just because the performance isn't the *best* isn't a reason to downvote. The point is the best answer rises to the top through upvotes. +1. – jpp Apr 22 '18 at 10:51
  • @Kasramvd Not me and agree with jpp! – Divakar Apr 22 '18 at 10:53
  • @jpp This solution is completely vectorized but definitely might not be the best. Actually, nothing is the best, lol. But regardless of that, this goes against SO's downvoting guidance: one should only downvote if the answer is wrong or not related to the given question. – Mazdak Apr 22 '18 at 10:54
  • I could guess that @Divakar ;)) – Mazdak Apr 22 '18 at 10:56
  • This works fine for small arrays. However, for arrays of size ~100k or bigger I get the following error message: "..\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future. if sys.path[0] == '': ... AttributeError: 'bool' object has no attribute 'all' " (in the np.where call) – user2224350 Apr 22 '18 at 12:51
  • @user2224350 Regarding that error, it seems you're passing the arrays in the wrong order. As for performance on large arrays, as mentioned in the documentation: while this is very efficient in terms of lines of code, it may or may not be computationally efficient. The issue is the three-dimensional diff array calculated in an intermediate step of the algorithm. For small data sets, creating and operating on the array is likely to be very fast. However, large data sets will generate a large intermediate array that is computationally inefficient. – Mazdak Apr 22 '18 at 13:26
  • @Kasramvd Not my downvote but FWIW I really don't think this "violates SO's rule of downvoting". You can downvote whatever you please. And when the OP's question's first line is "I have 2 huge NumPy arrays" your solution doesn't look promising. – miradulo Apr 22 '18 at 14:37
  • @miradulo It does, but that's not the point. You're right that *you can downvote whatever you please*, and that's an obvious fact, you **can**, but that doesn't mean you should. Another point is that this solution is more straightforward and understandable than the other NumPy-based approaches (no offense intended, though) and answers the question quite correctly, just not in a blazing fast manner. This code has many advantages, one of which is solving the problem in a vectorized way, plus it shows the advantages and disadvantages of broadcasting. – Mazdak Apr 22 '18 at 15:49
  • Sure. My point is just that your answer patently will not work for large arrays. Thus claiming someone is breaking rules by downvoting you is strange. – miradulo Apr 22 '18 at 17:01