
I have two huge 2D NumPy integer arrays X and U, where U is assumed to have only unique rows. For each row in X I would like to get the index of the matching row in U (if there is one, otherwise -1). E.g., if the following arrays are passed as inputs:

U = array([[1, 4],
       [2, 5],
       [3, 6]])

X = array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

the output should be:

array([0,2,-1,0])

Is there an efficient way of doing this (or something similar) with Numpy?

@Divakar: Your approach fails for me:

print(type(rows), rows.dtype, rows.shape)
print(rows[:10])
print(search2D_indices(rows[:10], rows[:10]))

<class 'numpy.ndarray'> int32 (47398019, 5)
[[65536     1     1     1    17]
 [65536     1     1     1   153]
 [65536     1     1     2   137]
 [65536     1     1     3   153]
 [65536     1     1     9   124]
 [65536     1     1    13   377]
 [65536     1     1    13   134]
 [65536     1     1    13   137]
 [65536     1     1    13   153]
 [65536     1     1    13   439]]
[ 0  1  2  3  4 -1 -1 -1 -1  9]
user2224350

3 Answers


Approach #1

Inspired by this solution to Find the row indexes of several values in a numpy array, here's a vectorized solution using searchsorted -

def search2D_indices(X, searched_values, fillval=-1):
    # Collapse each row to a single scalar key via a mixed-radix encoding
    dims = np.maximum(X.max(0), searched_values.max(0))+1
    X1D = np.ravel_multi_index(X.T,dims)
    searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
    # Binary-search each searched key within the sorted keys of X
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0
    idx_out = sidx[idx]
    # Keep only exact matches; everything else gets fillval
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

Sample run -

In [121]: U
Out[121]: 
array([[1, 4],
       [2, 5],
       [3, 6]])

In [122]: X
Out[122]: 
array([[1, 4],
       [3, 6],
       [7, 8],
       [1, 4]])

In [123]: search2D_indices(U, X, fillval=-1)
Out[123]: array([ 0,  2, -1,  0])
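
To make the encoding step concrete, here is a rough hand trace of the intermediate 1D keys for this sample (my own illustration, hand-computed rather than a verified run; the variable names are just for exposition) -

import numpy as np

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

dims = np.maximum(U.max(0), X.max(0)) + 1   # array([8, 9])
U_keys = np.ravel_multi_index(U.T, dims)    # array([13, 23, 33])
X_keys = np.ravel_multi_index(X.T, dims)    # array([13, 33, 71, 13])
# searchsorted then locates each X key among the (sorted) U keys;
# the key 71 has no exact match, so that position gets fillval (-1).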

Approach #2

Extending to cases with negative ints, we need to offset the values before the conversion to 1D, like so -

def search2D_indices_v2(X, searched_values, fillval=-1):
    # Shift both arrays so every column starts at 0, then build a proper
    # mixed-radix encoding of the offset rows (this handles negative ints
    # and keeps distinct rows mapped to distinct scalar keys)
    offsets = np.minimum(X.min(0), searched_values.min(0))
    X_off = X - offsets
    searched_off = searched_values - offsets

    dims = np.maximum(X_off.max(0), searched_off.max(0)) + 1
    s = np.r_[1, dims[:-1]].cumprod()

    X1D = X_off.dot(s)
    searched_valuesID = searched_off.dot(s)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D, searched_valuesID, sorter=sidx)
    idx[idx==len(sidx)] = 0
    idx_out = sidx[idx]

    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)

Sample run -

In [142]: U
Out[142]: 
array([[-1, -4],
       [ 2,  5],
       [ 3,  6]])

In [143]: X
Out[143]: 
array([[-1, -4],
       [ 3,  6],
       [ 7,  8],
       [-1, -4]])

In [144]: search2D_indices_v2(U, X, fillval=-1)
Out[144]: array([ 0,  2, -1,  0])
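
And a similar rough hand trace of the offset-and-encode step on this sample (again my own illustration, worked out by hand) -

import numpy as np

U = np.array([[-1, -4], [2, 5], [3, 6]])
X = np.array([[-1, -4], [3, 6], [7, 8], [-1, -4]])

offsets = np.minimum(U.min(0), X.min(0))            # array([-1, -4])
U_off, X_off = U - offsets, X - offsets             # every column now starts at 0
dims = np.maximum(U_off.max(0), X_off.max(0)) + 1   # array([ 9, 13])
s = np.r_[1, dims[:-1]].cumprod()                   # array([1, 9])
print(U_off.dot(s))                                 # [ 0 84 94]
print(X_off.dot(s))                                 # [  0  94 116   0]
# 116 appears nowhere among U's keys, so that query maps to -1.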

Approach #3

Another based on views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    # View each row as one opaque void-dtype element, so whole rows can be
    # sorted and compared at once (requires C-contiguous arrays of the same dtype)
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def search2D_indices_views(X, searched_values, fillval=-1):
    X1D,searched_valuesID = view1D(X, searched_values)
    sidx = X1D.argsort()
    idx = np.searchsorted(X1D,searched_valuesID,sorter=sidx)
    idx[idx==len(sidx)] = 0    
    idx_out = sidx[idx]
    return np.where(X1D[idx_out] == searched_valuesID, idx_out, fillval)
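
There was no sample run for this one in the original post; with view1D and search2D_indices_views defined as above (and NumPy imported as np), a quick check on the question's arrays should give the same result as before (output shown as expected, not a verified run) -

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

print(search2D_indices_views(U, X, fillval=-1))
# expected: [ 0  2 -1  0]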
Divakar

This is a dictionary-based method:

import numpy as np

U = np.array([[1, 4],
              [2, 5],
              [3, 6]])

X = np.array([[1, 4],
              [3, 6],
              [7, 8],
              [1, 1]])

d = {v: k for k, v in enumerate(map(tuple, U))}

res = np.array([d.get(tuple(a), -1) for a in X])

# [ 0  2 -1 -1]
jpp

You can use broadcasting to compare the rows in a vectorized manner. Afterwards you can simply use the all function over the proper axis to get the truth values corresponding to the expected indices. Finally, use np.where to get the indices where equality occurs and assign them into a previously created array filled with -1.

In [47]: result = np.full(X.shape[0], -1)

In [48]: x, y = np.where((X[:,None] == U).all(-1))

In [49]: result[x] = y

In [50]: result
Out[50]: array([ 0,  2, -1,  0])

Note that, as also mentioned in the documentation regarding broadcasting:

while this is very efficient in terms of lines of code, it may or may not be computationally efficient. The issue is the three-dimensional diff array calculated in an intermediate step of the algorithm. For small data sets, creating and operating on the array is likely to be very fast. However, large data sets will generate a large intermediate array that is computationally inefficient.
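
To put a number on that caveat (my own back-of-envelope addition, not part of the original answer): the intermediate boolean array produced by the comparison has shape (rows of X, rows of U, columns), so its size grows with the product of the two row counts -

import numpy as np

U = np.array([[1, 4], [2, 5], [3, 6]])
X = np.array([[1, 4], [3, 6], [7, 8], [1, 4]])

mask = (X[:, None] == U)    # boolean array of shape (4, 3, 2) here
print(mask.shape)           # (4, 3, 2); in general (len(X), len(U), n_cols)
# With millions of rows on both sides this intermediate array would not
# fit in memory, which is exactly the caveat quoted above.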

Mazdak
  • Why downvote? Is there anything wrong with this answer that I can improve? – Mazdak Apr 22 '18 at 10:42
  • Just because the performance isn't the *best* isn't a reason to downvote. The point is the best answer rises to the top through upvotes. +1. – jpp Apr 22 '18 at 10:51
  • @Kasramvd Not me and agree with jpp! – Divakar Apr 22 '18 at 10:53
  • @jpp This solution is completely vectorized but definitely might not be the best. Actually, nothing is the best, lol. But regardless of that, this goes against SO's downvoting guidance: one should only downvote if the answer is wrong or not related to the given question. – Mazdak Apr 22 '18 at 10:54
  • I could guess that @Divakar ;)) – Mazdak Apr 22 '18 at 10:56
  • This works fine for small arrays. However, for arrays of size ~100k or bigger I get the following error message: "..\Anaconda3\lib\site-packages\ipykernel_launcher.py:12: DeprecationWarning: elementwise == comparison failed; this will raise an error in the future. if sys.path[0] == '': ... AttributeError: 'bool' object has no attribute 'all' " (in the np.where call) – user2224350 Apr 22 '18 at 12:51
  • @user2224350 Regarding that error, it seems you're passing the arrays in the wrong order. As for performance on large arrays, as mentioned in the documentation: while this is very efficient in terms of lines of code, it may or may not be computationally efficient. The issue is the three-dimensional diff array calculated in an intermediate step of the algorithm. For small data sets, creating and operating on the array is likely to be very fast. However, large data sets will generate a large intermediate array that is computationally inefficient. – Mazdak Apr 22 '18 at 13:26
  • @Kasramvd Not my downvote but FWIW I really don't think this "violates SO's rule of downvoting". You can downvote whatever you please. And when the OP's question's first line is "I have 2 huge NumPy arrays" your solution doesn't look promising. – miradulo Apr 22 '18 at 14:37
  • @miradulo It does, but that's not the point. You're right that *you can downvote whatever you please*, and that's an obvious fact, you **can**, but that doesn't mean you should. Another point is that this solution is more straightforward and understandable than the other NumPy-based approaches (no offense intended, though) and answers the question quite correctly, just not in a blazing fast manner. This code has many advantages, one of which is solving the problem in a vectorized way, plus it shows the advantages and disadvantages of broadcasting. – Mazdak Apr 22 '18 at 15:49
  • Sure. My point is just that your answer patently will not work for large arrays. Thus claiming someone is breaking rules by downvoting you is strange. – miradulo Apr 22 '18 at 17:01