I have two 2D arrays, `a` and `b`, and I want to find the index in `b` of every row of `a`. I followed the solution proposed here.

The problem is that my arrays contain duplicates as you can see here:

import numpy as np

# The shape of b is (50, 2)
b = np.array([[ 0,  1],[ 2,  3],[ 4,  5],[ 6,  7], [ 0,  1],
             [10, 11], [12, 13], [14, 15], [16, 17], [10, 11],
             [20, 21], [22, 23], [24, 25], [26, 27], [20, 21],
             [30, 31], [32, 33], [34, 35], [36, 37], [30, 31],
             [40, 41], [42, 43], [44, 45], [46, 47], [40, 41],
             [50, 51], [52, 53], [54, 55], [56, 57], [50, 51],
             [60, 61], [62, 63], [64, 65], [66, 67], [60, 61],
             [70, 71], [72, 73], [74, 75], [76, 77], [70, 71],
             [80, 81], [82, 83], [84, 85], [86, 87], [80, 81],
             [90, 91], [92, 93], [94, 95], [96, 97], [90, 91]])

# The shape of a is (20,2)
a = np.array([[ 0,  1],[ 2,  3], [ 4,  5],[ 6,  7],[ 0,  1],
       [50, 51],[52, 53], [54, 55], [56, 57], [50, 51],
       [20, 21], [22, 23], [24, 25], [26, 27], [20, 21],
       [70, 71], [72, 73], [74, 75], [76, 77], [70, 71]])

Now when I try something like this:

# See the link above approach 2
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def argwhere_nd_searchsorted(a,b):
    A,B = view1D(a,b)
    sidxB = B.argsort()
    mask = np.isin(A,B)
    cm = A[mask]
    idx0 = np.flatnonzero(mask)
    idx1 = sidxB[np.searchsorted(B,cm, sorter=sidxB)]
    return idx0, idx1 # idx0 : indices in A, idx1 : indices in B

args0, args1 = argwhere_nd_searchsorted(a,b)

This results in:

# args0
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# args1
array([ 0,
        1,
        2,
        3,
        0,  # this should be 4
       25,
       26,
       27,
       28,
       25,  # this should be 29
       10,
       11,
       12,
       13,
       10,  # this should be 14
       39,  # this should be 35
       36,
       37,
       38,
       39])
# if we check
np.equal(b[args1],a).all() # This returns True

As you can see, the problem is that the indices marked with comments in args1 are repeated: each duplicated row of a is mapped to the same row of b. My expected result is shown in those comments.
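A minimal 1D sketch of the same behaviour, using small made-up arrays `A` and `B` rather than the data above, suggests the repetition comes from `np.searchsorted` returning the first matching slot in sorted order for every copy of a value:

```python
import numpy as np

# Hypothetical 1D stand-ins for the row views: 10 occurs twice in B.
B = np.array([10, 20, 30, 10])
A = np.array([10, 20, 10])

# A stable sort keeps equal values in their original order: [0, 3, 1, 2].
sidx = B.argsort(kind="stable")

# With the default side='left', every 10 lands on the first 10 in sorted order.
pos = np.searchsorted(B, A, sorter=sidx)
idx = sidx[pos]
print(idx)  # [0 1 0] -> the second 10 maps to index 0 again, not 3
```

With the default (non-stable) sort used in the code above, which of the duplicates comes first in sorted order is not guaranteed, which would explain why the repeated index is sometimes the first copy (0, 25, 10) and sometimes the last one (39).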

Any help is appreciated

  • There are duplicates. So, how would you do a match? How would you decide which one to match against? – Divakar Nov 06 '19 at 16:07
  • @Divakar: Suppose that I want to update the array `b` as follows: `b[args1] = another array`. This will update one instance of the duplicate points twice and leave the others unchanged. So I am looking for a solution to this issue. –  Nov 06 '19 at 16:17
  • Don't think you are getting my point. You are asking - `I want to match rows of array a in b and get a row indices map using numpy.`. Now there are duplicate rows in both a and b. Hence, my earlier comment. – Divakar Nov 06 '19 at 16:19
  • @Divakar: I updated the question. I mean, in the code `args1` shouldn't contain the index of the first duplicated point repeated –  Nov 06 '19 at 16:24
  • Not sure, but your problem might be solved with a mask. So, you can use [`isin_nd`](https://stackoverflow.com/a/54792426/) to get the mask, which can be used to mask and assign into `b`. – Divakar Nov 06 '19 at 16:26
  • @Divakar: I tried that but it returns a wrong result –  Nov 06 '19 at 16:34

1 Answer

We could append one more column of IDs that numbers the repeated occurrences of each row, and then run the same steps on the extended arrays. We will use pandas to get those IDs, since it's just easier that way. Hence, simply do -

import pandas as pd

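def assign_duplbl(a):
    # Label each row with its running occurrence count among identical
    # rows: 1 for the first copy, 2 for the second, and so on.
    df = pd.DataFrame(a)
    df['num'] = 1
    # Group by all original columns; the cumsum of the ones is the count.
    return df.groupby(list(range(a.shape[1]))).cumsum().values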

a1 = np.hstack((a,assign_duplbl(a)))
b1 = np.hstack((b,assign_duplbl(b)))
args0, args1 = argwhere_nd_searchsorted(a1,b1)
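
To see what the extra column does, here is a small sketch on a made-up array `small` (not the data from the question). It repeats `assign_duplbl` so that it runs on its own, and checks that the extended rows become unique, which is what lets the matching step tell the copies apart:

```python
import numpy as np
import pandas as pd

def assign_duplbl(a):
    # Same helper as above: running occurrence count per identical row.
    df = pd.DataFrame(a)
    df['num'] = 1
    return df.groupby(list(range(a.shape[1]))).cumsum().values

# Hypothetical example: the row [0, 1] appears three times.
small = np.array([[0, 1], [2, 3], [0, 1], [0, 1]])

labels = assign_duplbl(small)            # shape (4, 1)
print(labels.ravel())                    # [1 1 2 3]

# Appending the labels turns every row into a distinct (row, count) key.
extended = np.hstack((small, labels))
print(np.unique(extended, axis=0).shape[0] == len(extended))  # True
```

Because the k-th copy of a row in `a` gets the same label as the k-th copy in `b`, duplicates are matched in the order in which they appear.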
Divakar
  • The first version of the answer works fine. But with the edited one, I get an error: `'range' object is not callable` –  Nov 06 '19 at 17:24
  • @IamNotaMathematician Check out the edited code please. – Divakar Nov 06 '19 at 17:26
  • As a suggestion, you could also consider enhancing your answer [here](https://stackoverflow.com/questions/55612617/match-rows-of-two-2d-arrays-and-get-a-row-indices-map-using-numpy) to account for duplicates. –  Nov 06 '19 at 17:50
  • @IamNotaMathematician Well, the way to handle duplicates could be case-specific. For example, someone might want to skip after the first duplicate is encountered. In your case, you want to consider the sequence in which they appear. So, I will leave it at that. It's already linked to that Q&A through the question. So, that's good enough I think. – Divakar Nov 06 '19 at 17:53
  • Yes, I agree, but in my case I have a version of your `argwhere_nd_searchsorted` function with an optional argument `consider_duplicated` which defaults to `False`. –  Nov 06 '19 at 17:59
  • @IamNotaMathematician That's okay. You can keep whichever modified version works for you. But I will leave it at that, so the OP can decide what to do with the duplicates. This post lists one option. Others could be explored as they come through. – Divakar Nov 06 '19 at 18:02
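
The two duplicate policies discussed in these comments can be sketched behind one flag. The block below is only an illustration under stated assumptions: the function name `match_rows` is hypothetical, the flag name `consider_duplicated` is borrowed from the comment above, and the matching is done with a pandas merge instead of the `searchsorted` approach from the answer.

```python
import numpy as np
import pandas as pd

def match_rows(a, b, consider_duplicated=False):
    # Hypothetical helper: returns (indices in a, indices in b) of matching rows.
    keys = list(range(a.shape[1]))
    da, db = pd.DataFrame(a), pd.DataFrame(b)
    da['ia'] = np.arange(len(a))   # remember original row positions
    db['ib'] = np.arange(len(b))
    if consider_duplicated:
        # Match the k-th copy of a row in a with the k-th copy in b.
        da['occ'] = da.groupby(keys).cumcount()
        db['occ'] = db.groupby(keys).cumcount()
        keys = keys + ['occ']
    else:
        # Every copy in a maps to the first occurrence in b.
        db = db.drop_duplicates(subset=keys)
    m = da.merge(db, on=keys, how='inner').sort_values('ia')
    return m['ia'].to_numpy(), m['ib'].to_numpy()

# Made-up data: the row [0, 1] appears twice in both arrays.
b = np.array([[0, 1], [2, 3], [0, 1]])
a = np.array([[0, 1], [0, 1], [2, 3]])

print(match_rows(a, b)[1])                            # [0 0 1]
print(match_rows(a, b, consider_duplicated=True)[1])  # [0 2 1]
```

With `consider_duplicated=True`, copies in `a` that have no counterpart in `b` (for example a third `[0, 1]`) are dropped by the inner merge; whether that is the desired behaviour is case-specific, as noted above.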