2

I have a large 2xn array A and a smaller 2xn array B. All columns in B can be found in A. I'm looking to find the indices of A by matching columns in B. For example,

import numpy

A = numpy.array([
    [101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
    [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])

B = numpy.array([
    [101, 103, 109],
    [102, 224, 212]
])

The answer that I'm looking for is [0,6,14]. Interested to know if there is an efficient way rather than looping. Thanks!

Veedrac
  • 58,273
  • 15
  • 112
  • 169
sgoh
  • 23
  • 4

4 Answers4

4

There is hardly a good answer for your problem: numpy is not very well suited for this type of problems, although it can be done. To do subarray searches, if your dtype is not floating point, the method here is probably your best bet. You would start with something like:

AA = np.ascontiguousarray(A.T)
BB = np.ascontiguousarray(B.T)

dt = np.dtype((np.void, AA.dtype.itemsize * AA.shape[1]))
AA = AA.view(dt).ravel()
BB = BB.view(dt).ravel()

And now it is just about searching for the items in a 1D array in another 1D array, which is pretty straightforward, assuming there are no repeated columns in the original A array.

If either of your arrays is really small, as in your example, it is going to be hard to beat something like:

indices = np.argmax(AA == BB[:, None], axis = 1)

But for larger datasets, it is going to be hard to beat a sorting approach:

sorter = np.argsort(AA)
sorted_indices = np.searchsorted(AA, BB, sorter=sorter)
indices = sorter[sorted_indices]
Community
  • 1
  • 1
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • Thanks @Jaime, my arrays are not too big but there may be repeated columns. – sgoh Jun 14 '14 at 12:22
  • Do you need to know where does each column of `B` come up in `A`? Or do you simply need to know which columns of `A` are also in `B`? – Jaime Jun 14 '14 at 14:52
1

Here's a way, given the arrays are pre-sorted:

import numpy

A = numpy.array([
    [101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
    [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])

B = numpy.array([
    [101, 103, 109],
    [102, 224, 212]
])

def search2D(A, B):
    to_find_and_bounds = zip(
        B[1],
        numpy.searchsorted(A[0], B[0], side="left"),
        numpy.searchsorted(A[0], B[0], side="right")
    ) 

    for to_find, left, right in to_find_and_bounds:
        offset = numpy.searchsorted(A[1, left:right], to_find)
        yield offset + left

list(search2D(A, B))
#>>> [0, 6, 14]

This is O(len B · log len A).

For unsorted arrays, you can perform an indirect sort:

sorter = numpy.lexsort(A[::-1])
sorted_copy = A.T[sorter].T

sorter[list(search2D(sorted_copy, B))]
#>>> array([ 3,  6, 14])

If you need multiple results from one index, try

for to_find, left, right in to_find_and_bounds:
    offset_left = numpy.searchsorted(A[1, left:right], to_find, side="left")
    offset_right = numpy.searchsorted(A[1, left:right], to_find, side="right")
    yield from range(offset_left + left, offset_right + left)
Veedrac
  • 58,273
  • 15
  • 112
  • 169
  • @sorry, I had just deleted the comment seeing your edit – gg349 Jun 13 '14 at 13:26
  • Thanks @Veedrac! This is very helpful. Being a novice, could you advise what changes are required if both arrays are 1D? – sgoh Jun 14 '14 at 12:19
  • 1
    @sgoh If both are 1D, just `for _, left, right in to_find_and_bounds: yield from range(left, right)`. – Veedrac Jun 14 '14 at 12:23
0

You could use a string-based comparison such as this one using np.char.array

ca = np.char.array(a)[0,:] + np.char.array(a)[1,:]
cb = np.char.array(b)[0,:] + np.char.array(b)[1,:]
np.where(np.in1d(ca, cb))[0]
#array([ 0,  6, 14], dtype=int64)

EDIT:

you can also manipulate the array dtype in order to transform the a array to one with shape=(18,) where each element contains the data of the two elements of the corresponding column. The same idea can be applied to array b, obtaining shape=(3,). Then you use np.where(np.in1d()) to get the indices:

nrows = a.shape[0]
ta = np.ascontiguousarray(a.T).view(np.dtype((np.void, a.itemsize*nrows))).flatten()
tb = np.ascontiguousarray(b.T).view(np.dtype((np.void, b.itemsize*nrows))).flatten()
np.where(np.in1d(ta, tb))[0]
#array([ 0,  6, 14], dtype=int64)

The idea is similar to the string-based approach.

Saullo G. P. Castro
  • 56,802
  • 26
  • 179
  • 234
-2

Numpy has all you need. I assume the arrays are not sorted, conversly you can improve the following code as you prefer:

import numpy as np

a = np.array([[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
     [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]])
b = np.array([[101, 103, 109],
     [102, 224, 212]])

idxs = []

for i in range(np.shape(b)[1]):
    for j in range(np.shape(a)[1]):
        if np.array_equal(b[:,i],a[:,j]):
            idxs.append(j)

print idxs
Emisilve86
  • 354
  • 2
  • 8