Numpy find indices of matching columns

Question

I have a large 2xn array A and a smaller 2xn array B. All columns in B can be found in A. I'm looking to find the indices of A by matching columns in B. For example,

import numpy

A = numpy.array([
    [101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
    [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])

B = numpy.array([
    [101, 103, 109],
    [102, 224, 212]
])

The answer that I'm looking for is [0,6,14]. Interested to know if there is an efficient way rather than looping. Thanks!

Hint: numpy arrays support column slicing. Check [this](http://wiki.scipy.org/Tentative_NumPy_Tutorial#head-864862d3f2bb4c32f04260fac61eb4ef34788c4c) — Jaime Rodríguez-Guerra, Jun 13 '14 at 12:23

score 4 · Answer 1 · edited May 23 '17 at 12:14

There is hardly a good answer for your problem: numpy is not very well suited for this type of problems, although it can be done. To do subarray searches, if your dtype is not floating point, the method here is probably your best bet. You would start with something like:

AA = np.ascontiguousarray(A.T)
BB = np.ascontiguousarray(B.T)

dt = np.dtype((np.void, AA.dtype.itemsize * AA.shape[1]))
AA = AA.view(dt).ravel()
BB = BB.view(dt).ravel()

And now it is just about searching for the items in a 1D array in another 1D array, which is pretty straightforward, assuming there are no repeated columns in the original A array.

If either of your arrays is really small, as in your example, it is going to be hard to beat something like:

indices = np.argmax(AA == BB[:, None], axis = 1)

But for larger datasets, it is going to be hard to beat a sorting approach:

sorter = np.argsort(AA)
sorted_indices = np.searchsorted(AA, BB, sorter=sorter)
indices = sorter[sorted_indices]

Thanks @Jaime, my arrays are not too big but there may be repeated columns. — sgoh, Jun 14 '14 at 12:22
Do you need to know where does each column of `B` come up in `A`? Or do you simply need to know which columns of `A` are also in `B`? — Jaime, Jun 14 '14 at 14:52

Veedrac · Accepted Answer · 2014-06-13T13:22:38.107

Here's a way, given the arrays are pre-sorted:

import numpy

A = numpy.array([
    [101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
    [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])

B = numpy.array([
    [101, 103, 109],
    [102, 224, 212]
])

def search2D(A, B):
    to_find_and_bounds = zip(
        B[1],
        numpy.searchsorted(A[0], B[0], side="left"),
        numpy.searchsorted(A[0], B[0], side="right")
    ) 

    for to_find, left, right in to_find_and_bounds:
        offset = numpy.searchsorted(A[1, left:right], to_find)
        yield offset + left

list(search2D(A, B))
#>>> [0, 6, 14]

This is O(len B · log len A).

For unsorted arrays, you can perform an indirect sort:

sorter = numpy.lexsort(A[::-1])
sorted_copy = A.T[sorter].T

sorter[list(search2D(sorted_copy, B))]
#>>> array([ 3,  6, 14])

If you need multiple results from one index, try

for to_find, left, right in to_find_and_bounds:
    offset_left = numpy.searchsorted(A[1, left:right], to_find, side="left")
    offset_right = numpy.searchsorted(A[1, left:right], to_find, side="right")
    yield from range(offset_left + left, offset_right + left)

Thanks @Veedrac! This is very helpful. Being a novice, could you advise what changes are required if both arrays are 1D? — sgoh, Jun 14 '14 at 12:19
@sgoh If both are 1D, just `for _, left, right in to_find_and_bounds: yield from range(left, right)`. — Veedrac, Jun 14 '14 at 12:23

Saullo G. P. Castro · Answer 3 · 2014-06-13T13:19:34.483

0

You could use a string-based comparison such as this one using np.char.array

ca = np.char.array(a)[0,:] + np.char.array(a)[1,:]
cb = np.char.array(b)[0,:] + np.char.array(b)[1,:]
np.where(np.in1d(ca, cb))[0]
#array([ 0,  6, 14], dtype=int64)

EDIT:

you can also manipulate the array dtype in order to transform the a array to one with shape=(18,) where each element contains the data of the two elements of the corresponding column. The same idea can be applied to array b, obtaining shape=(3,). Then you use np.where(np.in1d()) to get the indices:

nrows = a.shape[0]
ta = np.ascontiguousarray(a.T).view(np.dtype((np.void, a.itemsize*nrows))).flatten()
tb = np.ascontiguousarray(b.T).view(np.dtype((np.void, b.itemsize*nrows))).flatten()
np.where(np.in1d(ta, tb))[0]
#array([ 0,  6, 14], dtype=int64)

The idea is similar to the string-based approach.

edited Jun 13 '14 at 13:19

answered Jun 13 '14 at 12:26

Saullo G. P. Castro

56,802
26
179
234

1

With `np.in1d` you lose track of which column of `B` corresponds to which column of `A`. This may not be a requirement of the OP, but it is worth noting. – Jaime Jun 13 '14 at 13:27
@Jaime thank you... as you said, for what the OP needs `np.in1d()` is more straightforward – Saullo G. P. Castro Jun 13 '14 at 14:11
Thanks @SaulloCastro, you have given me some good ideas to explore. – sgoh Jun 14 '14 at 12:20

score -2 · Answer 4 · answered Jun 13 '14 at 12:48

-2

Numpy has all you need. I assume the arrays are not sorted, conversly you can improve the following code as you prefer:

import numpy as np

a = np.array([[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
     [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]])
b = np.array([[101, 103, 109],
     [102, 224, 212]])

idxs = []

for i in range(np.shape(b)[1]):
    for j in range(np.shape(a)[1]):
        if np.array_equal(b[:,i],a[:,j]):
            idxs.append(j)

print idxs

answered Jun 13 '14 at 12:48

Emisilve86

354
2
8

3

This is not a numpy solution, you could do the same with simple python lists as input. – gg349 Jun 13 '14 at 12:49
1

Plus this is `O(n²)` and slow. – Veedrac Jun 13 '14 at 12:49
Read above: "I assume arrays are not sorted". I cannot avoid O(n^2). – Emisilve86 Jun 13 '14 at 12:53
1

Sorting is `O(n log n)`, so no, you can. – Veedrac Jun 13 '14 at 12:54
Yes, I know, but in this way indices change on both arrays. – Emisilve86 Jun 13 '14 at 12:59

Numpy find indices of matching columns

4 Answers4