3

I am trying to figure out a better way to check if two 2D arrays contain the same rows. Take the following case for a short example:

>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([[6, 7, 8],
       [3, 4, 5],
       [0, 1, 2]])

In this case b=a[::-1]. To check if two rows are equal:

>>>a=a[np.lexsort((a[:,0],a[:,1],a[:,2]))]
>>>b=b[np.lexsort((b[:,0],b[:,1],b[:,2]))]
>>> np.all(a-b==0)
True

This is great and fairly fast. However the issue comes about when two rows are "close":

array([[-1.57839867  2.355354   -1.4225235 ],
       [-0.94728367  0.         -1.4225235 ],
       [-1.57839867 -2.355354   -1.4225215 ]]) <---note ends in 215 not 235
array([[-1.57839867 -2.355354   -1.4225225 ],
       [-1.57839867  2.355354   -1.4225225 ],
       [-0.94728367  0.         -1.4225225 ]])

Within a tolerance of 1E-5 these two arrays are equal by row, but the lexsort will tell you otherwise. This can be solved by a different sorting order but I would like a more general case.

I was toying with the idea of:

a=a.reshape(-1,1,3)
>>> a-b
array([[[-6, -6, -6],
        [-3, -3, -3],
        [ 0,  0,  0]],

       [[-3, -3, -3],
        [ 0,  0,  0],
        [ 3,  3,  3]],

       [[ 0,  0,  0],
        [ 3,  3,  3],
        [ 6,  6,  6]]])
>>> np.all(np.around(a-b,5)==0,axis=2)
array([[False, False,  True],
       [False,  True, False],
       [ True, False, False]], dtype=bool)
>>>np.all(np.any(np.all(np.around(a-b,5)==0,axis=2),axis=1))
True

This doesn't tell you if the arrays are equal by row just if all points in b are close to a value in a. The number of rows can be several hundred and I need to do it quite a bit. Any ideas?

Daniel
  • 19,179
  • 7
  • 60
  • 74
  • 1
    I will throw in `scipy.spatial.cKDTree` (possibly KDTree, depending on scipy version and usage), for an approach, which might be more straight forward. – seberg Feb 19 '13 at 23:16
  • This is exactly what I was looking for. Knew there had to be a better way. – Daniel Feb 19 '13 at 23:54

1 Answers1

1

Your last code doesn't do what you think it is doing. What it tells you is whether every row in b is close to a row in a. If you change the axis you use for the outer calls to np.any and np.all, you could check whether every row in a is close to some row in b. If both every row in b is close to a row in a, and every row in a is close to a row in b, then the sets are equal. Probably not very computationally efficient, but probably very fast in numpy for moderately sized arrays:

def same_rows(a, b, tol=5) :
    rows_close = np.all(np.round(a - b[:, None], tol) == 0, axis=-1)
    return (np.all(np.any(rows_close, axis=-1), axis=-1) and
            np.all(np.any(rows_close, axis=0), axis=0))

>>> rows, cols = 5, 3
>>> a = np.arange(rows * cols).reshape(rows, cols)
>>> b = np.arange(rows)
>>> np.random.shuffle(b)
>>> b = a[b]
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
>>> b
array([[ 9, 10, 11],
       [ 3,  4,  5],
       [ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
>>> same_rows(a, b)
True
>>> b[0] = b[1]
>>> b
array([[ 3,  4,  5],
       [ 3,  4,  5],
       [ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])
>>> same_rows(a, b) # not all rows in a are close to a row in b
False

And for not too big arrays, performance is reasonable, even though it is having to build an array of (rows, rows, cols):

In [2]: rows, cols = 1000, 10

In [3]: a = np.arange(rows * cols).reshape(rows, cols)

In [4]: b = np.arange(rows)

In [5]: np.random.shuffle(b)

In [6]: b = a[b]

In [7]: %timeit same_rows(a, b)
10 loops, best of 3: 103 ms per loop
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • I did mention that was one issue with the code the I posted. This is essentially I ended up writing with a few additional arguments. I added in a distance formula to give a better idea of how close a point was and used the lexsort method to drastically reduce the number of rows that needed to be passed to this type of check. I will check your answer if nobody comes up with a better idea by tomorrow. – Daniel Feb 19 '13 at 22:35