Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:
A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]
print(repr(A))
# array([[ 7, 11, 6],
# [ 4, 10, 8],
# [ 9, 2, 0],
# [ 1, 3, 5]])
print(repr(B))
# array([[ 1, 3, 5],
# [ 4, 10, 8],
# [ 7, 11, 6],
# [ 9, 2, 0]])
We want to recover a set of indices, idx
, such that A[idx] == B
. This will be a unique mapping if and only if A and B contain no repeated rows.
One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A
and B
as 1D arrays using an np.void
dtype that treats each row as a single element:
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a) # indices that sort the rows of A in lexical order
Now we can use np.searchsorted
to perform a binary search for where each row in B would fall within the sorted version of A:
# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)
The mapping from A→B can be expressed as a composite of A→As→B
a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
As a single function:
def find_row_mapping(A, B):
"""
Given A and B, where B is a copy of A permuted over the first dimension, find
a set of indices idx such that A[idx] == B.
This is a unique mapping if and only if there are no repeated rows in A and B.
Arguments:
A, B: n-dimensional arrays with same shape and dtype
Returns:
idx: vector of indices into the rows of A
"""
if not (A.shape == B.shape):
raise ValueError('A and B must have the same shape')
if not (A.dtype == B.dtype):
raise TypeError('A and B must have the same dtype')
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a)
as_to_b = a.searchsorted(b, sorter=a_to_as)
return a_to_as.take(as_to_b)
Benchmark:
In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop
*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.