1

Gvien two arrays with identical entries but different orders, is there a way to find the index mapping which would rearrange one array such that it is equivalent to the other?

i.e.

x1 = np.array([[1,2],
               [3,4],
               [5,6],
               [7,8],
               [9,10]])

x2 = np.array([[3,4],
               [7,8],
               [1,2],
               [5,6],
               [9,10]])

The mapping from x1 to x2 would be np.array([1,3,0,2,4]), and the mapping from x2 to x1 would be np.array([2,0,3,1,4])

Note, in this example x1 is sorted, however I do not wish to make this assumption in general.

jonnybolton16
  • 193
  • 2
  • 11
  • In addition to my review of possible ways to solve this problem in `numpy`, you might like to check out `numpy_indexed` module. It has a method called `indices` dedicated for solution of problems like yours. – mathfux Feb 25 '22 at 05:33
  • There are also quite a lot related questions [like this](https://stackoverflow.com/questions/42232540/how-to-find-indices-of-a-reordered-numpy-array/42235354#42235354) if you use `npi.indexed` as a keyword in SO search. – mathfux Feb 25 '22 at 05:43

4 Answers4

1

Assuming the first array's elements are sorted as in the example, an efficient option would be to view the two-column arrays as 1d arrays of tuples, and use np.searchsorted to find the indices where these tuples are in x2_view:

x1_view = x1.view(dtype='i,i').ravel()
x2_view = x2.view(dtype='i,i').ravel()

np.searchsorted(x1_view, x2_view)
# array([1, 3, 0, 2, 4], dtype=int64)

In the case the first array is not sorted, you can pass np.searchsorted a sorter:

x1_sorted_view = np.argsort(x1_view)
np.searchsorted(x1_view, x2_view, sorter=x1_sorted_view)

For the second case, you only need np.argsort on x2:

np.argsort(x2_view)
# array([2, 0, 3, 1, 4], dtype=int64)
yatu
  • 86,083
  • 12
  • 84
  • 139
  • Thanks. Perhaps I used a poor example because I don't want to assume either is sorted... I could just sort both using argsort/lexsort but I'm trying to avoid that. – jonnybolton16 Feb 23 '22 at 15:03
  • Then this solution could work if you used argsort for the first array, since searchsorted requires the first array to be sorted. But there aren't many efficient alternatives to this without using broadcasting @jonnybolton16 – yatu Feb 23 '22 at 15:23
  • `np.searchsorted` it's not working for me as expected. It returns array of 10 indices instead of 5 – mathfux Feb 23 '22 at 15:35
  • @jonnybolton16 Try using `sorter` in `searchsorted` – yatu Feb 23 '22 at 15:55
  • Should work if your using the views @mathfux – yatu Feb 23 '22 at 15:56
  • Well, it's still not working for me. It returns `array([2, 3, 6, 7, 0, 1, 4, 5, 8, 9])` instead of `array([1, 3, 0, 2, 4], dtype=int64)`. Might there be some changes in current versions of `numpy`? I'm using `1.20.3` – mathfux Feb 23 '22 at 16:28
  • Finally, I managed to solve my issue. For some reasons `x1` and `x2` results in arrays of `dtype = np.int64`. I'm using Linux currently and this is unexpected in comparison of `numpy` behaviour when I was using Windows (default `dtype` was `np.int32` before). A workaround in my case would be to use `dtype='i8,i8'` in views of `x1` and `x2`. Therefore in general, it is not safe and you need to use `dtype = np.dtype([('', x1.dtype), ('', x1.dtype)])` – mathfux Feb 23 '22 at 20:28
0

There are several ways to manipulate index sets of x1 and x2 in order to arrive at the mappings you need. A use case could be summarised as follows:

def find_mappings(sidx1, sidx2):
    '''Input: two set of indices that sorts arrays x1 and x2
       Output: both maps x1->x2 and x2->x1'''
    x1_to_x2 = np.empty(len(x1), dtype=x1.dtype)
    x2_to_x1 = np.empty(len(x1), dtype=x1.dtype)
    x1_to_x2[sidx2] = sidx1
    x2_to_x1[sidx1] = sidx2
    return x1_to_x2, x2_to_x1

I will review the most common ways to pass two index sets sidx1 and sidx2

Way 1 using np.lexsort as sorting indices:

sidx1 = np.lexsort(x1.T[::-1])
sidx2 = np.lexsort(x2.T[::-1])

find_mappings(sidx1, sidx2)
>>> (array([1, 3, 0, 2, 4]), array([2, 0, 3, 1, 4]))

Way 2 using np.argsort of reduced dimensions as sorting indices:

x1_dimreduce = np.ravel_multi_index(x1.T, np.max(x1, axis=0)+1)
x2_dimreduce = np.ravel_multi_index(x2.T, np.max(x2, axis=0)+1)
sidx1 = np.argsort(x1_dimreduce)
sidx2 = np.argsort(x2_dimreduce)

find_mappings(sidx1, sidx2)
>>> (array([1, 3, 0, 2, 4]), array([2, 0, 3, 1, 4]))

Another variation of way 2 (reducing dimensions by hands works faster and could be optimised further in numba)

x1_dimreduce = x1[:,0] * (np.max(x1[:,1])+1) + x1[:,1]
x2_dimreduce = x2[:,0] * (np.max(x2[:,1])+1) + x2[:,1]

Way 3 using np.searchsorted in order to find x1_to_x2 and then find inverse mapping x2_to_x1 automatically (credits to @yatu as well):

x1_view = x1.view(dtype=np.dtype([('', x1.dtype), ('', x1.dtype)])).ravel()
x2_view = x2.view(dtype=np.dtype([('', x2.dtype), ('', x2.dtype)])).ravel()

argidx = np.argsort(x1_view)
sidx = np.searchsorted(x1_view, x2_view, sorter=argidx)

sidx1 = argidx[sidx]
sidx2 = np.arange(len(sidx))

find_mappings(sidx1, sidx2)
>>> (array([1, 3, 0, 2, 4]), array([2, 0, 3, 1, 4]))

Another variation of way 3 (simplification of way 3 using sidx and argidx arguments instead of sidx1 and sidx2)

def find_mappings2(sidx, argidx):
    '''reduces some excessive commands met in `find_mapping`'''
    x1_to_x2 = argidx[sidx]
    x2_to_x1 = np.empty(len(x1), dtype=x1.dtype)
    x2_to_x1[x1_to_x2] = np.arange(len(sidx))
    return x1_to_x2, x2_to_x1

x1_view = x1.view(dtype=np.dtype([('', x1.dtype), ('', x1.dtype)])).ravel()
x2_view = x2.view(dtype=np.dtype([('', x2.dtype), ('', x2.dtype)])).ravel()

argidx = np.argsort(x1_view)
sidx = np.searchsorted(x1_view, x2_view, sorter=argidx)

np.array_equal(find_mapping(sidx1, sidx2), find_mapping2(sidx, argidx))
>>> True

Note that every of these ways could be minimized in case any of arrays x1 and x2 preserves order.

Use case for better understanding

x1_to_x2, x2_to_x1 = find_mapping(sidx1, sidx2)
>>> np.array_equal(x1[x1_to_x2], x2)
True
>>> np.array_equal(x2[x2_to_x1], x1)
True

Runtime

Variant of Way 2 appears to be a winner. Moreover, it's open for further optimizations:

import benchit
%matplotlib inline
benchit.setparams(rep=5)

sizes = [3, 10, 30, 100, 300, 900, 3000, 9000, 30000, 90000, 300000, 900000]

def _generate_x1_x2(N):
    '''return two randomly ordered arrays x1 and x2 of length N'''
    arr = np.transpose(np.unravel_index(np.arange(N), (int(N**0.5)+1, int(N**0.5)+1)))
    rng = np.random.default_rng()
    return arr[rng.permutation(N)], arr[rng.permutation(N)]

def _find_mappings(sidx1, sidx2):
    x1_to_x2, x2_to_x1 = np.empty(len(sidx1), dtype=sidx1.dtype), np.empty(len(sidx2), dtype=sidx2.dtype)
    x1_to_x2[sidx2] = sidx1
    x2_to_x1[sidx1] = sidx2
    return x1_to_x2, x2_to_x1

def _find_mappings2(sidx, argidx):
    x1_to_x2 = argidx[sidx]
    x2_to_x1 = np.empty(len(x1_to_x2), dtype=x1_to_x2.dtype)
    x2_to_x1[x1_to_x2] = np.arange(len(sidx))
    return x1_to_x2, x2_to_x1

def lexsort_map(x1, x2):
    sidx1 = np.lexsort(x1.T[::-1])
    sidx2 = np.lexsort(x2.T[::-1])
    return _find_mappings(sidx1, sidx2)

def argsort_map(x1, x2):
    x1_dimreduce = np.ravel_multi_index(x1.T, np.max(x1, axis=0)+1)
    x2_dimreduce = np.ravel_multi_index(x2.T, np.max(x2, axis=0)+1)
    sidx1 = np.argsort(x1_dimreduce)
    sidx2 = np.argsort(x2_dimreduce)
    return _find_mappings(sidx1, sidx2)

def argsort_map_v2(x1, x2):
    x1_dimreduce = x1[:,0] * (np.max(x1[:,1])+1) + x1[:,1]
    x2_dimreduce = x2[:,0] * (np.max(x2[:,1])+1) + x2[:,1]
    sidx1 = np.argsort(x1_dimreduce)
    sidx2 = np.argsort(x2_dimreduce)
    return _find_mappings(sidx1, sidx2)

def searchsort_map(x1, x2):
    x1_view = x1.view(dtype=np.dtype([('', x1.dtype), ('', x1.dtype)])).ravel()
    x2_view = x2.view(dtype=np.dtype([('', x2.dtype), ('', x2.dtype)])).ravel()
    argidx = np.argsort(x1_view)
    sidx = np.searchsorted(x1_view, x2_view, sorter=argidx)
    sidx1 = argidx[sidx]
    sidx2 = np.arange(len(sidx))
    return _find_mappings(sidx1, sidx2)

def searchsort_map_v2(x1, x2):
    x1_view = x1.view(dtype=np.dtype([('', x1.dtype), ('', x1.dtype)])).ravel()
    x2_view = x2.view(dtype=np.dtype([('', x2.dtype), ('', x2.dtype)])).ravel()
    argidx = np.argsort(x1_view)
    sidx = np.searchsorted(x1_view, x2_view, sorter=argidx)
    return _find_mappings2(sidx, argidx)

fns = [lexsort_map, argsort_map, argsort_map_v2, searchsort_map, searchsort_map_v2]
in_ = {s: generate_x1_x2(s) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Length of x1 and x2')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

enter image description here

Update on 02/25/2022

There is a nice module numpy_indexed by @EelcoHoogendoorn (pip install numpy-indexed) written on top of numpy that extends to more advanced actions of indexing, grouping and set manipulations. You might like also to check out numpy_indexed.indices, Line 115 method:

numpy_indices as npi
x1_to_x2 = npi.indices(x1, x2)
x2_to_x1 = np.empty_like(x1_to_x2)
x2_to_x1[x1_to_x2] = np.arange(len(x2_to_x1))
mathfux
  • 5,759
  • 1
  • 14
  • 34
0

Using array comprehension this is what I do:

x1_in_x2 = [x2.index(e) for e in x1]

I do not know if this is what you expect. Please confirm If I understood you correctly.

Liam Hanninen
  • 1,525
  • 2
  • 19
  • 37
RubyLearning
  • 83
  • 1
  • 7
  • This works but only for lists; I guess I could use the `tolist()` method on the two arrays, and then turn the answer back into an array. But I'll wait to see if there is a numpy solution before accepting this answer. Thanks! – jonnybolton16 Feb 23 '22 at 15:08
-1

I don't know much about numpy, but here's one janky way to do it with just Python lists:

x1 = [[1, 2],
      [3, 4],
      [5, 6],
      [7, 8],
      [9, 10]]

x2 = [[3, 4],
      [7, 8],
      [1, 2],
      [5, 6],
      [9, 10]]

map_1_to_2 = []
map_2_to_1 = []

for element_1, element_2 in zip(x1, x2):
    map_1_to_2.append(x2.index(element_1))
    map_2_to_1.append(x1.index(element_2))

print(map_1_to_2)
print(map_2_to_1)
eccentricOrange
  • 876
  • 5
  • 18