Determine the shuffled indices of two lists/arrays

Question

As a challenge, I've given myself this problem:

Given 2 lists, A, and B, where B is a shuffled version of A, the idea is to figure out the shuffled indices.

For example:

A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

result = [2,   3,    0,      1] 
        A[2]  A[3]   A[0]  A[1]
        ||     ||     ||    ||
        30      2     10    40

Note that ties for identical elements can be resolved arbitrarily.

I've come up with a solution that involves the use of a dictionary to store indices. What other possible solutions does this problem have? A solution using a library also works. Numpy, pandas, anything is fine.

using the list comprehension '[ A.index(x) for x in B ]' output is [2, 3, 0, 1] — Kallz, Aug 23 '17 at 04:59
@AshwiniChaudhary Thanks! Although I would say Divakar's solution is clearer to understand. — cs95, Aug 23 '17 at 05:04
@cᴏʟᴅsᴘᴇᴇᴅ Does [their own answer](https://stackoverflow.com/questions/33678543/finding-indices-of-matches-of-one-array-in-another-array) counts? Note that the accepted solution is first link is also doing the same thing. — Ashwini Chaudhary, Aug 23 '17 at 05:05
@AshwiniChaudhary That looks like a good dup. Think we should close this one. — Divakar, Aug 23 '17 at 05:10

Divakar · Answer 1 · 2017-08-23T05:36:59.107

8

We can make use of np.searchsorted with its optional sorter argument -

sidx = np.argsort(B)
out = sidx[np.searchsorted(B,A, sorter=sidx)]

Sample run -

In [19]: A = [10, 40, 30, 2, 40]
    ...: B = [30, 2, 10, 40]
    ...: 

In [20]: sidx = np.argsort(B)

In [21]: sidx[np.searchsorted(B,A, sorter=sidx)]
Out[21]: array([2, 3, 0, 1, 3])

edited Aug 23 '17 at 05:36

answered Aug 23 '17 at 05:00

Divakar

218,885
19
262
358

Just confirming, how does this handle duplicates? – cs95 Aug 23 '17 at 05:05
@cᴏʟᴅsᴘᴇᴇᴅ Dups in `A`? – Divakar Aug 23 '17 at 05:06
Yep, how is that handled? – cs95 Aug 23 '17 at 05:06
@cᴏʟᴅsᴘᴇᴇᴅ Handles pretty gracefully :) Should I add a sample for that? – Divakar Aug 23 '17 at 05:07
Sure... please do. – cs95 Aug 23 '17 at 05:18

score 3 · Answer 2 · answered Aug 23 '17 at 04:54

3

LOL

pd.Series(A).reset_index().set_index(0).ix[B].T.values[0]
#array([2, 3, 0, 1])

answered Aug 23 '17 at 04:54

DYZ

55,249
10
64
93

Probably should have kept pandas out of this :p – cs95 Aug 23 '17 at 05:02
Keep your word. Or else. – DYZ Aug 23 '17 at 05:03
2

Relax. I did say pandas. :p Here's my +1. – cs95 Aug 23 '17 at 05:04
Congrats on the 10K dude :-) Just noticed you reached it. – Christian Dean Aug 23 '17 at 06:51
A simpler pandas solution could be: `pd.Index(A).get_indexer_for(B)`. – Alex Riley Aug 23 '17 at 08:31

Christian Dean · Accepted Answer · 2017-08-23T05:28:55.153

3

As an improvement over your current solution, you could use collections.defaultdict and avoid dict.setdefault:

from collections import defaultdict

A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

idx = defaultdict(list)
for i, l in enumerate(A):
    idx[l].append(i)

res = [idx[l].pop() for l in B]
print(res)

Here are the timings for the two methods using the sample input given:

Script used for testing

from timeit import timeit


setup = """
from collections import defaultdict;
idx1 = defaultdict(list); idx2 = {}
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
"""

me = """
for i, l in enumerate(A):
    idx1[l].append(i)
res = [idx1[l].pop() for l in B]
"""

coldspeed = """
for i, l in enumerate(A):
    idx2.setdefault(l, []).append(i)
res = [idx2[l].pop() for l in B]
"""

print(timeit(setup=setup, stmt=me))
print(timeit(setup=setup, stmt=coldspeed))

Results

original: 2.601998388010543
modified: 2.0607256239745766

So it appears that using defaultdict actually yields a slight speed increase. This actually makes since though since defaultdict is implemented in C rather than Python. Not to mention that the attribute lookup of the original solution - idx.setdefault1 - is costly.

edited Aug 23 '17 at 05:28

answered Aug 23 '17 at 05:07

Christian Dean

22,138
7
54
87

Nice. Does it make a big difference? – cs95 Aug 23 '17 at 05:08
@cᴏʟᴅsᴘᴇᴇᴅ Alright, I update with some timings if you wanna to take a look. – Christian Dean Aug 23 '17 at 05:23
Nice. Thanks for the neat answer. – cs95 Aug 23 '17 at 05:26
@ChristianDean for '[ A.index(x) for x in B ]' it give 0.569697856903 – Kallz Aug 23 '17 at 05:40
@Kallz I don't have time to confirm that right now, but I imagine that as the size of `A` and `B` increased the time spent searching for `x` in `A` using `index` would increase exponential. – Christian Dean Aug 23 '17 at 05:51

cs95 · Answer 4 · 2017-08-23T05:05:16.117

As mentioned in my question, I was able to solve this using a dictionary. I store the indices in a dict and then use a list comprehension to pop them out:

A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

idx = {}
for i, l in enumerate(A):
    idx.setdefault(l, []).append(i)

res = [idx[l].pop() for l in B]
print(res)

Output:

[2, 3, 0, 1]

This is better than the obvious [A.index(x) for x in B] because it is

linear
handles duplicates gracefully

Eelco Hoogendoorn · Answer 5 · 2017-08-23T09:50:38.943

2

The numpy_indexed package has an efficient and general solution to this:

import numpy_indexed as npi
result = npi.indices(A, B)

Note that it has a kwarg to set a mode for dealing with missing values; and it works with nd-arrays of any type just the same, as it does with 1d integer arrays.

edited Aug 23 '17 at 09:50

answered Aug 23 '17 at 06:30

Eelco Hoogendoorn

10,459
1
44
42

Christian Dean · Answer 6 · 2017-08-23T06:36:50.000

Since several very nice solutions were posted, I've taken the liberty of assembling some crude timings to compare each method.

Script used for testing

from timeit import timeit


setup = """
from collections import defaultdict
import pandas as pd 
import numpy as np 
idx1 = defaultdict(list); idx2 = {}
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
"""

me = """
for i, l in enumerate(A):
    idx1[l].append(i)
res = [idx1[l].pop() for l in B]
"""

coldspeed = """
for i, l in enumerate(A):
    idx2.setdefault(l, []).append(i)
res = [idx2[l].pop() for l in B]
"""

divakar = """
sidx = np.argsort(B)
res = sidx[np.searchsorted(B,A, sorter=sidx)]
"""

dyz = """
res = pd.Series(A).reset_index().set_index(0).ix[B].T.values[0]
"""

print('mine:', timeit(setup=setup, stmt=me, number=1000))
print('coldspeed:', timeit(setup=setup, stmt=coldspeed, number=1000))
print('divakar:', timeit(setup=setup, stmt=divakar, number=1000))
print('dyz:', timeit(setup=setup, stmt=dyz, number=1000))

Result/Output (run on Jupyter notebook server. 1000 loops)

mine: 0.0026700650341808796
coldspeed: 0.0029303128831088543
divakar: 0.02583012101240456
dyz: 2.208147854078561

Here are some timings where the size of A is 100,000 random numbers. And B is its shuffled equivalent. The program was just too time and memory consuming. Also I had to reduce the number of loops to 100. Otherwise, everything is the same as above:

mine: 17.663535300991498
coldspeed: 17.11006522300886
divakar: 8.73397267702967
dyz: 44.61878849985078

Okay.. since you've done this much, can you try the same for a `random.randint()` array of size 100000, and its equivalent `random.shuffle()`'d version? — cs95, Aug 23 '17 at 05:57
@cᴏʟᴅsᴘᴇᴇᴅ Yup. I figured numpy would come out on top with bigger input. — Christian Dean, Aug 23 '17 at 06:37

Determine the shuffled indices of two lists/arrays

6 Answers6

Linked