
I have a problem with efficiency in Python. For several days now I have tried to find and test different implementations, but I just can't get it to work fast enough.

Task: I have two lists arr and arr2 with 1000000 entries each of type b'' (the results of encryption and decryption with keys "00000000" to "00999999"), and I now have to find a match between the two (double-DES meet-in-the-middle). That is:

if arr[x] == arr2[y]:
    return x, y

I think my solution would work, but it's so slow that it isn't enough:

def g(x):
    # Linear scan of arr2 for a value equal to arr[x]
    for k in range(1000000):
        if arr[x] == arr2[k]:
            return k, x
    return 0

keypair = 0

# Outer scan over arr, so this is O(n^2) comparisons overall
for k in range(1000000):
    if keypair == 0:
        keypair = g(k)
    else:
        break

The problem is that the keypair should be found in less than 40 seconds, but this version takes about 1 minute to process just 1000 of the 1000000 entries.

Now the efficiency has to be improved somehow. I tried some other solutions that promised better performance, e.g. map and NumPy arrays, but didn't see any real speedup.

Do you know any smart solutions?

Timotheus
  • Python is rather slow, and it especially shows in tight loops. If pypy doesn't work code it in c (cython). – simonzack Dec 23 '14 at 12:36
  • why do you have two loops? as simonzack said cython will markedly improve performance. – Padraic Cunningham Dec 23 '14 at 12:36
  • This video from scipy 2013 will show you what to do https://www.youtube.com/watch?v=KJFaJoJCqlU – Padraic Cunningham Dec 23 '14 at 12:40
  • You can use parallel processing. Create pool of processes. Each process runs on g(). Break the range (1000000). Give range,x to each process. You can get result in <40 sec. – Netro Dec 23 '14 at 12:42
  • Have you tried sorting the lists, with something like `sorted(enumerate(a), key = lambda x: x[1])` to preserve the original indices? – fizzer Dec 23 '14 at 12:47
  • Ok, but the task was given with python as tool. It was not the intention to use cython or something else. I could however look up cython if there is no smart solution in python. And parallel processing. I already tried around with pools and fork() but either i didn't get it to work, or it was still very slow. – Timotheus Dec 23 '14 at 12:53
  • Is [*this*](http://stackoverflow.com/a/4299378/23771) of any help? – Mike Dunlavey Dec 23 '14 at 13:40
  • I'm probably missing something obvious, but would set(arr1).intersection(set(arr2)) get you what you want? (Potentially following up with index if you also need the locations of the matches.) – iayork Dec 23 '14 at 14:20
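A minimal sketch of the set-intersection idea from the last comment, with small toy lists standing in for the million-entry arr and arr2:

```python
# Toy data standing in for the two million-entry ciphertext lists.
arr = [b'aa', b'bb', b'cc', b'dd']
arr2 = [b'xx', b'cc', b'yy']

# Intersecting the two sets finds any common value in O(n) average time.
common = set(arr).intersection(arr2)

if common:
    match = next(iter(common))
    # Recover the original positions with list.index (an O(n) scan each).
    x, y = arr.index(match), arr2.index(match)
    print(x, y)  # -> 2 1
```

If several matches are possible, `common` contains all of them and you can look up each one's indexes the same way.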

1 Answer


I'm not sure whether you only care about existence or you also need the indexes into the lists. This solution covers the index case. If you only need to check existence, build g as a set instead of a dictionary and check whether each k in arr2 is in g.

If space is not an issue, try a dictionary:

# Map each value in arr to its index; O(n) to build
g = {k: i for i, k in enumerate(arr)}

keypair = None
for i, k in enumerate(arr2):
    if k in g:
        keypair = (i, g[k])
        break

Hash tables have amortized O(1) find operations.

Anyway, writing this in Cython is not hard (even using native types), but if you don't change the algorithm and keep the O(n^2) approach, you will still miss your target.

This approach is O(n), but it uses a lot of memory and may have high constants. At your problem size (n = 1000000) it should be the best option, but sorting the values and then exploiting the ordering can also be fast, even though it is an O(n log n) approach. If the first one fails, try sorting the data.
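The sort-then-merge idea can be sketched like this, keeping the original indices via enumerate as suggested in the comments (the toy lists here are placeholders for the real data):

```python
def find_match_sorted(arr, arr2):
    # Sort (original_index, value) pairs by value so one merge-style pass works.
    a = sorted(enumerate(arr), key=lambda p: p[1])
    b = sorted(enumerate(arr2), key=lambda p: p[1])
    i = j = 0
    # Advance whichever side holds the smaller value, like a merge step.
    while i < len(a) and j < len(b):
        if a[i][1] == b[j][1]:
            return b[j][0], a[i][0]   # (index in arr2, index in arr)
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return None

arr = [b'dd', b'aa', b'cc']
arr2 = [b'bb', b'cc', b'ee']
print(find_match_sorted(arr, arr2))  # -> (1, 2)
```

The two sorts cost O(n log n) and the merge pass is O(n), and nothing beyond the two sorted copies is kept in memory.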

Michele d'Amico