Is there a fast way to find the first offset at which two large byte sequences differ?

Question

I can use a for loop to loop over two byte sequences and return the index at the first difference of course:

bytes1 = b'12345'
bytes2 = b'1F345'
for index, pair in enumerate(zip(bytes1, bytes2)):
    if pair[0] != pair[1]:
        print(index)
        break

But I don't think that's a smart and fast way to do it. I would hope a native method exists that I can call to get this done. Is there something that can help me here? I can also use numpy if it helps.

I also want to clarify that this will run many times, with medium sequences. Approximately 300MB is expected, chunked by 100kB. I might be able to change that for larger if it helps significantly.

What exactly are you trying to optimise for? You will get very different performance for short versus long sequences, or rather similar versus dissimilar sequences — MisterMiyagi, Nov 23 '22 at 19:53
@MisterMiyagi Thanks for asking, I really should have clarified that. I have done so now. — Tomáš Zato, Nov 23 '22 at 23:37

Ahmed AEK · Answer 1 · 2022-11-23T20:32:26.793

2

a solution with numpy is to convert them to an array of uint8 then xor them and use argmax to get the first non-zero.

import numpy as np
bytes1 = b'12345'
bytes2 = b'1F345'
bytes3 = np.frombuffer(bytes1,dtype=np.uint8)
bytes4 = np.frombuffer(bytes2,dtype=np.uint8)
max_loc = np.flatnonzero(bytes3 ^ bytes4)[0]
print(max_loc)

problem is that this still iterates to the end of the string on all functions, it's done in C so it is not too slow, slicing long array into multiple smaller ones can reduce the overhead for very long arrays.

Edit: modified argmax to the correct flatnonzero as pointed by @jasonharper, which throws an indexError if both inputs are equal.

edited Nov 23 '22 at 20:32

answered Nov 23 '22 at 19:50

Ahmed AEK

8,584
2
7
23

1

This won't actually work - it returns the index of *a* difference, but not necessarily the first difference. `np.flatnonzero(bytes3 ^ bytes4)[0]` works, but throws an IndexError if the inputs are equal. – jasonharper Nov 23 '22 at 20:19
we got `indexError` for : `bytes1 = b'12345' bytes2 = b'12345'` – I'mahdi Nov 23 '22 at 20:31
@I'mahdi checking the array for size 0 before accessing it or using a try block for performance to catch the error is up to the python version, there is a big difference in performance between 3.8 and 3.11, as try blocks have become cheaper but errors have become more expensive. – Ahmed AEK Nov 23 '22 at 20:40

score 1 · Answer 2 · answered Nov 23 '22 at 20:43

1

If using numba is ok:

import numba
@numba.jit()
def method2(bytes1, bytes2):
    idx = 0
    while idx < len(bytes1):
        if bytes1[idx] != bytes2[idx]:
            return idx
        idx += 1
    return idx

Note that first run of this function will be significantly slower (due to compilation performed by numba). Takes like 2 seconds.

Then, for each next run of the function:

easy case you posted, index = 1 -> numba is 2x faster,
for index = 100 -> numba is 33x faster

answered Nov 23 '22 at 20:43

dankal444

3,172
1
23
35

I cannot use anything else out of `numpy` and vanilla python. If that's the only way to do that to get significant performance boost, I will just have to use the naive loop. – Tomáš Zato Nov 23 '22 at 23:39
@TomášZato then you may want to look at [this answer](https://stackoverflow.com/a/71880904/4601890). Subtract one bytes array with the other and search for first non-zero. Preferably in chunks, so you do not have to subtract everything (maybe your 100kB chunking is already good enough, depends on statistics of bytes differences) – dankal444 Nov 24 '22 at 00:38
Thanks for the advice. But I don't think I inderstand how is searching for the first non zero any faster than looking for the first none equal numbers. It could be, but I don't think the comparison operation is the bottleneck here. – Tomáš Zato Nov 24 '22 at 13:08
@TomášZato it is about using numpy and its vectorization, Ahmed answer contains similar approach – dankal444 Nov 24 '22 at 13:39

Is there a fast way to find the first offset at which two large byte sequences differ?

2 Answers2