2

I can use a for loop to loop over two byte sequences and return the index at the first difference of course:

bytes1 = b'12345'
bytes2 = b'1F345'
for index, pair in enumerate(zip(bytes1, bytes2)):
    if pair[0] != pair[1]:
        print(index)
        break

But I don't think that's a smart and fast way to do it. I would hope a native method exists that I can call to get this done. Is there something that can help me here? I can also use numpy if it helps.

I also want to clarify that this will run many times, with medium sequences. Approximately 300MB is expected, chunked by 100kB. I might be able to change that for larger if it helps significantly.

Tomáš Zato
  • 50,171
  • 52
  • 268
  • 778

2 Answers2

2

a solution with numpy is to convert them to an array of uint8 then xor them and use argmax to get the first non-zero.

import numpy as np
bytes1 = b'12345'
bytes2 = b'1F345'
bytes3 = np.frombuffer(bytes1,dtype=np.uint8)
bytes4 = np.frombuffer(bytes2,dtype=np.uint8)
max_loc = np.flatnonzero(bytes3 ^ bytes4)[0]
print(max_loc)
1

problem is that this still iterates to the end of the string on all functions, it's done in C so it is not too slow, slicing long array into multiple smaller ones can reduce the overhead for very long arrays.

Edit: modified argmax to the correct flatnonzero as pointed by @jasonharper, which throws an indexError if both inputs are equal.

Ahmed AEK
  • 8,584
  • 2
  • 7
  • 23
  • 1
    This won't actually work - it returns the index of *a* difference, but not necessarily the first difference. `np.flatnonzero(bytes3 ^ bytes4)[0]` works, but throws an IndexError if the inputs are equal. – jasonharper Nov 23 '22 at 20:19
  • we got `indexError` for : `bytes1 = b'12345' bytes2 = b'12345'` – I'mahdi Nov 23 '22 at 20:31
  • @I'mahdi checking the array for size 0 before accessing it or using a try block for performance to catch the error is up to the python version, there is a big difference in performance between 3.8 and 3.11, as try blocks have become cheaper but errors have become more expensive. – Ahmed AEK Nov 23 '22 at 20:40
1

If using numba is ok:

import numba
@numba.jit()
def method2(bytes1, bytes2):
    idx = 0
    while idx < len(bytes1):
        if bytes1[idx] != bytes2[idx]:
            return idx
        idx += 1
    return idx

Note that first run of this function will be significantly slower (due to compilation performed by numba). Takes like 2 seconds.

Then, for each next run of the function:

  • easy case you posted, index = 1 -> numba is 2x faster,
  • for index = 100 -> numba is 33x faster
dankal444
  • 3,172
  • 1
  • 23
  • 35
  • I cannot use anything else out of `numpy` and vanilla python. If that's the only way to do that to get significant performance boost, I will just have to use the naive loop. – Tomáš Zato Nov 23 '22 at 23:39
  • @TomášZato then you may want to look at [this answer](https://stackoverflow.com/a/71880904/4601890). Subtract one bytes array with the other and search for first non-zero. Preferably in chunks, so you do not have to subtract everything (maybe your 100kB chunking is already good enough, depends on statistics of bytes differences) – dankal444 Nov 24 '22 at 00:38
  • Thanks for the advice. But I don't think I inderstand how is searching for the first non zero any faster than looking for the first none equal numbers. It could be, but I don't think the comparison operation is the bottleneck here. – Tomáš Zato Nov 24 '22 at 13:08
  • @TomášZato it is about using numpy and its vectorization, Ahmed answer contains similar approach – dankal444 Nov 24 '22 at 13:39