3

I'm trying to compare two 1000-byte strings and would like to know exactly where the difference starts, i.e., from which byte the strings differ. Is there a function to determine this?

user1050619

6 Answers

9

Maybe use next plus a generator?

next(idx for idx,c in enumerate(your_string1) if c != your_string2[idx])

This will give you the index where the difference starts and raise StopIteration if they are equal.
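If you'd rather not catch the exception, next() also accepts a default value; a small variation on the snippet above (my addition, not part of the original):

next((idx for idx, c in enumerate(your_string1) if c != your_string2[idx]), -1)

This returns -1 instead of raising StopIteration when the strings are equal.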

It might even be slightly more elegant with itertools.izip:

next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)

Example:

>>> from itertools import izip
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
12
>>> s1[12]
'b'
>>> s2[12]
'c'
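If the strings can have different lengths, a variant with zip_longest (named izip_longest in Python 2) also reports the index where the shorter string ends; this example is my addition under that assumption:

>>> from itertools import zip_longest  # Python 3; izip_longest in Python 2
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the bus stop'
>>> next(idx for idx, (c1, c2) in enumerate(zip_longest(s1, s2)) if c1 != c2)
15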
mgilson
  • One of my favorite patterns in Python. I've often wondered whether this was foreseen when genexes were introduced, or whether it's just the natural result of good design. – senderle Oct 11 '12 at 15:47
  • @senderle -- Yeah, I'm not completely sure. I'd guess that it was foreseen, but I suppose to know for sure we'd need to ask Guido and the python devs. It's quickly becoming one of my favorite patterns as well. Very useful. – mgilson Oct 11 '12 at 16:35
4

I tried to test the answers given here and came up with another solution that is faster in the usual cases, even though it is less elegant.

First of all, let's see how fast the proposed solutions are:

In [15]: def check_genexp(a, b):
    ...:     return next(idx for idx, c in enumerate(a) if c != b[idx])

In [16]: %timeit check_genexp("a"*9999 + "b", "a"*9999 + "c")
1000 loops, best of 3: 1.04 ms per loop

In [17]: from difflib import SequenceMatcher

In [18]: def check_matcher(a, b):
    ...:     return SequenceMatcher(a=a, b=b).get_matching_blocks()[0]
    ...: 

In [19]: %timeit check_matcher("a"*9999+"b", "a"*9999+"c")
100 loops, best of 3: 11.5 ms per loop

As you can see, the genexp is a lot faster than difflib, but that is probably because SequenceMatcher does a lot more than find the first non-equal character.

Now, how can we speed things up? Well, we can use binary search! The idea is that if two strings are not equal, then either the first halves differ or the second halves differ (or both, but in that case we only care about the first half, since we want the first differing index).

So we can do something like this:

def binary_check(a, b):
    len_a, len_b = len(a), len(b)
    if len_a == len_b:
        return binary_check_helper(a, b)
    # For different lengths, compare the common prefix; if it is equal,
    # the first difference is where the shorter string ends.
    min_length = min(len_a, len_b)
    res = binary_check_helper(a[:min_length], b[:min_length])
    return res if res >= 0 else min_length

def binary_check_helper(a, b):
    # Returns the first differing index, or -1 if the strings are equal.
    if a == b:
        return -1
    length = len(a)

    if length == 1:
        # a == b was ruled out above, so the single characters differ.
        return 0
    else:
        half_length = length // 2
        r = binary_check_helper(a[:half_length], b[:half_length])
        if r >= 0:
            return r
        r = binary_check_helper(a[half_length:], b[half_length:])
        if r >= 0:
            return r + half_length
        return r
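As a quick sanity check (my addition, not part of the original answer), we can compare binary_check against a genexp version on random inputs:

import random
import string

def first_diff_genexp(a, b):
    # -1 when the strings are equal
    return next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), -1)

for _ in range(1000):
    s = ''.join(random.choice(string.ascii_lowercase) for _ in range(256))
    i = random.randrange(256)
    t = s[:i] + '#' + s[i + 1:]  # differs from s exactly at index i
    assert binary_check(s, t) == first_diff_genexp(s, t) == i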

And the result:

In [34]: %timeit binary_check("a"*9999 + "b", "a"*9999 + "c")
10000 loops, best of 3: 28.4 µs per loop

That's more than thirty-five times faster than the genexp!

Why does this work? The comparisons obviously take linear time, so it looks like we are doing a lot more work than before... and that's indeed true, but it's done at the "C level", and thus the method ends up being faster overall.

Note that this is somewhat implementation-specific: implementations such as PyPy could probably optimize the genexp into a single C-level loop, and that would beat anything; on implementations like Jython or IronPython it could also be a lot slower than on CPython.

This method has the same asymptotic complexity as the other methods, i.e. O(n). The strings are split in half at most log2(n) times, and each time an equality test is done, which takes linear time. At first sight it may seem a Θ(n log n) algorithm, but that's not the case. The recurrence is:

T(n) = T(n/2) + Θ(n) = Σ_{i=0}^{log n} Θ(n / 2^i)
     = Θ(n · (1 + 1/2 + 1/4 + ...)) ≤ Θ(2n) = Θ(n)

Some more results:

In [37]: %timeit binary_check("a"*10**6 + "b", "a"*10**6 + "c")
100 loops, best of 3: 2.84 ms per loop

In [38]: %timeit check_genexp("a"*10**6 + "b", "a"*10**6 + "c")
10 loops, best of 3: 101 ms per loop

In [39]: %timeit binary_check(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
10 loops, best of 3: 53.3 ms per loop

In [40]: %timeit check_genexp(15 * "a"*10**6 + "b", 15 * "a"*10**6 + "c")
1 loops, best of 3: 1.5 s per loop

As you can see, even with huge strings this method is still about thirty times faster.

Note: the downside of this solution is that it is Θ(n) and not O(n), i.e. it always reads both strings in full before returning the result, even when the first character already differs. In fact:

In [49]: a = "b" + "a" * 15 * 10**6
    ...: b = "c" + "a" * 15 * 10**6
    ...: 

In [50]: %timeit binary_check(a, b)
100 loops, best of 3: 10.3 ms per loop

In [51]: %timeit check_genexp(a, b)
1000000 loops, best of 3: 1.3 µs per loop

This is to be expected. However, it doesn't take much for this solution to become more performant than the explicit loop:

In [59]: a = "a" * 2 * 10**5 + "b" + "a" * 15*10**6
    ...: b = "a" * 2 * 10**5 + "c" + "a" * 15*10**6

In [60]: %timeit check_genexp(a, b)
10 loops, best of 3: 20.3 ms per loop

In [61]: %timeit binary_check(a, b)
100 loops, best of 3: 17.3 ms per loop

According to this simple benchmark, for a big string the binary check wins whenever the first difference lies deeper than about 1.3% of the total length.

It is also possible to introduce some heuristics. For example, if the minimum length of the two strings is greater than a certain cutoff value, you first check only whether the prefixes up to that cutoff are different; if they are, you can disregard everything after the cutoff, thus avoiding comparing the whole strings. This can be trivially implemented:

def binary_check2(a, b, cutoff=1000):
    len_a, len_b = len(a), len(b)
    if min(len_a, len_b) > cutoff:
        small_a, small_b = a[:cutoff], b[:cutoff]
        if small_a != small_b:
            # The difference is inside the prefixes: only search there.
            return binary_check_helper(small_a, small_b)
    # Same as binary_check from here on.
    if len_a == len_b:
        return binary_check_helper(a, b)
    min_length = min(len_a, len_b)
    res = binary_check_helper(a[:min_length], b[:min_length])
    return res if res >= 0 else min_length
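A quick illustrative run (mine, not from the original answer):

a = "b" + "a" * 10**6
b = "c" + "a" * 10**6
print(binary_check2(a, b))  # 0 -- only the first `cutoff` characters are ever compared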

Depending on the application you can choose a cutoff that minimizes the average time. In any case this is an ad hoc heuristic that may or may not work well, so if you are dealing with very long strings that have only short common prefixes, you should use a "fail-fast" algorithm such as the genexp approach.


Timings were performed on Python 3.4. Using bytes instead of unicode strings doesn't significantly change the results.

Bakuriu
  • +1 -- clever! The fact that this works so well amuses me. I wonder how long the string can get before this strategy becomes slower. – senderle Oct 11 '12 at 15:41
  • @senderle: a lot. I've added some more "profiling" runs, and you can see that it's quite hard to make it less than 8 times faster. – Bakuriu Oct 11 '12 at 15:47
  • Great! Although less pythonic than gen-expression and the fact that OP hasn't specifically asked for performance, I love this approach as it's shown me a better way for coding many things. Well not really new, it's something I already knew! – 0xc0de Dec 30 '16 at 05:15
2
for i, (x, y) in enumerate(zip(a, b)):
    if x != y:
        print('First index where strings are different:', i)
        break
else:
    print('Strings are identical.')

In Python 2.x, zip() returns a list of tuples, not an iterator. As gnibbler pointed out, if you're using Python 2.x, it might be worth your while to use izip rather than zip (izip returns a nice, memory-efficient iterator that avoids evaluating all of the values at once). As I said in the comments though, in Python 3, izip has been renamed zip and the old zip is gone.
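A small compatibility shim along those lines (my sketch, not from the original answer):

try:
    from itertools import izip as zip  # Python 2: use the lazy variant
except ImportError:
    pass  # Python 3: the built-in zip is already lazy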

Joel Cornett
2

If you want something more complex, you can have a look at SequenceMatcher.

It is a bit hairy, but very powerful. If you simply want to answer your question, then:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print s.get_matching_blocks()[0].size

returns the solution :)

But if you want all the matches :

Small example:

from difflib import SequenceMatcher

s1 = 'stop at the bus'
s2 = 'stop at the car'

s = SequenceMatcher(None, s1, s2)

print s.get_matching_blocks()

returns

[Match(a=0, b=0, size=12), Match(a=15, b=15, size=0)]

which means that the longest match in your strings is of size 12 and starts at the beginning (index 0). The other "match", starting at s1[15] with size 0, is a dummy that get_matching_blocks() always appends as the last element.

For big strings like yours, this is something that could be really interesting. :)
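For instance, with a second common chunk (an illustrative example of mine, not from the original answer):

from difflib import SequenceMatcher

s1 = 'stop at the bus stop'
s2 = 'stop at the car stop'

print(SequenceMatcher(None, s1, s2).get_matching_blocks())

prints

[Match(a=0, b=0, size=12), Match(a=15, b=15, size=5), Match(a=20, b=20, size=0)]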

jlengrand
1
>>> s1 = 'stop at the bus'
>>> s2 = 'stop at the car'
>>> import difflib
>>> difflib.SequenceMatcher(a=s1, b=s2).get_matching_blocks()[0]
Match(a=0, b=0, size=12)

This means that the first matching block is 12 characters long.

If either a or b isn't 0, the strings differ right from the beginning.
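Putting that together, a small sketch of mine using the match.size if match.a == match.b == 0 else 0 trick mentioned in the comments below:

import difflib

def first_diff(s1, s2):
    match = difflib.SequenceMatcher(a=s1, b=s2).get_matching_blocks()[0]
    # the size of the first block is the common-prefix length
    # only when that block starts at (0, 0)
    return match.size if match.a == match.b == 0 else 0

print(first_diff('stop at the bus', 'stop at the car'))  # 12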

John La Rooy
  • Indeed. This question made me discover a really interesting lib. – jlengrand Oct 11 '12 at 13:41
  • @jlengrand, yes it's very powerful, but not very clean for this particular question. – John La Rooy Oct 11 '12 at 13:43
  • @gnibbler -- I've not used `difflib` much, so it's nice to see this example. Thanks. (and +1 -- I think it's reasonably clean -- Certainly not much worse than `next( ... enumerate(izip(s1,s2)) ...)`) – mgilson Oct 11 '12 at 13:47
  • @gnibbler Funny that the guy with more points gets the +1, even though the other one posted before with the same solution :D. Why would you call it unclean? Better use a standard lib than custom code that could be buggy, isn't it ? – jlengrand Oct 11 '12 at 13:49
  • @jlengrand, because in this case you can't just check the size to get the result. you have to do something like `match.size if match.a==match.b==0 else 0` – John La Rooy Oct 11 '12 at 13:57
1

This might be overkill, but since you seem to be concerned about speed, you could consider using numpy. There are probably improvements to be made (for some reason inlining made a 25 us difference for me), but this is a first step:

>>> import numpy
>>> import string
>>> def diff_index(s1, s2):
...     s1 = numpy.fromstring(s1, dtype=numpy.uint8)
...     s2 = numpy.fromstring(s2, dtype=numpy.uint8)
...     return (~(s1 == s2)).nonzero()[0][0]
... 
>>> base = string.lowercase * 385
>>> s1 = base + 'a'
>>> s2 = base + 'z'
>>> diff_index(s1, s2)
10010

For differences at the end, this is a lot faster than a genexp:

>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 1.46 ms per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.6 us per loop

It's a lot slower for differences at the very beginning...

>>> s1 = 'a' + base
>>> s2 = 'z' + base
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
100000 loops, best of 3: 2.12 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.5 us per loop

But on average, it wins by an order of magnitude:

>>> s1 = base[:5000] + 'a' + base[5000:]
>>> s2 = base[:5000] + 'z' + base[5000:]
>>> %timeit next(idx for idx,(c1,c2) in enumerate(izip(s1,s2)) if c1 != c2)
1000 loops, best of 3: 724 us per loop
>>> %timeit diff_index(s1, s2)
10000 loops, best of 3: 87.2 us per loop

If speed is not a concern, though, then I'd personally go for mgilson's answer.
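As a side note, numpy.fromstring is deprecated in recent numpy releases; on Python 3 an equivalent sketch (my adaptation, which also returns -1 for identical strings instead of raising IndexError) could look like:

import numpy

def diff_index_safe(s1, s2):
    # assumes equal-length strings whose characters each fit in one byte
    a1 = numpy.frombuffer(s1.encode(), dtype=numpy.uint8)
    a2 = numpy.frombuffer(s2.encode(), dtype=numpy.uint8)
    diffs = (a1 != a2).nonzero()[0]  # indices of all differing bytes
    return int(diffs[0]) if diffs.size else -1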

senderle