This has been asked before by a lot of users (see for instance this question). However, I would like to find an answer that is both optimized (by which I mean fast and light on resources) and Pythonic.
What I have so far is simple (and I believe Pythonic):
>>> str1 = "Where are you going?"
>>> str2 = "Where are you ever going?"
>>> for i, (char1, char2) in enumerate(zip(str1, str2)):
...     if char1 != char2:
...         print(f"Found different characters at pos {i}.")
...         break
...
Found different characters at pos 14.
>>> str1[14:]
'going?'
>>> str2[14:]
'ever going?'
>>>
It looks simple and effective. It uses zip and enumerate in ways I believe to be idiomatic Python. You have to be careful if str1 and str2 aren't of the same length, obviously, so I would add an else clause to complement the for loop, but that's not the point. The thing is, if you have two strings of 5 MB each, this little loop would create more than 15M variables and allocate enough memory to store basically 20M Unicode characters, am I wrong? It might be of no concern, but I'm wondering if something more efficient exists.
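For the unequal-length case mentioned above, a minimal sketch of the loop with an else clause could look like this. The function name first_difference_loop and the choice to report the length of the shorter string when one string is a prefix of the other are assumptions made for illustration, not part of the original code.

```python
def first_difference_loop(str1, str2):
    """Return the index of the first position where the strings differ.

    Returns None when the strings are identical. When one string is a
    prefix of the other, returns the length of the shorter string
    (an assumed convention for this sketch).
    """
    for i, (char1, char2) in enumerate(zip(str1, str2)):
        if char1 != char2:
            index = i
            break
    else:
        # The loop ended without a break: every compared pair matched,
        # so any difference can only come from the lengths.
        if len(str1) != len(str2):
            index = min(len(str1), len(str2))
        else:
            index = None
    return index


print(first_difference_loop("Where are you going?", "Where are you ever going?"))  # prints: 14
print(first_difference_loop("abc", "abcd"))  # prints: 3
print(first_difference_loop("abc", "abc"))   # prints: None
```

The else branch of a for loop only runs when the loop finishes without hitting break, which is what makes it a natural place for the length check.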
Using regular expressions might sound like the right choice. But I believe it would actually slow the program down a lot and perhaps not be the best answer in terms of optimization. Personally, I find regular expressions harder to understand (and I have to admit I find my loop quite easy to read in comparison). Now, this is a performance question, so I might be off the mark by a lot!
Thanks for your feedback.
Solution
Doing it in one line is actually useful here. One-liners are not always faster to run, but thanks to the lazy evaluation of generators, we get something interesting:
>>> index = next((i for i, (char1, char2) in enumerate(zip(str1, str2)) if char1 != char2), None)
- The iterator itself, enumerate(zip(str1, str2)), yields tuples of the form (index, (char of str1 at that position, char of str2 at that position)).
- The comparison between the characters of each string (char1 and char2) only happens until the first differing character.
- This is due to lazy generators in Python. The wrapping next(generator expression, None) advances the generator expression until it yields a value (that is, until a position is found where the two strings differ). If that never happens, because both strings are equal or because they only differ beyond the length of the shorter one, then None is returned.
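Since None covers both the "equal strings" case and the "one string is a prefix of the other" case, one way to tell them apart is to wrap the expression in a small helper. This is only a sketch: the name first_difference and the convention of returning the length of the shorter string are assumptions, not something the one-liner does on its own.

```python
def first_difference(str1, str2):
    """Index of the first differing position, or None if the strings are equal."""
    index = next(
        (i for i, (char1, char2) in enumerate(zip(str1, str2)) if char1 != char2),
        None,
    )
    if index is None and len(str1) != len(str2):
        # zip() stopped at the end of the shorter string, so the strings
        # first differ right where the shorter one ends (assumed convention).
        return min(len(str1), len(str2))
    return index


print(first_difference("Where are you going?", "Where are you ever going?"))  # prints: 14
print(first_difference("Where", "Where are you going?"))  # prints: 5
print(first_difference("Where", "Where"))  # prints: None
```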
In other words, not only is this code optimized (it does no more comparisons than necessary and stops scanning both strings as soon as it finds a differing character, even at index 0 or 1), but it also relies on Python's own optimization for generators. I trust Python to run this line fast enough ("fast enough for me").
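To check the claim about laziness, the sketch below feeds str1 through a hypothetical counting generator (counting and consumed are names introduced here for illustration) and records how many characters are actually pulled before next returns.

```python
def counting(chars, counter):
    """Yield characters one by one, recording how many were consumed."""
    for ch in chars:
        counter[0] += 1
        yield ch


str1 = "Where are you going?"
str2 = "Where are you ever going?"

consumed = [0]  # single-item list so the generator can update the count
pairs = enumerate(zip(counting(str1, consumed), str2))
index = next((i for i, (char1, char2) in pairs if char1 != char2), None)

# Only the characters up to and including the first mismatch were read.
print(index, consumed[0])  # prints: 14 15
```

In Python 3, zip and enumerate are iterators that produce one pair at a time, so nothing here builds a full list of tuples for the two strings, whatever their size.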