Finding first different character in two strings

Question

This has been asked before by a lot of users (see for instance this question). However, I would like to find an answer that is both optimized (fast, without using more memory than necessary) and Pythonic.

What I have so far is simple (and I believe Pythonic):

>>> str1 = "Where are you going?"
>>> str2 = "Where are you ever going?"
>>> for i, (char1, char2) in enumerate(zip(str1, str2)):
...     if char1 != char2:
...         print(f"Found different characters at pos {i}.")
...         break
...
Found different characters at pos 14.
>>> str1[14:]
'going?'
>>> str2[14:]
'ever going?'
>>>

It looks simple and effective, and it uses zip and enumerate in ways I believe to be idiomatic Python. You obviously have to be careful if str1 and str2 aren't of the same length, so I would add an else clause to the for loop, but that's not the point.

The thing is, if you have two strings of 5 MB each, wouldn't this little loop create more than 15M variables and allocate enough memory to store basically 20M Unicode characters? It might be of no concern, but I'm wondering if something more efficient exists.
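The else clause mentioned above could look like the following minimal sketch. The function name first_difference is hypothetical, and reporting the length of the shorter string when one string is a prefix of the other is an assumption, not something the question specifies.

```python
def first_difference(str1, str2):
    """Return the index of the first differing character, or None if the strings are equal.

    Hypothetical helper: when one string is a prefix of the other, the
    length of the shorter string is reported as the first difference.
    """
    for i, (char1, char2) in enumerate(zip(str1, str2)):
        if char1 != char2:
            index = i
            break
    else:
        # The loop finished without break: every compared pair matched.
        if len(str1) == len(str2):
            index = None  # the strings are identical
        else:
            index = min(len(str1), len(str2))  # one string is a prefix of the other
    return index


print(first_difference("Where are you going?", "Where are you ever going?"))
```

With the strings from the question this prints 14, the same position the loop above reports.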

Using regular expressions might sound like the right choice. But I believe it would actually slow the program down a lot and perhaps not be the best answer in terms of optimization. Personally, I find regular expressions harder to understand (and I have to admit I find my loop quite easy to read in comparison). Now, this is a performance question, so I might be off the mark by a lot!

Thanks for your feedback.

Solution

Doing it in one line is actually useful. One-liners are not always faster to run, but thanks to the lazy evaluation of generators, this one has an interesting property:

>>> index = next((i for i, (char1, char2) in enumerate(zip(str1, str2)) if char1 != char2), None)

The iterator itself, enumerate(zip(str1, str2)), yields tuples of the form (index, (char of str1 at that position, char of str2 at that position)), one at a time.
The comparisons between the characters of each string (char1 and char2) happen only until the first different character.
This is due to lazy generators in Python. The wrapping next(generator expression, None) advances the generator expression until it yields a value (that is, until the two strings differ at some position). If that never happens, because both strings are equal or the difference lies beyond the length of the shorter string, then None is returned.

In other words, not only is this code optimized (it doesn't do more comparisons than necessary, and it doesn't walk through both strings if it finds a different character at index 0 or 1), but it also relies on Python's built-in lazy evaluation of generators. I trust Python to run this line fast enough ("fast enough for me").
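One way to check the laziness claim is to count how many character pairs are actually consumed. This is only an illustrative sketch: counted_pairs and consumed are hypothetical names introduced here, and the test strings are made up.

```python
def counted_pairs(str1, str2, consumed):
    """Yield the same pairs as zip(str1, str2), counting each one in consumed[0]."""
    for pair in zip(str1, str2):
        consumed[0] += 1
        yield pair


# Two long strings that differ at index 1 only.
str1 = "ab" + "x" * 1_000_000
str2 = "aB" + "x" * 1_000_000

consumed = [0]
index = next(
    (i for i, (char1, char2) in enumerate(counted_pairs(str1, str2, consumed))
     if char1 != char2),
    None,
)

# Only the pairs up to and including the first difference were read.
print(index, consumed[0])
```

Here index is 1 and only 2 pairs are consumed, even though each string holds more than a million characters.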

score 3 · Accepted Answer · answered Sep 10 '19 at 11:20

You may get the index with:

index = next((i for i in range(min(len(str1), len(str2))) if str1[i] != str2[i]), None)

It will return None if no difference is found within the length of the shorter string, for instance if they are the same.

if index is not None:
    print(str1[index:], str2[index:])
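Since the question is about performance, the range-based form and the zip-based form can be compared with timeit. The sketch below makes no claim about which one is faster; the function names and the test strings are hypothetical, and results will vary with the machine and the Python version.

```python
import timeit


def first_diff_zip(str1, str2):
    """zip/enumerate form: index of the first difference, or None."""
    return next((i for i, (c1, c2) in enumerate(zip(str1, str2)) if c1 != c2), None)


def first_diff_range(str1, str2):
    """range/indexing form: index of the first difference, or None."""
    return next((i for i in range(min(len(str1), len(str2))) if str1[i] != str2[i]), None)


# Hypothetical test data: identical except for the very last character.
base = "a" * 100_000
str1 = base + "x"
str2 = base + "y"

for func in (first_diff_zip, first_diff_range):
    # Each call scans the whole common prefix before finding the difference.
    seconds = timeit.timeit(lambda: func(str1, str2), number=10)
    print(f"{func.__name__}: {seconds:.3f} s for 10 runs")
```

Both functions should return the same index for the same input, so the comparison only measures the cost of iterating.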

answered Sep 10 '19 at 11:20

ipaleka


But it answers the question completely - the provided code that should be optimized will not locate the index for your strings either. – ipaleka Sep 10 '19 at 12:00
Thanks, that works and it uses various strengths of Python. Just a quick check here: will doing the search in a generator be more optimized than running the loop as I did? The loop has one advantage, namely that it stops at the first different index, whereas your generator lists all the different indexes, so it keeps on running. But the loop is probably less optimized all in all. – vincent-lg Sep 11 '19 at 08:35
The `next` stops at the very first occurrence - that None is there in case the iteration ends without finding any; `all`, `min` and `max` need all the values checked. – ipaleka Sep 11 '19 at 11:52
Thanks again. I edited the question to include your solution. I updated the generator expression to be `index = next((i for i, (char1, char2) in enumerate(zip(str1, str2)) if char1 != char2), None)`, because I find `zip` actually easier to read than `range(len)`. I had skipped the fact that generators are lazy in Python, thanks so much! – vincent-lg Sep 11 '19 at 20:02
