Python - difference between two strings

Question

I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word afrykanerskojęzyczny and many of words like afrykanerskojęzycznym, afrykanerskojęzyczni, nieafrykanerskojęzyczni. What is the effective (fast and giving small diff size) solution to find difference between two strings and restore second string from the first one and diff?

What do you mean by **"restore the second string from the first one and diff"?** — jrd1, Jul 28 '13 at 01:29
I believe he means "Make the second string the same as the first". — Elias Benevedes, Jul 28 '13 at 01:39
Are you looking for something like `difflib`? If so, see, e.g., http://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline — torek, Jul 28 '13 at 02:35
I want to change one string to the other only on the "add commands". How does one do that with their api? — Charlie Parker, Jul 22 '22 at 13:45

score 165 · Answer 1 · edited May 23 '23 at 06:10

165

You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.

A simple example:

import difflib

cases=[('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
       ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
       ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
       ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
       ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
       ('abcdefg','xac')] 

for a,b in cases:     
    print('{} => {}'.format(a,b))  
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    
    print()

prints:

afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20

afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2

afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20

nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2

nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16

abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7

edited May 23 '23 at 06:10

Lenormju

4,078
2
8
22

answered Jul 28 '13 at 04:25

dawg

98,345
23
131
206

28

+1 Python has *so* many useful modules. It seems that I learn about a new one each day. – arshajii Jul 28 '13 at 04:29
1

This is stepping through the difference manually; restoring the different between the two strings, of course, is much easier with [difflib.restore](http://docs.python.org/2/library/difflib.html#difflib.restore) – dawg Jul 28 '13 at 04:48
Thanks! But i'm not sure if this is memory efficient. list(difflib.ndiff("afrykanerskojęzyczny","nieafrykanerskojęzyczny")) ['+ n', '+ i', '+ e', ' a', ' f', ' r', ' y', ' k', ' a', ' n', ' e', ' r', ' s', ' k', ' o', ' j', ' ę', ' z', ' y', ' c', ' z', ' n', ' y'] – user2626682 Jul 28 '13 at 11:08
1

`ndiff` is a generator so it is quite memory efficient. You are calling `list` on it which turns the individually generated character comparisons into a full list of them. You would only have a few in memory at a time if you did't call `list` on it. – dawg Jul 28 '13 at 13:54
Ok, but how can I save ndiff results to a list and save this list to a file? It is not possible to pickle generators in python. – user2626682 Jul 28 '13 at 14:43
Write the file as you go, then you do not care, right? Only save the differences, and, if the differences are small, that is its own compression. At a certain point -- the differences are what they are and they take up space. Skip the parts that are not differences. – dawg Jul 28 '13 at 20:46
Ok, but how can I convert ndiff results (generator object) to small object (smaller then list(ndiff))something that I can pickle? Should I convert to a list, and replace characters without "+" or "-" with None? It won't save size in bytes. If I remove chars completely i'll miss some information. – user2626682 Jul 28 '13 at 22:46
Just keep track of changes in a list. You don't need to keep the characters that are the same. To convert `afrykanerskojęzyczny => afrykanerskojęzycznym` just keep a list that says, in essence, 'delete "m" at position 20' You only need to keep string 1 and the net differences to produce string 2. It is all there for you. – dawg Jul 29 '13 at 15:09
I tried this with `case = [('cat', 'bat')]` and it starts going crazy and giving wrong positions... – Seekheart May 06 '16 at 15:06
I just tried that and I get `cat => bat Delete "c" from position 0 Add "b" to position 1` which is what is expected... Python 3.5.1 – dawg May 06 '16 at 17:11
1

Works on Python 2 as well (for me) I would suggest asking a question with the specific source and specific output. I cannot debug in comments... – dawg May 06 '16 at 17:24
**NOTE:** anagram-ish strings like `amalgam` and `malaga` may give hard-to-interpret results – Nathan majicvr.com Sep 01 '19 at 01:24
docs for python 3: https://docs.python.org/3/library/difflib.html – Charlie Parker Jul 21 '22 at 22:31
I want to change one string to the other only on the "add commands". How does one do that with their api? – Charlie Parker Jul 22 '22 at 13:45
@dawg In your 5th example, the indices refer to the first string, but I would like to get them in the second string. Is this possible? E.g., "k" was embedded in position 4 in the second string. I tried doing difflib.ndiff(b, a), but it seems I have the same issue. For instance, when a single character is replaced, I see one addition change in index i and then one deletion change in index i+1, but that does not make sense. – KLaz Sep 16 '22 at 10:55

score 52 · Answer 2 · edited Oct 04 '18 at 08:57

52

I like the ndiff answer, but if you want to spit it all into a list of only the changes, you could do something like:

import difflib

case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'

output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']

edited Oct 04 '18 at 08:57

ewerybody

1,443
16
29

answered Sep 21 '17 at 16:51

Eric

709
6
5

3

This is just what I was Googling for. One quick note, @Eric, your variables don't match as shown today, 20180905. Either 1) change the last line to `output_list = [li for li in list(difflib.ndiff(case_a,case_b)) if li[0] != ' ']` or 2) Change the string variables' names as `case_a -> a` and `case_b -> b`. Cheers! – bballdave025 Sep 05 '18 at 18:03
4

It might also be helpful to show the output of your command: `>>> output_list` ; #result# `['- b', '+ a', '+ m']` – bballdave025 Sep 05 '18 at 18:06
2

`if not li.startswith(' ')` is the eqivalent of `if li[0] != ' '` Some may find it more legible. Or even `if item.startswith(('-', '+', ))` – dmmfll Aug 24 '19 at 13:19
@DMfll Downvote. Lists don't have `startswith()` as of python `3.7.4` – Nathan majicvr.com Sep 01 '19 at 01:28
2

@Nathan you can't downvote comments, and while you're correct that lists don't have startswith, you've completely screwed up on the type. `li` is a string, not a list, and strings have had `startswith()` since at least 2012. Strings are indexable in the same way as arrays, because a string is essentially a glorified `unsigned (w)char[]`, which in fact is an array. Try running the code - `output_list = [li for li in difflib.ndiff(case_a, case_b) if not li.startswith(' ')]` works perfectly fine for me on 3.9 – Zoe Jul 20 '21 at 19:34
1

docs for python 3 https://docs.python.org/3/library/difflib.html – Charlie Parker Jul 21 '22 at 22:31

score 5 · Answer 3 · answered Jul 28 '13 at 03:26

You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:

import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni', 
            'nieafrykanerskojezyczni' ]
for q in queries:
    m = regex.search(r'(%s){e<=2}'%q, sequence)
    print 'match' if m else 'nomatch'

score 2 · Answer 4 · answered Sep 21 '16 at 23:16

What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojęzyczny.

Caveats are zdict is only supported in python 3.3 and higher, and it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.

score 2 · Answer 5 · answered Feb 26 '22 at 07:36

You might find the tools available in the NLTK library useful for calculating the difference between different words.

nltk.metrics.distance.edit_distance() is a mature (non-standard) library implementation that calculates the Levenshtein distance

A simple example might be:

from nltk.metrics.distance import *

w1 = 'wordone'
w2 = 'wordtwo'
edit_distance(w1, w2)

Out: 3

Additional parameter allow the output to be weighted, depending on the costs of different actions (substitutions/insertions) and different character differences (e.g. less cost for characters closer on the keyboard).

score 0 · Answer 6 · edited Jul 20 '21 at 19:08

0

The answer to my comment above on the Original Question makes me think this is all he wants:

loopnum = 0
word = 'afrykanerskojęzyczny'
wordlist = ['afrykanerskojęzycznym','afrykanerskojęzyczni','nieafrykanerskojęzyczni']
for i in wordlist:
    wordlist[loopnum] = word
    loopnum += 1

This will do the following:

For every value in wordlist, set that value of the wordlist to the origional code.

All you have to do is put this piece of code where you need to change wordlist, making sure you store the words you need to change in wordlist, and that the original word is correct.

edited Jul 20 '21 at 19:08

Zoe

27,060
21
118
148

answered Jul 28 '13 at 04:00

Elias Benevedes

363
1
8
26

Thanks, but actually I'd like to store words like 'nieafrykanerskojęzyczni' in a memory efficient way, using similarity to 'afrykanerskojęzyczny'. – user2626682 Jul 28 '13 at 10:35

score 0 · Answer 7 · answered Mar 11 '23 at 18:04

I found this problem interesting and even if this is an old question, I wanted to try developing a solution without any external library.

import pytest

@pytest.mark.parametrize(
    "s1,s2,expected",
    [
        ("", "", ["", ""]),
        ("a", "a", ["", ""]),
        ("a", "b", ["a", "b"]),
        ("ac", "bc", ["a", "b"]),
        ("ca", "cb", ["a", "b"]),
        ("this is a message", "this is b message", ["a", "b"]),
        ("this is a huge message", "this is b message", ["a huge", "b"]),
    ],
)
def test_string_diff(s1, s2, expected):
    assert string_diff(s1, s2) == expected


def string_diff(s1: str, s2: str) -> str:
    d1, d2 = [], []
    i1 = i2 = 0
    l1 = len(s1)
    l2 = len(s2)
    while True:
        if i1 >= len(s1) or i2 >= len(s2):
            d1.extend(s1[i1:])
            d2.extend(s2[i2:])
            break
        if s1[i1] != s2[i2]:
            e1 = l1 - i1
            e2 = l2 - i2
            if e1 > e2:
                d1.append(s1[i1])
                i2 -= 1
            elif e1 < e2:
                d2.append(s2[i2])
                i1 -= 1
            else:
                d1.append(s1[i1])
                d2.append(s2[i2])
        i1 += 1
        i2 += 1
    return ["".join(d1), "".join(d2)]

Python - difference between two strings

7 Answers7

Linked

Related