I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word afrykanerskojęzyczny
and many of words like afrykanerskojęzycznym
, afrykanerskojęzyczni
, nieafrykanerskojęzyczni
. What is the effective (fast and giving small diff size) solution to find difference between two strings and restore second string from the first one and diff?

- 3,281
- 7
- 29
- 64

- 1,379
- 2
- 9
- 3
-
1What do you mean by **"restore the second string from the first one and diff"?** – jrd1 Jul 28 '13 at 01:29
-
2I believe he means "Make the second string the same as the first". – Elias Benevedes Jul 28 '13 at 01:39
-
1@EliasBenevedes, exactly :). – user2626682 Jul 28 '13 at 01:44
-
1Are you looking for something like `difflib`? If so, see, e.g., http://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline – torek Jul 28 '13 at 02:35
-
I want to change one string to the other only on the "add commands". How does one do that with their api? – Charlie Parker Jul 22 '22 at 13:45
7 Answers
You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.
A simple example:
import difflib
cases=[('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
('abcdefg','xac')]
for a,b in cases:
print('{} => {}'.format(a,b))
for i,s in enumerate(difflib.ndiff(a, b)):
if s[0]==' ': continue
elif s[0]=='-':
print(u'Delete "{}" from position {}'.format(s[-1],i))
elif s[0]=='+':
print(u'Add "{}" to position {}'.format(s[-1],i))
print()
prints:
afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20
afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2
afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20
nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16
abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7
-
28+1 Python has *so* many useful modules. It seems that I learn about a new one each day. – arshajii Jul 28 '13 at 04:29
-
1This is stepping through the difference manually; restoring the different between the two strings, of course, is much easier with [difflib.restore](http://docs.python.org/2/library/difflib.html#difflib.restore) – dawg Jul 28 '13 at 04:48
-
Thanks! But i'm not sure if this is memory efficient. list(difflib.ndiff("afrykanerskojęzyczny","nieafrykanerskojęzyczny")) ['+ n', '+ i', '+ e', ' a', ' f', ' r', ' y', ' k', ' a', ' n', ' e', ' r', ' s', ' k', ' o', ' j', ' ę', ' z', ' y', ' c', ' z', ' n', ' y'] – user2626682 Jul 28 '13 at 11:08
-
1`ndiff` is a generator so it is quite memory efficient. You are calling `list` on it which turns the individually generated character comparisons into a full list of them. You would only have a few in memory at a time if you did't call `list` on it. – dawg Jul 28 '13 at 13:54
-
Ok, but how can I save ndiff results to a list and save this list to a file? It is not possible to pickle generators in python. – user2626682 Jul 28 '13 at 14:43
-
Write the file as you go, then you do not care, right? Only save the differences, and, if the differences are small, that is its own compression. At a certain point -- the differences are what they are and they take up space. Skip the parts that are not differences. – dawg Jul 28 '13 at 20:46
-
Ok, but how can I convert ndiff results (generator object) to small object (smaller then list(ndiff))something that I can pickle? Should I convert to a list, and replace characters without "+" or "-" with None? It won't save size in bytes. If I remove chars completely i'll miss some information. – user2626682 Jul 28 '13 at 22:46
-
Just keep track of changes in a list. You don't need to keep the characters that are the same. To convert `afrykanerskojęzyczny => afrykanerskojęzycznym` just keep a list that says, in essence, 'delete "m" at position 20' You only need to keep string 1 and the net differences to produce string 2. It is all there for you. – dawg Jul 29 '13 at 15:09
-
I tried this with `case = [('cat', 'bat')]` and it starts going crazy and giving wrong positions... – Seekheart May 06 '16 at 15:06
-
I just tried that and I get `cat => bat Delete "c" from position 0 Add "b" to position 1` which is what is expected... Python 3.5.1 – dawg May 06 '16 at 17:11
-
1Works on Python 2 as well (for me) I would suggest asking a question with the specific source and specific output. I cannot debug in comments... – dawg May 06 '16 at 17:24
-
**NOTE:** anagram-ish strings like `amalgam` and `malaga` may give hard-to-interpret results – Nathan majicvr.com Sep 01 '19 at 01:24
-
docs for python 3: https://docs.python.org/3/library/difflib.html – Charlie Parker Jul 21 '22 at 22:31
-
I want to change one string to the other only on the "add commands". How does one do that with their api? – Charlie Parker Jul 22 '22 at 13:45
-
@dawg In your 5th example, the indices refer to the first string, but I would like to get them in the second string. Is this possible? E.g., "k" was embedded in position 4 in the second string. I tried doing difflib.ndiff(b, a), but it seems I have the same issue. For instance, when a single character is replaced, I see one addition change in index i and then one deletion change in index i+1, but that does not make sense. – KLaz Sep 16 '22 at 10:55
I like the ndiff answer, but if you want to spit it all into a list of only the changes, you could do something like:
import difflib
case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'
output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']
-
3This is just what I was Googling for. One quick note, @Eric, your variables don't match as shown today, 20180905. Either 1) change the last line to `output_list = [li for li in list(difflib.ndiff(case_a,case_b)) if li[0] != ' ']` or 2) Change the string variables' names as `case_a -> a` and `case_b -> b`. Cheers! – bballdave025 Sep 05 '18 at 18:03
-
4It might also be helpful to show the output of your command: `>>> output_list` ; #result# `['- b', '+ a', '+ m']` – bballdave025 Sep 05 '18 at 18:06
-
2`if not li.startswith(' ')` is the eqivalent of `if li[0] != ' '` Some may find it more legible. Or even `if item.startswith(('-', '+', ))` – dmmfll Aug 24 '19 at 13:19
-
@DMfll Downvote. Lists don't have `startswith()` as of python `3.7.4` – Nathan majicvr.com Sep 01 '19 at 01:28
-
2@Nathan you can't downvote comments, and while you're correct that lists don't have startswith, you've completely screwed up on the type. `li` is a string, not a list, and strings have had `startswith()` since at least 2012. Strings are indexable in the same way as arrays, because a string is essentially a glorified `unsigned (w)char[]`, which in fact is an array. Try running the code - `output_list = [li for li in difflib.ndiff(case_a, case_b) if not li.startswith(' ')]` works perfectly fine for me on 3.9 – Zoe Jul 20 '21 at 19:34
-
1docs for python 3 https://docs.python.org/3/library/difflib.html – Charlie Parker Jul 21 '22 at 22:31
You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:
import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni',
'nieafrykanerskojezyczni' ]
for q in queries:
m = regex.search(r'(%s){e<=2}'%q, sequence)
print 'match' if m else 'nomatch'

- 94,503
- 21
- 155
- 181
What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj
and zlib.decompressobj
with the zdict
parameter set to your "base word", e.g. afrykanerskojęzyczny
.
Caveats are zdict
is only supported in python 3.3 and higher, and it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.

- 119
- 1
You might find the tools available in the NLTK library useful for calculating the difference between different words.
nltk.metrics.distance.edit_distance()
is a mature (non-standard) library implementation that calculates the Levenshtein distance
A simple example might be:
from nltk.metrics.distance import *
w1 = 'wordone'
w2 = 'wordtwo'
edit_distance(w1, w2)
Out: 3
Additional parameter allow the output to be weighted, depending on the costs of different actions (substitutions/insertions) and different character differences (e.g. less cost for characters closer on the keyboard).

- 1,597
- 1
- 14
- 29
The answer to my comment above on the Original Question makes me think this is all he wants:
loopnum = 0
word = 'afrykanerskojęzyczny'
wordlist = ['afrykanerskojęzycznym','afrykanerskojęzyczni','nieafrykanerskojęzyczni']
for i in wordlist:
wordlist[loopnum] = word
loopnum += 1
This will do the following:
For every value in wordlist, set that value of the wordlist to the origional code.
All you have to do is put this piece of code where you need to change wordlist, making sure you store the words you need to change in wordlist, and that the original word is correct.

- 27,060
- 21
- 118
- 148

- 363
- 1
- 8
- 26
-
Thanks, but actually I'd like to store words like 'nieafrykanerskojęzyczni' in a memory efficient way, using similarity to 'afrykanerskojęzyczny'. – user2626682 Jul 28 '13 at 10:35
I found this problem interesting and even if this is an old question, I wanted to try developing a solution without any external library.
import pytest
@pytest.mark.parametrize(
"s1,s2,expected",
[
("", "", ["", ""]),
("a", "a", ["", ""]),
("a", "b", ["a", "b"]),
("ac", "bc", ["a", "b"]),
("ca", "cb", ["a", "b"]),
("this is a message", "this is b message", ["a", "b"]),
("this is a huge message", "this is b message", ["a huge", "b"]),
],
)
def test_string_diff(s1, s2, expected):
assert string_diff(s1, s2) == expected
def string_diff(s1: str, s2: str) -> str:
d1, d2 = [], []
i1 = i2 = 0
l1 = len(s1)
l2 = len(s2)
while True:
if i1 >= len(s1) or i2 >= len(s2):
d1.extend(s1[i1:])
d2.extend(s2[i2:])
break
if s1[i1] != s2[i2]:
e1 = l1 - i1
e2 = l2 - i2
if e1 > e2:
d1.append(s1[i1])
i2 -= 1
elif e1 < e2:
d2.append(s2[i2])
i1 -= 1
else:
d1.append(s1[i1])
d2.append(s2[i2])
i1 += 1
i2 += 1
return ["".join(d1), "".join(d2)]

- 383
- 3
- 11