Python: how many similar words in string?

Question

I have some ugly strings similar to these:

   string1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
   string2 = 'Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)'

I would like a library or algorithm that will give me a percentage of how many words they have in common, while excluding special characters such as ',' and ':' and ''' and '{' etc.

I know of the Levenshtein algorithm. However, it compares how many CHARACTERS are similar, whereas I would like to compare how many WORDS the strings have in common.

The Levenshtein algorithm works on any 2 sequences of comparable objects ... another way of putting it: so long as `a[i] == b[j]` is defined and meaningful. — John Machin, Aug 25 '10 at 03:52

Nick T · Accepted Answer · 2010-08-25T03:14:56.667

Regex could easily give you all the words:

import re
s1 = "Fantini, Rauch, C.Straus, Priuli, Bertali: 'Festival Mass at the Imperial Court of Vienna, 1648' (Yorkshire Bach Choir & Baroque Soloists + Baroque Brass of London/Seymour)"
s2 = "Vinci, Leonardo {c.1690-1730}: Arias from Semiramide Riconosciuta, Didone Abbandonata, La Caduta dei Decemviri, Lo Cecato Fauzo, La Festa de Bacco, Catone in Utica. (Maria Angeles Peters sop. w.M.Carraro conducting)"
# Raw strings keep the backslash in \w from being treated as a string escape.
s1w = re.findall(r'\w+', s1.lower())
s2w = re.findall(r'\w+', s2.lower())

collections.Counter (Python 2.7+) can quickly count up the number of times a word occurs.

from collections import Counter
s1cnt = Counter(s1w)
s2cnt = Counter(s2w)
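As a minimal sketch of how the two counters could be turned into a percentage: `Counter` supports `&`, which keeps each word with the smaller of its two counts. The helper name `word_overlap_percent` and the choice of denominator (the word count of the shorter string) are assumptions for illustration, not something the question specifies.

```python
import re
from collections import Counter

def word_overlap_percent(a, b):
    """Percentage of words shared by a and b, relative to the shorter string."""
    a_counts = Counter(re.findall(r'\w+', a.lower()))
    b_counts = Counter(re.findall(r'\w+', b.lower()))
    # Counter & Counter keeps each word with the minimum of its two counts.
    shared = sum((a_counts & b_counts).values())
    # Assumption: normalise by the shorter string's word count.
    shortest = min(sum(a_counts.values()), sum(b_counts.values()))
    if shortest == 0:
        return 0.0
    return 100.0 * shared / shortest

print(word_overlap_percent("Bach: Mass in B minor", "Mass in B minor (Bach)"))
```

Because this works on counts, word order is ignored and a repeated word only counts as often as it appears in both strings.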

A very crude comparison could be done with set.intersection or difflib.SequenceMatcher, but it sounds like you want a Levenshtein implementation that operates on words rather than characters; those two lists could be fed to it directly.

common = set(s1w).intersection(s2w)
# the only shared word is 'c': set(['c']) on Python 2, {'c'} on Python 3

import difflib
common_ratio = difflib.SequenceMatcher(None, s1w, s2w).ratio()
print('%.1f%% of words common.' % (100*common_ratio))

Prints: 3.4% of words common.
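For what that ratio means: the difflib documentation defines `ratio()` as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes it by hand from `get_matching_blocks()` on two short word lists; the sample phrases are made up for illustration.

```python
import difflib

a = "festival mass at the imperial court".split()
b = "mass at the court of vienna".split()

matcher = difflib.SequenceMatcher(None, a, b)

# Each matching block is a run of consecutive words found in both lists;
# the final block is a zero-length sentinel, so it adds nothing to the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(matched, manual_ratio, matcher.ratio())
```

Since matches are found as in-order blocks, a word that appears in both strings but in a different relative position may not be counted, which is one reason this is only a crude measure.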

+1 mainly for collections.Counter - another hidden gem of the stdlib. Sadly it's 2.7 so maybe not applicable. — , Aug 24 '10 at 16:56

score 2 · Answer 2 · edited May 23 '17 at 11:55

n = 0
words1 = set(sentence1.split())
for word in sentence2.split():
    # strip some chars here, e.g. as in [1]
    if word in words1:
        n += 1

(1: How to remove symbols from a string with Python?)

Edit: Note that this counts a word as common to both sentences if it appears anywhere in both. To compare by position instead, omit the set conversion (just call split() on both) and use something like:

n = 0
for word_from_1, word_from_2 in zip(sentence1.split(), sentence2.split()):
    # strip some chars here, e.g. as in [1]
    if word_from_1 == word_from_2:
        n += 1
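The "strip some chars here" step could look something like the sketch below, which uses only `str.strip` and `string.punctuation`. The helper name `clean_words` and the two sample sentences are hypothetical, and lower-casing is an extra assumption so that 'Bach' and 'bach' compare equal.

```python
import string

def clean_words(sentence):
    """Split on whitespace, then strip leading/trailing punctuation from each word."""
    words = (word.strip(string.punctuation).lower() for word in sentence.split())
    return [word for word in words if word]  # drop tokens that were only punctuation

sentence1 = "Bach, J.S.: 'Mass in B minor' (Monteverdi Choir/Gardiner)"
sentence2 = "Mass in B minor, BWV 232 (Bach)"

# Same counting loop as the first snippet, written as a generator expression.
words1 = set(clean_words(sentence1))
n = sum(1 for word in clean_words(sentence2) if word in words1)
print(n)
```

Note that `strip` only removes punctuation at the ends of each token, so something like 'Choir/Gardiner' stays as one word; splitting with a regex such as `\w+` would break it apart instead.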

Huh? This uses only built-in functions that are available without importing anything. — , Aug 24 '10 at 16:48

score 2 · Answer 3 · answered Aug 24 '10 at 16:52

The Levenshtein algorithm itself isn't restricted to comparing characters; it can compare any arbitrary objects. The fact that the classical form uses characters is an implementation detail: the elements can be any symbols or constructs that can be compared for equality.

In Python, convert the strings into lists of words then apply the algorithm to the lists. Maybe someone else can help you with cleaning up unwanted characters, presumably using some regular expression magic.
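A minimal sketch of that idea, assuming a plain dynamic-programming edit distance written from scratch (not any particular library's API) and a `\w+` regex for the clean-up. The function names and the sample phrases are made up for illustration.

```python
import re

def levenshtein(a, b):
    """Edit distance between two sequences of items comparable with ==."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def words(text):
    """Lower-case the text and keep only runs of word characters."""
    return re.findall(r'\w+', text.lower())

w1 = words("Festival Mass at the Imperial Court")
w2 = words("Mass at the Court of Vienna")
distance = levenshtein(w1, w2)

# Assumption: normalise by the longer list to get a rough similarity percentage.
similarity = 100.0 * (1 - float(distance) / max(len(w1), len(w2), 1))
print(distance, similarity)
```

The same function still works on plain strings, since a string is just a sequence of characters; passing lists of words is what makes it count word edits instead.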
