
Say I have a list of movie names with misspellings and small variations like this:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates of the carribean"
 "Pirates of the Caribbean: Dead Man's Chest"
 "Pirates of the Caribbean trilogy"
 "Pirates of the Caribbean"
 "Pirates Of The Carribean"

How do I group or find such sets of words, preferably using python and/or redis?


5 Answers


Have a look at "fuzzy matching". The thread linked below has some great tools that calculate similarities between strings.

I'm especially fond of the difflib module:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison
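Applied to the list in the question, here is a minimal sketch of greedy grouping with difflib (my own sketch, not from the linked thread; the 0.6 cutoff is just get_close_matches' default and will need tuning, and lowercasing the titles first helps):

from difflib import get_close_matches

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean: Dead Man's Chest",
    "Pirates of the Caribbean trilogy",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]

groups = []
remaining = list(titles)
while remaining:
    seed = remaining.pop(0)
    # titles similar to the seed among those not yet grouped
    matches = get_close_matches(seed, remaining, n=len(remaining), cutoff=0.6) if remaining else []
    groups.append([seed] + matches)
    remaining = [t for t in remaining if t not in matches]

for group in groups:
    print(group)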

Fredrik Pihl
  • the linked question seems to be deleted. – hardmooth Apr 07 '16 at 12:06
  • So it seems. Once you reach a certain number of reputation points you can still see deleted questions, so I'll leave the link as it is. – Fredrik Pihl Apr 07 '16 at 13:14
  • @FredrikPihl could you please post the definition for `get_close_matches` here (or edit it into the answer) for us unworthy low-reputation peasants? – Bryan K Feb 15 '18 at 01:50
  • It looks like I asked too soon - it's just a method that is part of difflib: https://docs.python.org/2/library/difflib.html#difflib.get_close_matches – Bryan K Feb 15 '18 at 01:52

You might notice that similar strings share a large common substring, for example:

"Bla bla bLa" and "Bla bla bRa" => common substring is "Bla bla ba" (notice the third word)

To find the common substring you can use a dynamic programming algorithm. A closely related measure is the Levenshtein distance (the distance between very similar strings is small, while more different strings are further apart) - http://en.wikipedia.org/wiki/Levenshtein_distance.
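For reference, a minimal sketch of the Levenshtein recurrence as a dynamic program (my own helper, not from this answer):

def levenshtein(a, b):
    """Edit distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("Pirates of the Caribbean", "Pirates of the carribean"))  # 3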

Also, for faster performance you could try to adapt the Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.
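If you go the phonetic route, the third-party jellyfish package (my suggestion; the answer only links to the algorithm) ships a Soundex implementation:

# pip install jellyfish   (third-party package, assumed here)
import jellyfish

print(jellyfish.soundex("Caribbean"))  # C615
print(jellyfish.soundex("Carribean"))  # C615 -- the misspelling gets the same code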

So after calculating the distance between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to specify the number of clusters up front). If you don't actually know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of distinct movie titles + 1 (for completely misspelled strings).
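As a sketch of the hierarchical route (my own combination of difflib distances with SciPy, not something named in this answer), you can feed a pairwise distance matrix to SciPy's linkage/fcluster and cut the tree at a distance threshold instead of fixing the cluster count:

from difflib import SequenceMatcher
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

titles = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the carribean",
    "Pirates of the Caribbean",
    "Pirates Of The Carribean",
]

# pairwise distances: 1 - similarity ratio, case-folded
n = len(titles)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1 - SequenceMatcher(None, titles[i].lower(), titles[j].lower()).ratio()
        dist[i, j] = dist[j, i] = d

# condense the square matrix to the 1-D form SciPy expects, then cluster
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.4, criterion="distance")  # the 0.4 threshold is a guess to tune
print(sorted(zip(labels, titles)))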

stemm
  • Your so-called substring "Bla bla ba" is not a substring in the conventional sense, since "ba" is in neither of your strings. I'd call it a "gapped substring". From the common gapped substring you can get the longest ungapped substring and thus the longest common substring. – hardmooth Apr 08 '16 at 06:02
  • Could you elaborate on how to use k-means or hierarchical clustering on these distances, given that computing the distance between phrases results in a 2D array of pairwise distances? – AAAA May 17 '21 at 14:04

I believe there are in fact two distinct problems.

The first is spell correction. You can find a Python implementation here:

http://norvig.com/spell-correct.html

The second is more functional. Here is what I'd do after the spell correction: build a relation function.

related(sentence1, sentence2) if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than very common ones (the, what, is, etc.). You can take a look at the TF-IDF system to determine whether two documents are related through the words they share. Just googling a bit, I found this:

https://code.google.com/p/tfidf/
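A sketch of that TF-IDF idea with scikit-learn (my choice of library, not the answer's; the google-code link above is the answer's original pointer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the Caribbean",
    "The Shawshank Redemption",
]

# stop_words="english" drops the common words, so only rarer terms drive similarity
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(tfidf)

# related(i, j) iff the cosine similarity clears a threshold (0.3 is an arbitrary choice)
print(sim > 0.3)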

yogsototh

To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this one:

import re  # needed for the title regex below

# Note: BASE_DIR, remove_tags() and dofind() are defined in the linked script.
def dosearch(terms, searchtype, case, adddir, files=[]):
    found = []
    if files is not None:
        titlesrch = re.compile('<title>.*</title>')
        for file in files:
            title = ""
            if not file.lower().endswith(("html", "htm")):
                continue
            filecontents = open(BASE_DIR + adddir + file, 'r').read()
            titletmp = titlesrch.search(filecontents)
            if titletmp is not None:
                # slice out the text between <title> and </title>
                title = filecontents[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents).strip()
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found

Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/

Regards,

Max

JMax

One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
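A minimal sketch of that normalization step (the helper name is my own):

import re
import string

def normalize(title):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    title = title.lower()
    title = title.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", title).strip()

print(normalize("Pirates Of The Carribean"))  # pirates of the carribean
print(normalize("Pirates of the carribean"))  # pirates of the carribean -- now identical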

Levenshtein distance is commonly used to determine the similarity of two strings; this should help you group strings that differ only by small spelling errors.

btk
  • [here](https://python.gotrained.com/nltk-edit-distance-jaccard-distance/#Edit_Distance) is a link that provides a good explanation of what this answer is trying to talk about. – Fatemeh Rahimi May 20 '19 at 16:55