
I'm trying to understand how Peter Norvig's spelling corrector works.

In his Jupyter notebook here he explains how to segment a sequence of characters that has no spaces separating the words. It works correctly when all the words in the sequence are spelled correctly:

>>> segment("deeplearning")
['deep', 'learning']

But when a word (or several words) in the sequence is misspelled, it works incorrectly:

>>> segment("deeplerning")
['deep', 'l', 'erning']

Unfortunately, I have no idea how to fix this and make the segment() function work with concatenated words that contain misspellings.

Does anybody have an idea how to deal with this problem?


This can be achieved with minor changes to Peter Norvig's algorithm. The trick is to add a space character to the alphabet and treat every bigram separated by a space as a unique word.
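To see why the space helps, here is a minimal sketch of Norvig's edits1 function with a space appended to the alphabet (this is the change described later in the answer; the function body itself follows his original notebook):

```python
def edits1(word):
    """All edits that are one edit away from `word` (space included in alphabet)."""
    letters = 'abcdefghijklmnopqrstuvwxyz '  # note the trailing space
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Inserting a space is now a legal single edit, so the two-word form
# becomes reachable from the concatenated misspelling:
print('deep lerning' in edits1('deeplerning'))  # True
```

Once "deep learning" exists in the dictionary as a bigram entry, the usual candidate search (edits1, then edits2) can reach it from "deeplerning".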

Since big.txt doesn't contain the deep learning bigram, we will have to add a bit more text to our dictionary. I will use the wikipedia library (pip install wikipedia) to get more text.

import re
import wikipedia as wiki
import nltk
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") once

unigrams = re.findall(r"\w+", open("big.txt").read().lower())
for title in wiki.search("Deep Learning"):
    try:
        page = wiki.page(title).content.lower()
        # drop non-ASCII characters, then decode back to str for word_tokenize
        page = page.encode("ascii", errors="ignore").decode("ascii")
        unigrams = unigrams + word_tokenize(page)
    except wiki.exceptions.WikipediaException:
        continue  # skip pages that fail to load instead of aborting the loop

I will create a new dictionary with all unigrams and bigrams:

# write one entry per line: all unigrams, then all bigrams joined by a space
with open("new_dict.txt", "w") as fo:
    for u in unigrams:
        fo.write(u + "\n")
    for b in nltk.bigrams(unigrams):
        fo.write(" ".join(b) + "\n")

Now just add a space character to the letters variable in the edits1 function, change big.txt to new_dict.txt, and change this function:

def words(text): return re.findall(r'\w+', text.lower())

to this:

def words(text): return text.split("\n")

and now correction("deeplerning") returns 'deep learning'!
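Putting it all together, here is a self-contained sketch of the modified corrector. The tiny entries list below stands in for the contents of new_dict.txt (real bigram counts come from the corpus); everything else follows Norvig's original function names:

```python
from collections import Counter

# toy stand-in for new_dict.txt: unigrams plus space-separated bigram entries
entries = ["deep", "learning", "deep learning", "machine",
           "machine learning", "deep learning", "deep learning"]
WORDS = Counter(entries)

def P(word, N=sum(WORDS.values())):
    """Probability of `word` (a unigram or a bigram entry)."""
    return WORDS[word] / N

def known(words):
    return set(w for w in words if w in WORDS)

def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz '  # space added to the alphabet
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    return max(candidates(word), key=P)

# "deeplerning" -> insert space, insert 'a': two edits away from the bigram
print(correction("deeplerning"))  # 'deep learning'
```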

This trick will perform well if you need a spelling corrector for a specific domain. If the domain is big, you can try adding only the most common unigrams/bigrams to your dictionary.
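One way to do that filtering (a sketch, not part of the answer above; the cutoff of 1000 is arbitrary) is to count frequencies with collections.Counter before writing new_dict.txt:

```python
from collections import Counter

# toy corpus standing in for the real unigram list built earlier
unigrams = ["deep", "learning", "is", "deep", "learning", "fun", "deep"]
bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]

# keep only the most frequent entries of each kind
top_unigrams = [w for w, _ in Counter(unigrams).most_common(1000)]
top_bigrams = [b for b, _ in Counter(bigrams).most_common(1000)]

with open("new_dict.txt", "w") as fo:
    for entry in top_unigrams + top_bigrams:
        fo.write(entry + "\n")
```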

This question may also help.

Vlad