
Recently I've looked through several spell-checker algorithms, including simple ones (like Peter Norvig's) and much more complex ones (like Brill and Moore's). But there's a type of error that none of them can handle. If, for example, I type stackoverflow instead of stack overflow, these spell checkers will fail to correct the typo (unless stack overflow is in the dictionary of terms). Storing all pairs of words is too expensive (and it would not help if the error is three words run together without spaces). Is there an algorithm that can correct this type of error, in addition to the usual typos?

Some examples of what I need:
spel checker -> spell checker
spellchecker -> spell checker
spelcheker -> spell checker

StuffHappens

3 Answers


I hacked up Norvig's spell corrector to do this. I had to cheat a bit and add the word 'checker' to Norvig's data file, because it never appears there. Without that cheating, the problem is really hard.

expertsexchange expert exchange
spel checker spell checker
spellchecker spell checker
spelchecker she checker  # can't win them all
baseball base all  # baseball isn't in the dictionary either :(
hewent he went

Basically you need to change the code so that:

  • you add the space character to the alphabet, so the edit operations automatically explore word breaks;
  • you treat a candidate as valid when every word it is composed of is in the dictionary, instead of testing dictionary membership of the whole string directly (the dictionary contains no phrases);
  • you need a way to score a multi-word phrase against a plain word.

The last point is the trickiest. I use a braindead independence assumption for phrase composition: the probability of two adjacent words is the product of their individual probabilities (computed here as a sum in log-probability space), plus a small penalty per inserted space. I am sure that in practice you'll want to keep some bigram statistics to do the splitting well; a sketch of that follows the code below.

import re, collections, math

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
  # Count word frequencies and convert them to log probabilities;
  # unseen words fall back to a small smoothed probability of .1/tot.
  counts = collections.defaultdict(lambda: 1.0)
  for f in features:
    counts[f] += 1.0
  tot = float(sum(counts.values()))
  model = collections.defaultdict(lambda: math.log(.1 / tot))
  for f in counts:
    model[f] = math.log(counts[f] / tot)
  return model

NWORDS = train(words(open('big.txt').read()))

# The space is part of the alphabet, so edits can introduce word breaks.
alphabet = 'abcdefghijklmnopqrstuvwxyz '

def valid(w):
  # A candidate is valid when it is non-empty and every space-separated
  # word in it is in the dictionary (the non-empty check guards against
  # the empty string that deleting from a one-letter word produces).
  ws = w.split()
  return bool(ws) and all(s in NWORDS for s in ws)

def score(w):
  # Independence assumption: the log probability of a phrase is the sum
  # of its words' log probabilities, minus a penalty per inserted space.
  return sum(NWORDS[s] for s in w.split()) - w.count(' ')

def edits1(word):
  # All strings one edit away (delete, transpose, replace, insert),
  # where replacements and insertions include the space character.
  splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
  deletes    = [a + b[1:] for a, b in splits if b]
  transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
  replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
  inserts    = [a + c + b     for a, b in splits for c in alphabet]
  return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
  # Valid candidates exactly two edits away.
  return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if valid(e2))

def known(words): return set(w for w in words if valid(w))

def correct(word):
  candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
  return max(candidates, key=score)

def t(w):
  print(w, correct(w))

t('expertsexchange')
t('spel checker')
t('spellchecker')
t('spelchecker')
t('baseball')
t('hewent')
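
As a hedged illustration of the bigram statistics mentioned above, you could train pair counts with the same train() function and interpolate them into the score. The names BIGRAMS, score2, and the weight LAMBDA are assumptions for this sketch, not part of Norvig's code:

# Sketch only: bigram log probabilities from the same corpus, blended
# into the phrase score. LAMBDA is an untuned, illustrative weight.
tokens = words(open('big.txt').read())
BIGRAMS = train(a + ' ' + b for a, b in zip(tokens, tokens[1:]))

def score2(w):
  ws = w.split()
  unigrams = sum(NWORDS[s] for s in ws)
  bigrams  = sum(BIGRAMS[a + ' ' + b] for a, b in zip(ws, ws[1:]))
  LAMBDA = 0.5  # interpolation weight; would need tuning on held-out data
  return unigrams + LAMBDA * bigrams - w.count(' ')

Swapping score for score2 in correct() (i.e. max(candidates, key=score2)) would then prefer splits whose word pairs actually co-occur in the corpus, such as spell checker.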
Rob Neuhaus

This problem is very similar to compound splitting as applied to German or Dutch, and also to noisy English data. See Monz & De Rijke for a very simple algorithm (which I think can be implemented as a finite-state transducer for efficiency), and Google for "compound splitting" and "decompounding". A rough sketch of a dictionary-driven splitter in that spirit follows.
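
This is not Monz & De Rijke's algorithm, just a minimal sketch of dictionary-driven decompounding under stated assumptions: the toy FREQ table stands in for real corpus counts, and the best split maximizes the summed log probabilities of its parts.

import math

# Toy frequency table; a real splitter would use corpus counts.
FREQ = {'stack': 300, 'overflow': 80, 'spell': 200, 'checker': 50,
        'he': 5000, 'went': 900}
TOTAL = float(sum(FREQ.values()))

def split_score(word, memo=None):
    # Return (score, parts): the best segmentation of `word` into
    # dictionary words, scored by summed log probabilities. The score
    # is -inf when no full segmentation exists.
    if memo is None:
        memo = {}
    if word in memo:
        return memo[word]
    best = (float('-inf'), [word])
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head not in FREQ:
            continue
        head_score = math.log(FREQ[head] / TOTAL)
        if i == len(word):
            cand = (head_score, [head])
        else:
            tail_score, tail_parts = split_score(word[i:], memo)
            cand = (head_score + tail_score, [head] + tail_parts)
        if cand[0] > best[0]:
            best = cand
    memo[word] = best
    return best

print(split_score('stackoverflow'))  # (..., ['stack', 'overflow'])
print(split_score('hewent'))         # (..., ['he', 'went'])

Thanks to the memo, each substring is solved once, so the cost is roughly quadratic in the word length; combining it with edit-distance matching on the parts would also handle misspelled compounds like spelcheker.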

Fred Foo

I sometimes get such suggestions when spell-checking in Kate, so there certainly is an algorithm that can correct some of these errors. I'm sure one can do better, but one idea is to split the candidate at likely places and check whether close matches exist for the components. The hard part is deciding what the likely places are. In the languages I'm somewhat familiar with, there are letter combinations that occur rarely inside words. For example, the combinations dk or lh are, as far as I'm aware, rare in English words. Other combinations often occur at the start of words (e.g. un, ch), so those would be good guesses for splitting too. In the example spelcheker, the lc combination is not too widespread, and ch is a common start of words, so splitting into spel and cheker is a prime candidate, and any decent algorithm would then find spell and checker (but it would probably also find spiel, so don't auto-correct, just give suggestions). A sketch of that split-point heuristic follows.
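
A minimal sketch of that heuristic; the RARE_PAIRS and COMMON_STARTS sets merely encode the combinations mentioned above and are illustrative guesses, not corpus-derived statistics:

# Propose split points where the letter pair straddling the boundary is
# rare inside English words, or where the right half begins with a
# common word-initial combination.
RARE_PAIRS = {'dk', 'lh', 'lc'}
COMMON_STARTS = {'un', 'ch'}

def likely_splits(word):
    # Yield (left, right) candidate splits; each half would then be
    # handed to an ordinary spell checker to find close matches.
    for i in range(1, len(word)):
        boundary = word[i - 1] + word[i]
        if boundary in RARE_PAIRS or word[i:i + 2] in COMMON_STARTS:
            yield word[:i], word[i:]

print(list(likely_splits('spelcheker')))  # [('spel', 'cheker')]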

Daniel Fischer