
As stated by most spelling-correction tutorials, the correct word Ŵ for an incorrectly spelled word X is:

Ŵ = argmax_W P(X|W) · P(W)

where P(X|W) is the likelihood and P(W) is the language model.

In the tutorial from which I am learning spelling correction, the instructor says that P(X|W) can be computed using a confusion matrix, which keeps track of how many times a letter in our corpus is mistakenly typed for another letter. I am using the World Wide Web as my corpus, and it can't be guaranteed that a letter was mistakenly typed for another letter. So is it okay if I use the Levenshtein distance between X and W instead of the confusion matrix? Does it make much of a difference?

The way I am going to compute the Levenshtein distance in Python is this:

from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a similarity score between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()
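
For example, close spellings score near 1 and unrelated strings near 0 (the first value below is approximate):

print(similar('recieve', 'receive'))  # ~0.86
print(similar('recieve', 'banana'))   # 0.0 (no characters in common)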

See this

And here's the tutorial to make my question clearer: Click here

PS: I am working with Python.

tenstar

2 Answers


There are a few things to say.

  1. The model you are using to predict the most likely correction is a simple, cascaded probability model: there is a probability for W to be entered by the user, and a conditional probability for the misspelling X to appear when W was meant. The correct terminology for P(X|W) is conditional probability, not likelihood. (A likelihood is used when estimating how well a candidate probability model matches given data, so it plays a role when you machine-learn a model, not when you apply a model to predict a correction.)

  2. If you were to use Levenshtein distance for P(X|W), you would get integers between 0 and the length of the longer of W and X. This would not be suitable, because you are supposed to use a probability, which has to be between 0 and 1. Even worse, the value grows the more different the candidate is from the input. That's the opposite of what you want. (A sketch of how to turn the distance into a similarity follows this list.)

  3. However, fortunately, SequenceMatcher.ratio() is not actually an implementation of Levenshtein distance. It is an implementation of a similarity measure and returns values between 0 and 1; the closer to 1, the more similar the two strings are. So this makes sense.

  4. Strictly speaking, you would have to verify that SequenceMatcher.ratio() is actually suitable as a probability measure. For that, you would have to check whether the ratios over all possible misspellings of W sum to 1. This is certainly not the case with SequenceMatcher.ratio(), so it is not a mathematically valid choice.

  5. However, it will still give you reasonable results, and I'd say it can be used for a practical and prototypical implementation of a spell-checker. There is a performance concern, though: since SequenceMatcher.ratio() is applied to a pair of strings (a candidate W and the user input X), you might have to apply it to a huge number of possible candidates coming from the dictionary to select the best match. That will be very slow when your dictionary is large. To improve this, you'll need to implement your dictionary using a data structure that has approximate string search built into it. You may want to look at this existing post for inspiration (it's for Java, but the answers include suggestions of general algorithms). A sketch of the scoring step follows below.
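
To make points 2, 3, and 5 concrete, here is a small sketch. The candidate shortlist and the 1 - distance / max(length) normalization are illustrative assumptions, not something prescribed by the model:

from difflib import SequenceMatcher

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    # (insertion, deletion, substitution each cost 1).
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def lev_similarity(a, b):
    # Point 2: map the raw distance into [0, 1], so identical strings
    # score 1.0. (Like ratio(), this does not sum to 1 over all
    # misspellings, so point 4 still applies.)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Point 5: score only a shortlist of candidates, not the whole dictionary.
x = 'recieve'                                  # user input
candidates = ['receive', 'recipe', 'believe']  # illustrative shortlist
best = max(candidates, key=lambda w: SequenceMatcher(None, x, w).ratio())
print(best, lev_similarity(x, best))           # receive 0.714...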

jogojapan
  • So in conclusion, can I use `SequenceMatcher.ratio()` for my purpose? – tenstar Jul 21 '13 at 09:46
  • @tenstar Yes. Sorry if that wasn't clear. My only real concern is that you'll get performance (read: speed) issues when your dictionary is large. – jogojapan Jul 21 '13 at 09:49
  • Yes, but if I generate only a few candidates with top probabilities, then I can tackle the performance issues, right? – tenstar Jul 21 '13 at 09:50
  • @tenstar Yes, that's right. (In that case, I wonder if the method you use to generate these candidates could not perhaps be modified so it generates some kind of similarity score along with each candidate. If so, you wouldn't need `SequenceMatcher.ratio()` any more.) – jogojapan Jul 21 '13 at 09:53
  • So you mean that my solution of generating a few candidates will work, even if I am building something similar to Google? – tenstar Jul 21 '13 at 09:55
  • Oh wait, maybe I misunderstood. How exactly do you generate a shortlist of candidates? After your previous comment I thought you have some kind of smart dictionary lookup that gives you only a few candidates, and _then_ you compute the `SequenceMatcher.ratio()` for them. But if you apply `SequenceMatcher.ratio()` to _all_ entries of the dictionary, it will be very slow. – jogojapan Jul 21 '13 at 09:58
  • I won't apply it to *all* entries, I will apply it only to the shorter list of candidates, but will it be fast even if I use it for a Google-scale app? – tenstar Jul 21 '13 at 10:00
  • Hmmm. First of all, of course it depends on how many candidates there are in the short list. But more importantly, for real Google scale you are dealing with thousands of queries per second. Scaling to that amount means distributing across multiple servers. You can reduce the number of servers if you have a super-fast implementation, and even though `SequenceMatcher.ratio()` might be pretty fast (I haven't measured), I am sure there is plenty of room for optimization. So, yes, if you have enough hardware, you could get Google scale, but with more optimization it would be much less expensive. – jogojapan Jul 21 '13 at 10:08
  • Yes, thank you, I'll accept your answer. By the way, I hope you saw the video tutorial I was referring to before answering my question, so that you understand what I mean. – tenstar Jul 21 '13 at 10:11
  • @tenstar I haven't watched _all_ of them. But there is one in the spelling correction section that describes what they call the channel model, and possible alternatives. Of course `SequenceMatcher.ratio()` won't enable _all_ of the things they mention. The only thing I am saying above is that what you suggest is a possible and practical way of implementing that model, and that for all I can tell, results will be fairly good. – jogojapan Jul 21 '13 at 10:24
  • @jogojapan P(X|W) is widely known in the literature as the Likelihood of data X given model W - I don't think you can assert that "conditional probability" is the correct and by implication only terminology here. "Conditional probability" doesn't discern what is being conditioned on and is ambiguous (it could also refer to the Posterior - which is in fact I think what the OP really wants to determine, unless the priors P(W) are uniform etc. - or indeed any other conditional probability in the Universe.) [Sorry to be picky - I am not trying to start a fight but rather to provide clarity.] – jtlz2 Aug 05 '19 at 09:42

Yes, it is OK to use Levenshtein distance instead of a corpus of misspellings. Unless you are Google, you will not get access to a large and reliable enough corpus of misspellings. There are many other metrics that will do the job. I have used Levenshtein distance weighted by the distance of the differing letters on a keyboard. The idea is that abc is closer to abx than to abp, because x is closer to c on my keyboard than p is. Another option involves accounting for swapped characters: swap is a more likely correction of sawp than saw, because this is how people type. They often swap the order of characters, but it takes some real talent to type saw and then randomly insert a p at the end. A sketch of both ideas follows below.
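
Here is a sketch of both ideas, assuming a plain QWERTY grid; the key coordinates, the cost scaling, and the swap cost are illustrative choices, not a standard:

QWERTY_ROWS = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
KEY_POS = {ch: (row, col)
           for row, keys in enumerate(QWERTY_ROWS)
           for col, ch in enumerate(keys)}

def key_distance(a, b):
    # Euclidean distance between two keys; unknown characters cost the maximum.
    if a not in KEY_POS or b not in KEY_POS:
        return 1.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def keyboard_edit_distance(s, t):
    # Levenshtein distance where substituting nearby keys is cheaper,
    # and swapping two adjacent characters counts as one cheap edit.
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                sub = 0.0
            else:
                # Scale substitution cost by key distance, capped at 1.
                sub = min(1.0, key_distance(s[i - 1], t[j - 1]) / 4.0)
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + sub)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 0.5)  # swap
    return d[m][n]

# abc is closer to abx than to abp: x is next to c, p is far away.
print(keyboard_edit_distance('abc', 'abx') < keyboard_edit_distance('abc', 'abp'))  # True
# swap is a cheap correction of sawp thanks to the swap rule.
print(keyboard_edit_distance('sawp', 'swap'))  # 0.5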

The rules above are called an error model: you are trying to leverage knowledge of how real-world spelling mistakes occur to help with your decision. You can (and people have) come up with really complex rules. Whether they make a difference is an empirical question; you need to try and see. Chances are some rules will work better for some kinds of misspellings and worse for others. Google "how does aspell work" for more examples.

PS: All of the example mistakes above are purely due to the use of a keyboard. Sometimes, people do not know how to spell a word; this is a whole other can of worms. Google Soundex.
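
For illustration, here is a minimal sketch of the classic Soundex encoding (simplified; real implementations handle a few more edge cases). Words that sound alike map to the same four-character code:

def soundex(word):
    # Map consonants to digit classes; vowels and h/w/y get no code.
    groups = ('bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r')
    code_for = {ch: str(i + 1) for i, grp in enumerate(groups) for ch in grp}
    word = word.lower()
    result = word[0].upper()
    prev = code_for.get(word[0], '')
    for ch in word[1:]:
        code = code_for.get(ch, '')
        if code and code != prev:
            result += code
        if ch not in 'hw':  # h and w do not break a run of equal codes
            prev = code
    return (result + '000')[:4]  # first letter + three digits, zero-padded

print(soundex('Robert'), soundex('Rupert'))  # R163 R163
print(soundex('sound'), soundex('sownd'))    # S530 S530, same code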

mbatchkarov