
Where can I find some real world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:

  1. typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.
  2. Spelling - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).

I want to focus on Damerau-Levenshtein (or simply edit distance). The textbook implementations always use a weight of 1 for deletions, insertions, substitutions, and transpositions. While this is simple and allows for nice algorithms, it doesn't match "reality" / real-world probabilities.

Examples:

  • I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
  • "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
  • Unicode transliterations: What is the "real" distance between "München" and "Munchen"?

What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?

Even Norvig's very cool spell corrector uses non-weighted edit distance.

BTW- I'm sure the weights need to be functions and not simple floats (per the above examples)...
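
To illustrate (not as a final implementation), here is a minimal sketch of what weights-as-functions could look like: a Damerau-Levenshtein variant (the restricted, optimal-string-alignment form) that takes cost functions instead of constants. The QWERTY adjacency grid and the 0.5/1.0 costs are my guesses, not learned statistics:

```python
# Sketch: Damerau-Levenshtein (optimal string alignment variant) with
# cost *functions* instead of constant weights. The 0.5/1.0 values and
# the flat QWERTY grid (real keyboards are staggered) are illustrative.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def _key_pos(c):
    for row, keys in enumerate(QWERTY_ROWS):
        col = keys.find(c)
        if col >= 0:
            return row, col
    return None

def sub_cost(a, b):
    """Cheaper substitution for physically adjacent QWERTY keys."""
    pa, pb = _key_pos(a.lower()), _key_pos(b.lower())
    if (pa is not None and pb is not None
            and abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1):
        return 0.5  # neighbouring keys: likely a fat-finger substitution
    return 1.0

def weighted_dl(s, t, ins=lambda c: 1.0, delete=lambda c: 1.0,
                sub=sub_cost, trans=lambda a, b: 0.5):
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + delete(s[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub(s[i - 1], t[j - 1])
            d[i][j] = min(d[i - 1][j] + delete(s[i - 1]),  # deletion
                          d[i][j - 1] + ins(t[j - 1]),     # insertion
                          d[i - 1][j - 1] + cost)          # substitution
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):             # transposition
                d[i][j] = min(d[i][j],
                              d[i - 2][j - 2] + trans(s[i - 2], s[i - 1]))
    return d[n][m]

print(weighted_dl("Hello", "Gello"))  # 0.5: g is next to h
print(weighted_dl("Hello", "Qello"))  # 1.0: q is nowhere near h
```

The same hooks could cover the other examples: a delete function that discounts doubled letters ("Helllo") and a substitution function that discounts Unicode transliterations ("ü" vs "u").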

I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...

Should I just guess them?

EDIT - trying to answer user questions:

  • My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
  • I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
  • Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. when the input is 1 edit distance away from exactly 1 word in the dictionary (see the sketch after this list). I have many other hand-tuned rules, e.g. a Double Metaphone fix that is no more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
  • "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
  • Today when there are 2 dictionary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
  • The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
  • Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.
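
To make the "auto-correct when obvious" rule concrete, here is a minimal sketch reusing weighted_dl from the sketch above; the threshold is a placeholder and the two-word dictionary is obviously a toy:

```python
def auto_correct(word, dictionary, max_dist=1.0):
    """Correct only when exactly one entry is within max_dist; else give up."""
    hits = [e for e in dictionary if weighted_dl(word, e) <= max_dist]
    if len(hits) == 1:
        return hits[0]  # unambiguous: auto-correct silently
    return None         # zero or several candidates: ask the user to rephrase

# Both candidates fall within the threshold, so we ask instead of guessing:
print(auto_correct("Tursday", ["Thursday", "Tuesday"]))  # None
```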
– Tal Weiss
  • Perhaps MS has conducted a study (though Word's spell correction is not nearly as intelligent, in fact I think it really just checks each spelling against a list of common errors). Also, Google is pretty committed to open source dev, perhaps they'll give you such data if you ask nicely? – Rafe Kettler Aug 05 '10 at 21:49
  • 1
    That Google-scale data is interesting. Is it something one could access and query or is that just an example page? – Caleb Hearth Aug 05 '10 at 21:50
  • 2
    It might help if you took key proximity into account somehow, in your weighting. Typing Hellp is more likely to happen than Hellz because the q key is close to the "correct" o key (assuming QWERTY...) – Jason Hall Aug 06 '10 at 05:20
  • 2
    Although I agree that typo frequency would be useful, finding that frequency data will be difficult, since it's inherently subjective. The problem with "real world" probabilities is that the "real world" is a very big place. Kids in elementary school will have a very different mistake frequency distribution than middle-aged women working in accounts receivable, which will in turn be different than college English professors. Finding the "average" that is appropriate for your problem domain will be no easy task. – Cerin Aug 06 '10 at 13:22
  • That is unless you're on twitter, the great equalizer. – VoronoiPotato Sep 11 '13 at 18:59

5 Answers


A possible source for real-world typo statistics would be Wikipedia's complete edit history.

http://download.wikimedia.org/

Also, you might be interested in AWB's RegExTypoFix:

http://en.wikipedia.org/wiki/Wikipedia:AWB/T
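
For instance, here is a minimal sketch (my assumption about how one might mine it, using only the standard library) that compares two consecutive revisions of a page, already extracted as plain text, and collects one-for-one word replacements as candidate typo pairs:

```python
import difflib
import re

def typo_pairs(old_text, new_text):
    """Candidate (misspelling, correction) pairs from two page revisions."""
    old_words = re.findall(r"\w+", old_text)
    new_words = re.findall(r"\w+", new_text)
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace" and i2 - i1 == j2 - j1:  # 1:1 word swaps only
            pairs.extend(zip(old_words[i1:i2], new_words[j1:j2]))
    # Keep near-misses only; distant swaps are edits, not typo fixes.
    return [(a, b) for a, b in pairs
            if difflib.SequenceMatcher(None, a, b).ratio() > 0.75]

print(typo_pairs("I arrive on Tursday morning",
                 "I arrive on Thursday morning"))
# [('Tursday', 'Thursday')]
```

Aggregating such pairs over many revisions would give the kind of per-edit counts the question asks about.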

– tszming

I would advise you to check out the trigram algorithm. In my opinion it works better for finding typos than the edit distance algorithm. It should work faster as well, and if you keep your dictionary in a Postgres database you can make use of an index.

You may also find the Stack Overflow topic about Google's "Did you mean?" useful.
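
A minimal sketch of the idea, assuming pg_trgm-style word padding and a Jaccard-style similarity (this illustrates the technique, not Postgres's exact implementation):

```python
def trigrams(word):
    """Character trigrams with boundary padding, pg_trgm style."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard similarity of trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

words = ["Thursday", "Tuesday", "Saturday"]
print(max(words, key=lambda w: similarity("Tursday", w)))  # Thursday
```

Note that trigram similarity already favours Thursday over Tuesday for "Tursday", because it rewards the long shared suffix, without any hand-tuned weights.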

– jethro

Probability Scoring for Spelling Correction by Church and Gale might help. In that paper, the authors model typos as a noisy channel between the author and the computer. The appendix has tables for typos seen in a corpus of Associated Press publications. There is a table for each of the following kinds of typos:

  • deletion
  • insertion
  • substitution
  • transposition

For example, examining the insertion table, we can see that l was incorrectly inserted after l 128 times (the highest number in that column). Using these tables, you can generate the probabilities you're looking for.
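
For illustration, here is a minimal sketch of turning such counts into edit costs via negative log-probabilities. Apart from the l-after-l count of 128 mentioned above, every number is a made-up placeholder, not a figure from the paper:

```python
import math

# ins_count[(x, y)]: times y was wrongly inserted after x (placeholders,
# except (l, l) = 128 from the paper's insertion table).
ins_count = {("l", "l"): 128, ("l", "k"): 3, ("e", "e"): 89}
char_count = {"l": 25_000, "e": 70_000}  # hypothetical corpus frequencies

def insertion_cost(prev_char, inserted):
    """Negative log-probability: frequent errors become cheap edits."""
    seen = ins_count.get((prev_char, inserted), 0.5)  # crude smoothing
    return -math.log(seen / char_count.get(prev_char, 10**6))

print(insertion_cost("l", "l"))  # cheap: doubling an l is very common
print(insertion_cost("l", "z"))  # expensive: essentially never observed
```

These costs plug straight into a weighted edit distance as the insertion-weight function the question asks for.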

– mndrix

If research is your interest, I think continuing with that algorithm and trying to find decent weights would be fruitful.

I can't help you with typo stats, but I think you should also play with Python's difflib. Specifically, the ratio() method of SequenceMatcher. It uses an algorithm which the docs (http://docs.python.org/library/difflib.html) claim is well suited to matches that 'look right', and it may be useful to augment or test what you're doing.

For Python programmers just looking for typos it is a good place to start. One of my coworkers has used both Levenshtein edit distance and SequenceMatcher's ratio() and got much better results from ratio().
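
A quick usage sketch; note that difflib.get_close_matches() wraps ratio() for exactly this kind of dictionary lookup:

```python
import difflib

print(difflib.SequenceMatcher(None, "Tursday", "Thursday").ratio())  # ~0.93
print(difflib.get_close_matches("Tursday",
                                ["Thursday", "Tuesday", "Saturday"],
                                n=1, cutoff=0.8))  # ['Thursday']
```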

– JivanAmara

Some questions for you, to help you determine whether you should be asking your "where do I find real-world weights" question:

Have you actually measured the effectiveness of the uniform weighting implementation? How?

How many different "internal objects" do you have -- i.e. what is the size of your dictionary?

How are you actually using the edit distance? E.g. John/Joan, Marmaduke/Marmeduke, Featherstonehaugh/Featherstonhaugh: is that "all 1 error" or is it a 25% / 11.1% / 5.9% difference? What threshold are you using?

How many pairs of dictionary entries are within your threshold (e.g. John vs Joan, Joan vs Juan, etc)? If you introduced a fancy weighting system, how many pairs of dictionary entries would migrate (a) from inside the threshold to outside (b) vice versa?

What do you do if both John and Juan are in your dictionary and the user types Joan?

What are the penalties/costs of (1) choosing the wrong dictionary word (not the one that the user meant) (2) failing to recognise the user's input?

Will introducing a complicated weighting system actually reduce the probabilities of the above two error types by sufficient margin to make the complication and slower speed worthwhile?

BTW, how do you know what keyboard the user was using?

Update:

"""My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance)."""

Yes, Thursday -> Tursday by omitting an "h", but Tuesday -> Tursday by substituting "r" instead of "e". E and R are next to each other on qwERty and azERty keyboards. Every "real person" can easily guess that Thursday is more likely than Tuesday. Even if statistics as well as guesses point to Thursday being more likely than Tuesday (perhaps omitting h will cost 0.5 and e->r will cost 0.75), will the difference (perhaps 0.25) be significant enough to always pick Thursday? Can/will your system ask "Did you mean Tuesday?" or does/will it just plough ahead with Thursday?
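
To make that arithmetic concrete, a minimal sketch using those hypothetical costs (0.5 and 0.75) and a placeholder confidence margin:

```python
# Hypothetical costs from the paragraph above (illustrative only):
# Thursday -> Tursday: omit "h"            -> 0.5
# Tuesday  -> Tursday: substitute e with r -> 0.75
candidates = {"Thursday": 0.5, "Tuesday": 0.75}

best, runner_up = sorted(candidates, key=candidates.get)[:2]
margin = candidates[runner_up] - candidates[best]

CONFIDENT_MARGIN = 0.5  # placeholder: when is a lead "decisive"?
if margin >= CONFIDENT_MARGIN:
    print("Auto-correcting to", best)
else:
    print("Did you mean %s? (margin only %.2f)" % (best, margin))
```

With a margin of only 0.25, this version would ask rather than plough ahead.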

– John Machin
  • Good questions. Some of the answers I've left out on purpose to make the question a little more generic and of interest to other users... Anyway, I'll edit the question to try and answer them. – Tal Weiss Aug 06 '10 at 04:38
  • I don't know which keyboard the user used, but I'm positive that QWERTY variants are more common than, say, Dvorak. – Tal Weiss Aug 06 '10 at 11:30