I contribute to a wiki database that currently has about 500,000 entries. Some of these entries have keywords attached. Due to a previous lack of restrictions on the database, people have often misspelled keywords when typing them in, thus creating new, misspelled instances of already-existing keywords.
I'd like to look through the list of keywords to find these instances. However:
- The list is about 4500 keywords long, so manually checking is out.
- Many keywords are obscure, very technical, or named after people, so checking them against a dictionary isn't going to be much use.
- Since many keywords are obscure or very technical, they each occur on only a few entries in the entire database. By my estimate, about 80% of these keywords appear on fewer than 10 entries, and about half appear on just one entry. So looking at keywords that appear on only a few entries (my initial thought, which is partly why I have these estimates) is still inefficient.
The only other solution I came up with is to scan the entire list for similar pairs of strings. Googling brought me to the Levenshtein distance and its relative, the Damerau-Levenshtein distance. While I could naively check every single pair (this is O(n^2 m^2), where n is the number of keywords and m is the maximum keyword length), I was wondering if there were any more suitable approaches before I code this up tomorrow.
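For reference, here's roughly what I had in mind for the naive version (just a sketch in Python; the `max_dist` cutoff of 2 is a placeholder I'd tune later):

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as b to use less memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

def similar_pairs(keywords, max_dist=2):
    """Yield every pair of keywords within max_dist edits of each other."""
    for a, b in combinations(keywords, 2):
        # Strings whose lengths differ by more than max_dist can never be
        # within max_dist edits, so skip the expensive DP for those pairs.
        if abs(len(a) - len(b)) <= max_dist and levenshtein(a, b) <= max_dist:
            yield a, b
```

Even with the cheap length filter, at n ≈ 4500 that's still roughly 10 million candidate pairs, which is tolerable for the occasional run but not something I'd want in a tight feedback loop.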
Since I suspect I'll need to run this many times while tinkering with the code to remove false positives (probably adding weight to some specific edits and reducing it for others), efficiency may be a problem.
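For the weighting I mentioned, I'm picturing something like the following (again just a sketch; the cost parameters and the "cheap" confusion set are invented for illustration):

```python
def weighted_levenshtein(a, b, sub_cost=lambda x, y: 1.0,
                         ins_cost=1.0, del_cost=1.0):
    """Edit distance with tunable per-operation costs; sub_cost(x, y)
    is the cost of substituting character x with y."""
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + del_cost,
                curr[j - 1] + ins_cost,
                prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb)),
            ))
        prev = curr
    return prev[-1]

# Example: count substitutions between commonly confused letters
# as half an edit, so "analisys" vs "analysis" scores 1.0 instead of 2.
cheap = {frozenset("iy"), frozenset("sc")}
print(weighted_levenshtein(
    "analisys", "analysis",
    sub_cost=lambda x, y: 0.5 if frozenset((x, y)) in cheap else 1.0))
```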
This question is similar to the one posed in this answer, but I couldn't immediately find an answer to it anywhere.
Obviously, any alternative approaches not involving the Levenshtein distance would be welcome too.