
I contribute to a wiki database that currently has about 500,000 entries. Some of these entries have keywords attached. Due to a previous lack of restrictions on the database, people have often misspelled keywords when typing them in, creating new, misspelled instances of already-existing keywords.

I'd like to look through the list of keywords to find these instances. However:

  • The list is about 4500 keywords long, so manually checking is out.
  • Many keywords are obscure, very technical, or named after people, so checking them against a dictionary isn't going to be much use.
  • Because many keywords are obscure or very technical, they occur on only a few entries each across the entire database. By my estimate, about 80% of these keywords appear on fewer than 10 entries, and about half of the keywords appear on just one entry. So looking at keywords that appear on only a few entries (my initial thought, which is partly why I have these estimates) is still inefficient.

The only other solution I came up with is to scan the entire list for similar pairs of strings. Googling brought me to the Levenshtein distance and its relative, the Damerau-Levenshtein distance. While I could naively check every single pair (this is O(n^2 m^2), where n is the number of keywords and m is the maximum keyword length), I was wondering whether there are any more suitable approaches before I code this up tomorrow.
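For concreteness, something like the following minimal Python sketch is the naive version I have in mind (the max_dist cutoff of 2 edits is just a placeholder to tune; the per-edit costs in the inner loop are where weights would go later):

    def levenshtein(a, b):
        # standard dynamic-programming edit distance, computed one row at a time
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1           # substitution cost
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution/match
            prev = curr
        return prev[-1]

    def similar_pairs(keywords, max_dist=2):
        # O(n^2 m^2): compare every pair and keep those within max_dist edits
        for i, a in enumerate(keywords):
            for b in keywords[i + 1:]:
                if levenshtein(a, b) <= max_dist:
                    yield a, b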

Since I suspect I'll need to run this many times while tinkering with the code to remove false positives (probably increasing the weights of some specific edits and decreasing others), efficiency may be a problem.

This is a similar question to the one posed in this answer, but I couldn't immediately find an answer to it anywhere.

Obviously, any alternative approaches not involving the Levenshtein distance would be welcome too.

edderiofer
  • It's a concrete algorithm question, perfectly reasonable. – David Eisenstat Jun 26 '21 at 22:28
  • Mark's answer is what I would suggest for a beginner implementing from scratch, but there are also approaches that involve constructing DFAs/tries/DAWGs, e.g., http://stevehanov.ca/blog/index.php?id=114 – David Eisenstat Jun 26 '21 at 22:40

2 Answers


Random thought: part of the problem is the cost of the Levenshtein computation itself, so maybe there's a simpler function that can serve as a conservative proxy, e.g. word length. A slightly more complicated function maps each word to a 26-element vector of character counts, then compares vectors using Euclidean distance.
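A quick sketch of the count-vector idea (Python assumed; only lowercase a-z is counted here, anything else is ignored):

    import math
    import string

    def count_vector(word):
        # 26-element vector of character counts for the word
        counts = [0] * 26
        for ch in word.lower():
            if ch in string.ascii_lowercase:
                counts[ord(ch) - ord('a')] += 1
        return counts

    def vector_distance(u, v):
        # Euclidean distance between two count vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

This really is conservative: a single insertion or deletion moves the count vector by at most 1 (in Euclidean norm) and a substitution by at most sqrt(2), so any pair whose vectors are more than sqrt(2) * d apart cannot be within Levenshtein distance d and can be skipped without running the full computation.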

Another random thought: map the words into character-count vectors as above, then find clusters and only compare pairs within each cluster.
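A sketch of the clustering variant, assuming scikit-learn is available for k-means (the number of clusters is an arbitrary knob to tune, and near-misses that land in different clusters will be lost, so this is a heuristic):

    from collections import defaultdict
    from itertools import combinations
    from sklearn.cluster import KMeans

    def candidate_pairs_by_cluster(keywords, n_clusters=200):
        # cluster keywords on their character-count vectors (count_vector from above)
        vectors = [count_vector(w) for w in keywords]
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
        clusters = defaultdict(list)
        for word, label in zip(keywords, labels):
            clusters[label].append(word)
        # only pairs within the same cluster are handed to the expensive check
        for members in clusters.values():
            yield from combinations(members, 2)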

Mark Lavin
    To make this fast you're going to want an efficient approximate nearest neighbors library for those vectors, e.g., https://pypi.org/project/scann/ – David Eisenstat Jun 26 '21 at 22:30

I suggest choosing some small number k (e.g., k = 3) and then mapping each keyword to the set of k-tuples it contains, e.g.:

Word: widget

3-tuples:
      wid
       idg
        dge
         get

You can then build a k-tuple index that maps any given k-tuple back to the list of keywords that contain it. This can be implemented as a hashtable, or as an array of 26^k elements. Intuitively, word pairs that have low Levenshtein distance will share a large fraction of k-tuples, so to find the approximate nearest neighbours of any given keyword S, it suffices to determine the constituent k-tuples for S, look up the corresponding lists from the index, and form the union of them.
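A minimal Python sketch of the hashtable version (k = 3; keywords shorter than k would need special handling since they produce no k-tuples):

    from collections import defaultdict

    def kgrams(word, k=3):
        # set of all length-k substrings of the word
        return {word[i:i + k] for i in range(len(word) - k + 1)}

    def build_index(keywords, k=3):
        # maps each k-gram to the set of keywords containing it
        index = defaultdict(set)
        for word in keywords:
            for gram in kgrams(word, k):
                index[gram].add(word)
        return index

    def candidates(word, index, k=3):
        # union of the keyword sets for every k-gram of the query word
        result = set()
        for gram in kgrams(word, k):
            result |= index.get(gram, set())
        result.discard(word)
        return result

Each keyword's candidate set can then be checked with the full Levenshtein (or Damerau-Levenshtein) computation.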

Choosing a higher k gives smaller lists to check but can miss some matches; choosing k too low means you will spend a lot of time wading through false positives (keywords that share a k-tuple but are nevertheless dissimilar). A trick that is usually a win is to completely discard the most common k-tuples: e.g., a k-tuple that appears in 80% of the keywords is not very informative, and comparing each of those keywords against all the others that share it takes you back to essentially quadratic time.
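For instance, a pruning pass over the index might look like this (the 5% cutoff is an arbitrary placeholder to tune):

    def prune_index(index, n_keywords, max_fraction=0.05):
        # drop k-grams whose keyword sets cover more than max_fraction of all
        # keywords; such grams mostly generate false-positive candidates
        return {gram: words for gram, words in index.items()
                if len(words) <= max_fraction * n_keywords}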

Refinements of this approach are used in bioinformatics to quickly map DNA sequence reads to known reference genomic sequences.

j_random_hacker