1

I am creating a small language dictionary application that requires the user to input a word (in either two languages) and have the most relevant results (Out of 10 or more) show up. My client wants this to account for spelling errors so I'm using the Damerau-Levenshtein distance formula. As an example here is a snippet of what the application should do:

MySQL fields - Term1 -> Value1 | Term2 -> Value2
Implementation - English Term -> English Value | German Term -> German Value
----------

forge -> to forge your parent's signature | fälschen -> die unterschrift de eltern fälschen
    Synonyms: fake, imitation, etc,
fake -> to fake your parent's signature | fälschen -> die unterschrift de eltern fälschen
    Synonyms: forge, imitation, etc,
black out -> to black out a classroom (with blinds) | verdunkeln -> (einen klassenraum) verdunkeln

Since fake and forge are similar I want the search results for forge (or foreg etc.) to return the two. I have a crude implementation working right now that will search through every row of a large database but it is a long process and I need a better system.

As additional information I am using the Moby Thesaurus to find synonyms of each word returned. To cut down on bulk synonyms (as nearly 20 results are placed in the array) I will probably strip entries that aren't found in the database.

Anyways, what I'm trying to find out is if there is a faster, more efficient method for search the database than what I am using right now. I hope I am clear enough, if not feel free to ask me more.

Many thanks!

Keiran Lovett
  • 606
  • 2
  • 10
  • 28

1 Answers1

1

Damerau-Levenshtein distance is an algorithm that can't really be optimized by a precomputed index. So, you're going to have some trouble making it faster in a DBMS context. (There are some tricks that make it possible to compare a single word to a lexically organized dictionary, but they're pretty exotic).

However, if you can retrieve a subset of the contents of your thesaurus table, and THEN use a distance algorithm, you might win.

For the first step try SOUNDEX (a sloppy and very cheap sound-alike match algorithm) that's built into mySQL. Or, if that casts too wide a net, you might look up the Metaphone or Double Metaphone algorithm.

Then for the second step do what you're already doing with a distance algorithm.

Check out this question and several answers. How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

Community
  • 1
  • 1
O. Jones
  • 103,626
  • 17
  • 118
  • 172