0

I have a multi-user database management system of about 1 million records, its structure is as below:

  1. Backend (MySQL)
    • "DNames" table
      • "Fullname" field
      • "ID" field
  2. Frontend (MS Access)
    • "levenshtein" function
    • "lev" query
      • "lev_dist" field (calculated levenshtein distance using function above, sorted asc)
      • "Fullname" field
      • "ID" field
    • "srch" textbox in "result" form

My problem is that when I run the query (i.e. use "srch" textbox) without sorting it's fast enough, but when I use sort it takes about 30 to 90 sec to complete (depending on pc specs). I need the sort operation to find the top 10 (closest) match between the text in "srch" textbox and the database, so how can I speed up the process? Is there a way to make it reach 5 second max? This process may run from 5 PCs simultaneously. I tried using MySQL levenshtein function , yet it took 2 minuts!!

  • The problem is that mysql needs to calculate the distance for all records that satisfy your query and then sort the resultset accordingly. It cannot use any index, it cannot use any shortcuts. If you want to use text analytics like this, then I'm sorry to say that a traditional rdbms may not be the best tool for you. Reading the entire dataset into memory and performing the analysis there is the best course of action for datasets that fit into your memory. There are specific text analytics tools out there that can also speed calculations like this up for larger datasets. – Shadow Mar 24 '21 at 12:55
  • can you post the query and the levenshtein function / query – Bernd Buffen Mar 24 '21 at 13:17

1 Answers1

0

Would you accept a compromise? Find all words within a 'small' distance in perhaps 1ms (if the data is cached in the buffer_pool)?

  1. Build a table with about 5M-10M rows (based on your 1M 'words'). It would have two columns -- F(word), word.
  2. Lookup F(word) to get a list of possible words.

F(word) is a set of strings -- Take the 'word' and drop each letter, plus the original word. For example:

word --> ord, wrd, wod, wor, word
letter --> etter, ltter, leter, lettr. lette, letter

(Note that 'leter' occurs twice)

Table and query:

CREATE TABLE ricks_leven ()
    fword VARCHAR(22) NOT NULL,  -- F(word)
    word  VARCHAR(22) NOT NULL,  -- the desired word
    PRIMARY KEY(fword, word)
) ENGINE=InnoDB;

SELECT word, COUNT(*) AS ct
    FROM ricks_leven
    WHERE fword IN ('etter', 'ltter', 'leter', 'lettr'. 'lette', 'letter')
    GROUP BY word
    ORDER BY ct DESC
    LIMIT 10;

A perfect match will automatically come first in the output. Some other "likely" misspellings may come next. I don't know if the Levenshtein distance orders the results in the same way.

This algorithm covers these common typos, all of which have a small Levenshtein distance:

  • any one-letter drop,
  • adjacent letter transposition (distance=2, but important),
  • a letter being added in any location.

A compromise between speed and completeness:

  1. Use my technique. If you get some results, then quit.
  2. Fall back onto slow Levenshtein search.
Rick James
  • 135,179
  • 13
  • 127
  • 222