2

I recently implemented the UDFs of the Damerau–Levenshtein algorithms into MySQL, and was wondering if there is a way to combine the fuzzy matching of the Damerau–Levenshtein algorithm with the wildcard searching of the Like function? If I have the following data in a table:

ID | Text
---------------------------------------------
1  | let's find this document
2  | let's find this docment
3  | When the book is closed
4  | The dcument is locked

I want to run a query that would incorporate the Damerau–Levenshtein algorithm...

select text from table where damlev('Document',tablename.text) <= 5;

...with a wildcard match to return IDs 1, 2, and 4 in my query. I'm not sure of the syntax or if this is possible, or whether I would have to approach this differently. The above select statement works fine in issolation, but is not working on individual words. I would have to change the above SQL to...

select text from table where 
 damlev('let's find this document',tablename.text) <= 5;

...which of course returns just ID 2. I'm hoping there is a way to combine the fuzzy and wildcard together if I want all records returned that have the word "document" or variations of it appearing anyway within the Text field.

user1236443
  • 549
  • 2
  • 8
  • 19

1 Answers1

3

In working with person names, and doing fuzzy lookups on them, what worked for me was to create a second table of words. Also create a third table that is an intersect table for the many to many relationship between the table containing the text, and the word table. When a row is added to the text table, you split the text into words and populate the intersect table appropriately, adding new words to the word table when needed. Once this structure is in place, you can do lookups a bit faster, because you only need to perform your damlev function over the table of unique words. A simple join gets you the text containing the matching words. enter image description here

A query for a single word match would look something like this:

SELECT T.* FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
JOIN Text AS T ON T.TextId = I.TextId
WHERE damlev('document',W.Word) <= 5 

and two words would look like this (off the top of my head, so may not be exactly correct):

SELECT T.* FROM Text AS T
JOIN (SELECT I.TextId, COUNT(I.WordId) AS MatchCount FROM Word AS W
      JOIN Intersect AS I ON I.WordId = W.WordId
      WHERE damlev('john',W.Word) <= 2
            OR damlev('smith',W.Word) <=2
      GROUP BY I.TextId) AS Matches ON Matches.TextId = T.TextId
          AND Matches.MatchCount = 2

The advantages here, at the cost of some database space, is that you only have to apply the time-expensive damlev function to the unique words, which will probably only number in the 10's of thousands regardless of the size of your table of text. This matters, because the damlev UDF will not use indexes - it will scan the entire table on which it's applied to compute a value for every row. Scanning just the unique words should be much faster. The other advantage is that the damlev is applied at the word level, which seems to be what you are asking for. Another advantage is that you can expand the query to support searching on multiple words, and can rank the results by grouping the matching intersect rows on TextId, and ranking on the count of matches.

  • In fact my text field can contain over 10000 characters, so potentially a lot of words. Not sure I understand. Would I have to take each row and split it into its separate words? – user1236443 Jan 10 '13 at 16:00
  • If your text is that long, I'm not sure how ideal my suggestion would be without testing. Yes, each row has its text split to words, and one row is added to the intersect table for each word. I will add a diagram to my answer to make it more clear. – hatchet - done with SOverflow Jan 10 '13 at 16:08
  • Looks like a good solution. What about when you are looking for two words separated by a space, as in a Person's name. Does the technique still apply if you say wanted to find "John Smith" and/or "Jon Smith". – user1236443 Jan 10 '13 at 16:57
  • In my own use, I don't try to do it all in the database. I use the database to provide a relatively small set of candidates. Then I evaluate and score the candidates using more involved algorithms. For example, people commonly transpose names, merge dual last names, and generally mangle them in a multitude of ways that are difficult to handle in the database alone. Regarding your specific example, I'll add another query to the answer. – hatchet - done with SOverflow Jan 10 '13 at 17:59
  • So do you build your Words, Intersect table outside of MySQL? Are there purpose algorithms for doing this? I was hoping to try and keep everything within MySQL but not sure how viable this is. – user1236443 Jan 31 '13 at 13:28
  • @user1236443 - We use Sql Server, and it's all done on the server with user defined functions, stored procedures and triggers. I think the same thing could be done on MySQL. The main thing is writing the SQL to split words. A google search will show many splitWord functions floating around. The work is done via triggers on the table containing the full text. – hatchet - done with SOverflow Jan 31 '13 at 14:34
  • Thanks. I have a procedure that will split strings based on a delimiter. Just thinking of the best way to to the intersect table. – user1236443 Jan 31 '13 at 14:50
  • I wonder then - first step split words by spaces and do a distinct to get a unique set of words. Second step, I guess within a loop take the next word in your list and maybe an insert into Intersect with a LIKE search in Text showing ID of Text and ID of Word and perhaps build it up that way. – user1236443 Jan 31 '13 at 14:53
  • @user1236443 Split into words, insert into Words table any words that don't already exist, then insert the intersect rows. The insert and delete triggers on the text table are straightforward. For updates, it's simpler to do the actions of a delete followed by the actions of an insert rather than trying to figure out which intersect rows stay, which get deleted, and which need adding. It's a little extra work for the database, but if updates are infrequent, not a problem, which was the case for us. – hatchet - done with SOverflow Jan 31 '13 at 17:31
  • I have two ideas to rise the performance: 1.) Add a column with the string length to the word table and compare only those words that could be part of `WHERE damlev('document',W.Word) <= 5` – mgutt Aug 10 '14 at 08:27
  • 2.) Add 10 (depends on the typical word len of the lang) columns and fill them with unicode code points representing all used chars ( http://stackoverflow.com/a/11541879/318765 ). E.g. "car" will use 3 columns filled with 6300, 6100 and 7200. Now you can compare only those words that have a minimum of one (or two) matching chars. E.g. "red" will search within "car" because of the used char "r", but "bye" won't. You don't need as much columns as the word len because very long words could be matched by the first idea. Of course you could use 1-26 representing a-z (depends on used lang). – mgutt Aug 10 '14 at 08:56