2

I have a simple application that allows users to enter a string to search for a name in a database. The server side is ColdFusion 7. The issue I'm having is that a query such as "obrien" does not return entries with the name "o'brien."

I think what I want is fuzzy matching capability. After doing some research I've also come across full text search which might be what I'm looking for; however, I'm not sure about the difference between the two. ColdFusion has a service called verity, but it seems I have to first query all of the database, and then index it--this sounds very costly.

Is there a built-in way to do fuzzy matching or full text search in ColdFusion without first querying the entire database? If not, when doing full text search, do I have to specify the indexes? For example, obrien should index to "obrien, o'brien, o'brein"?

Miguel-F
  • 13,450
  • 6
  • 38
  • 63
perry_VW
  • 145
  • 3
  • 14
  • What DB backend are you using?# – diagonalbatman Apr 09 '13 at 14:56
  • @diagonalbatman db2 on a mainframe – perry_VW Apr 09 '13 at 15:08
  • 3
    A google search on "db2 fuzzy search" led me to IBM documentation on the topic. – Dan Bracuk Apr 09 '13 at 15:11
  • Out of interest - if you had someone called Digby-Jones, what would you be searching for ? – Hugh Jones Apr 09 '13 at 15:43
  • @DanBracuk is right I reckon. Don't try to do this in ColdFusion, do it on the database. Have a look at how to do full-text searching on DB2, and give that a go. If you get stuck, revise the question or raise a new one. The full-text search will almost certainly tokenise o'brien and obrien the same way, and will probably deal with misspells as well. DO NOT USE VERITY. It is dead technology, and it a right pain to use in ColdFusion anyhow. – Adam Cameron Apr 10 '13 at 04:33
  • Your performance is going to be nightmarish if you need you have hundreds of thousands or millions of rows of data and try to do this without an inverted index. How much data are we talking about and is wait time an issue? – J.T. Apr 10 '13 at 12:02
  • @AdamCameron, I gave that a look. When trying to implement it, I got an error that I think meant full text search wasn't enabled on this database. JT, somewhere between 50-100k records and wait time is an issue. – perry_VW Apr 10 '13 at 13:55

4 Answers4

3

What about trying to use the SOUNDEX founction? I don't think there is any easy answer here.

Cyril Hanquez
  • 686
  • 3
  • 6
  • Soundex will work if all he needs to handle are apostrophes, however, it lets a ton of noise in. – J.T. Apr 10 '13 at 11:56
  • I ended up using soundex. I have generally have enough names provided by the user to narrow down the results significantly. – perry_VW Apr 11 '13 at 16:52
2

check below queries,

select * 
from tablename 
where  Replace(ColumnName,'''','') like 'obrien'

OR

select * 
from tablename 
where  Replace(ColumnName,'''','') like '%obrien%'
genericHCU
  • 4,394
  • 2
  • 22
  • 34
2

You are looking for fulltextsearch with fuzzy matching capabilities, more importantly you are looking for the built-in tokenizers that these systems offer. However, you can also use a SQL Function that will do Levenshtein edit distance matching.

Verity or any other information retrieval system such as Lucene would be a solution, but you'd have to constantly reindex your data from the raw data if you need real time search. If you don't need to repopulate Verity very often, this could be a good choice (bless your heart for dealing with it though). Tokenizers handle formatting the data from the database and the user so that you end up with identical strings. Also Verity is old in ColdFusion 7, so I imagine it's fuzzy matching options are pretty limited compared to what you would find in Lucene 4.

The other option is to use a a function in sql for levenshtein edit distance algorithm. This would basically tell you how many edits it takes to get from one string to the other. This is "fuzzy" matching. For instance to turn lcase(O'brien) into lcase(obrien) it takes one edit, the removal of the "'". One edit of a six length string is a good match. To turn "hugh" into "jones" it would take five edits (changing every letter,adding a letter). Five edits of a length 4 string is not a very good match. Levenshtein distance in T-SQL

Finally, for any database with fulltext search, this will give you live data with some tokenizing like verity. It's nice because it's so, so simple to setup. I'm not sure if DB2 supports fuzzy matching. SQL Server does not.

Community
  • 1
  • 1
J.T.
  • 2,606
  • 15
  • 31
  • really good explanation. I understand now why he's not looking for apostrophes ;P. Out of interest - how many edits does it take to get from "How to return query results containing apostrophe from search string not containing apostrophe" to "How to achieve fuzzy matching in DB2 ?" – Hugh Jones Apr 10 '13 at 12:04
  • Just one edit! For some reason I thought Hugh Jones was the OP. Hah. – J.T. Apr 10 '13 at 12:06
  • This may be a stupid question, but does fulltext search need to be enabled on db2, or is it just a capability always present. If it needs to be enabled, does the user have to provide any info as far as the indexes (that obrien, o'brein, o'bien, etc should all index to o'brien)? – perry_VW Apr 10 '13 at 14:01
  • 1
    On SQL Server you first setup a full text catalog. This takes about three minutes. From that point, you can simply create a new full text index on every table that needs it and the t-sql is extended with a couple of full text functions for that table. This creates an inverted index. An inverted index holds all unique words (or tokens) that are found in your dataset and then records where those words are found. It's nothing like a standard database index. – J.T. Apr 10 '13 at 14:19
-2

You are going down the wrong track here - its not a fuzzy match you need, its a parameterised query.

You can set the value of the parameter to "O'Brien" and the parameter will understand the function of the embedded apostrophe.

Hugh Jones
  • 2,706
  • 19
  • 30
  • 1
    the value "o'brien" is in the database but the user searches for "obrien". 0 results because "O'brien" <> "obrien" – genericHCU Apr 09 '13 at 15:25
  • 4
    You may be right, but this looks suspiciously like someone trying to avoid the headache of apostrophes in names. If the guy is called O'Brien then isn't that the name we should search for? I could be wrong ... – Hugh Jones Apr 09 '13 at 15:37
  • Yeah, I think @travis has it. The problem is that the user could query either "obrien" or "o'brien." That's just one example, it could happen with any name with an apostrophe. The problem is that the user might not know if that person's name has an apostrophe or not. I'd like to return a name to them regardless of whether they know it contains an apostrophe. – perry_VW Apr 09 '13 at 16:01
  • @HughJones, I agree with your statement, "I could be wrong". The OP was quite specific about a fuzzy search. – Dan Bracuk Apr 09 '13 at 16:17
  • @perry_VW - ok then fuzzy matching it is - the apostrophe is just an example of a possible misspelling and not an invalid character in a UI as I had suspected. I retract. – Hugh Jones Apr 09 '13 at 16:25