0

I'm working on a query which I thought should be quite intuitive, but somehow I'm facing a bit of issues when implementing it. I guess what I'm trying to achieve is to match a string stored in MySQL DB without space and punctuation (other creative approaches are more than welcome). At the same time I would like the query to handle Unicode characters in diacritics insensitive fashion (so options like REGEXP are kinda out of luck). And the last condition is I'm on MySQL 5.5 with InnoDB engine, so full-text indexing is not supported (but I'm open to upgrade to 5.6/5.7 if it helps sorting this out).

Consider the scenario which the string Hello-World from John Doe is stored in DB. I would like to find it when given the search string HelloWorld or JohnDoe. To be more general, the string in DB can contain brackets, understores and any other punctuation (not limited to ASCII but can compromise for now), while the search string can be a combination of words with or without any separators in between. The closest I've gotten so far is to daisy chain the REPLACE function for a list of known punctuation, like below:

SELECT text FROM table WHERE REPLACE(REPLACE(text, '-', ''), ' ', '') LIKE '%JohnDoe%'

My questions are:

  1. Is there a better way instead of using the daisy chain above?
  2. If that's the only solution, how will the performance be impacted when I chain up hundred or more REPLACE functions?

Thanks in advance for your help.

CLDev
  • 1,467
  • 1
  • 15
  • 19

3 Answers3

4

I don't know how restrictive your searches must be, but you could try to strip out all non-alphanumeric characters from it, so that you end up with a string like "HelloWorldfromJohnDoe" that you match with instead.

Have a look at this answer: How to remove all non-alpha numeric characters from a string?

You might have to change it around a bit though to make it fir your purposes. I changed it from CHAR(32) to CHAR(255) to make sure I could get the column, but you might want to look into changing the function altogether to fit your data more precisely.

Then you something like this:

SELECT *
FROM testing
WHERE alphanum(test) LIKE CONCAT('%', alphanum('John Doe'), '%')

which should give you a hit.

Community
  • 1
  • 1
user2620460
  • 168
  • 1
  • 1
  • 9
  • Thanks for your pointer. In fact I've looked into that thread, unfortunately I've to deal with CJK characters as well. I'll give a try to modify the function to get rid of **[[:punct:]]** instead of keeping **[[:alnum:]]**, and see if it works. – CLDev Jan 09 '17 at 03:11
1

Method 1

I would have another column on the schema containing an "hashed" version of the name, for example, let's say you have the user:

John Doe The Great

This name hashes to

johndoethegreat

The hash function is coded in such a way that all of the following strings:

John_Doe_THE_great
John Doe The GREAT
John.Doe.The.Great
johnDOE___theGreat
john   Doe   the     great
___john____DOE____THE____great

hash to the same value

johndoethegreat

It's trivial to write such a function. This way you can get the user input, hash it and then compare it against the hash column in your database

Names like:

Jon Doe
John Doo

will not be found of course

Method 2

Use the FULLTEXT search feature built-in in MySQL, sort the results by score and pick the first non zero entry

http://blog.oneiroi.co.uk/mysql/php/mysql-full-text-search-with-percentage-scoring/

Gianluca Ghettini
  • 11,129
  • 19
  • 93
  • 159
0

I am totally missing the point of your question. You appear to have the string:

  • Hello-World from John Doe

If you want to find this when the search string is JohnDoe or John Doe, then you only need to substitute spaces:

where replace(text, ' ') like concat('%', 'JohnDoe', '%')

If you want a string that contains both "John" and "Doe" in that order, then:

where replace(text, ' ') like concat('%', 'John%Doe', '%')

I fail to see why 100 nested replace()s would be needed.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • The problem is, the separator can be any punctuation, not limited to hyphen or space. Some records could have understores, some records could have brackets, say **John_Doe** and **John (Doe)**. I want all these records to be matched by the search string **JohnDoe**, hope this makes sense to you. – CLDev Jan 09 '17 at 03:01
  • @CLDev . . . And that is what `like '%John%Doe%'` handles. – Gordon Linoff Jan 09 '17 at 03:04
  • Agreed, but what if the user who does the input doens't know about the context and put **JohnDoe**? (I don't have control on the user input, i.e. the search string.) Then I'll have to tokenize the phrase? – CLDev Jan 09 '17 at 03:07
  • @CLDev . . . At some point, you have to trust the users to know what they are looking for. – Gordon Linoff Jan 09 '17 at 03:13
  • It's not a matter whether I trust the users or not, but I'm experiencing this as a practical problem... There are place names or word phrases that some people add hyphens (or other punctuation) but some don't. For example, I see some people type **re-engineering** while some type **reengineering**. So I'm now given the latter and some users expect to get articles with the former in return. – CLDev Jan 09 '17 at 03:20