2

While there's a told of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.

For example, we have a paragraph of text like:

"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"

We have a table of keywords to match against this and have been using the basic system of:

                    SELECT
                        sk.ID
                    FROM
                        sit_keyword sk
                    WHERE
                        var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')

It works fine 90% of the time, but it completely fails on:

C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.

I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.

Any advice on the best way to do this using REGEXP would be appreciated.

Floobinator
  • 388
  • 2
  • 11

2 Answers2

1

You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:

var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')

The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') matches . ^ $ * + - ? ( ) [ ] { } \ | chars (adds a \ before them) and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase start/ends with a word char.

More details on adaptive dynamic word boundaries and demo in my YT video.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Not sure if the replacement in the escaping function works in the current form, if it does not, double the backslashes, ``REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\\\$1')`` – Wiktor Stribiżew Aug 18 '22 at 19:32
  • 1
    This is EXACTLY what I was looking for - THANK YOU! And thank you for the link to your YT video - very informative! Extra note - what I did was create a virtual stored column for all the root keywords and such that escaped them with REGEXP_REPLACE so when I ran the actual check, it was lightning fast - ensuring I checked against an already escaped value vs. having to compute it for each iteration. – Floobinator Aug 18 '22 at 21:40
0

Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax

If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.

This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():

WHERE LOCATE(sk.keyword, var_text) > 0

This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828