7

I am trying to get data from MySQL database via REGEX with or without special utf-8 characters.

Let me explain on example :

If user enters word like sirena it should return rows which include words like sirena,siréna,šíreňá .. and so on.. also it should work backwards when he enters siréná it should return the same results..

I am trying to search it via REGEX, my query looks like this :

SELECT * FROM `content` WHERE `text` REGEXP '[sšŠ][iíÍ][rŕŔřŘ][eéÉěĚ][nňŇ][AaáÁäÄ0]'

It works only when in database is word sirena but not when there is word siréňa..

Is it because something with UTF-8 and MySQL? (collation of mysql column is utf8_general_ci)

Thank you!

Maarty
  • 1,140
  • 1
  • 12
  • 27

3 Answers3

6

MySQL's regular expression library does not support utf-8.

See Bug #30241 Regular expression problems, which has been open since 2007. They will have to change the regular expression library they use before that can be fixed, and I haven't found any announcement of when or if they will do this.

The only workaround I've seen is to search for specific HEX strings:

mysql> SELECT * FROM `content` WHERE HEX(`text`) REGEXP 'C3A9C588';
+----------+
| text     |
+----------+
| siréňa   |
+----------+

Re your comment:

No, I don't know of any solution with MySQL.

You might have to switch to PostgreSQL, because that RDBMS supports \u codes for UTF characters in their regular expression syntax.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • so there is no solution if my strings are variables? – Maarty Nov 04 '13 at 19:35
  • Two years have passed since you answered this. Do you know of any updates about this topic? – e18r Nov 23 '15 at 02:44
  • @emisilva, click on the link I provided to check the status of that bug. – Bill Karwin Nov 23 '15 at 06:59
  • @emisilva If I remember right, my solution was to change collation on DB column to utf8_general_ci.. After thatm searching was working as expected... hope it helps – Maarty Nov 23 '15 at 13:37
  • @emisilva check this http://stackoverflow.com/questions/8647080/accent-insensitive-search-query-in-mysql – Maarty Nov 23 '15 at 13:38
  • For me it didn't work when I did things like cami[óo]n, but it did work as expected with camión|camion – e18r Nov 23 '15 at 13:44
  • This worked perfectly! I needed it to find corrupted, UTF-8 characters. This worked: SELECT id, SomeField from SomeTable where HEX(SomeField) RLIKE 'C382' OR HEX(SomeField) RLIKE 'C383'; – HoldOffHunger May 17 '17 at 16:20
1

Try something like ... REGEXP '(a|b|[ab])'

SELECT * FROM `content` WHERE `text` REGEXP '(s|š|Š|[sšŠ])(i|í|Í|[iíÍ])(r|ŕ|Ŕ|ř|Ř|[rŕŔřŘ])(e|é|É|ě|Ě|[eéÉěĚ])(n|ň|Ň|[nňŇ])(A|a|á|Á|ä|Ä|0|[AaáÁäÄ0])'

It works for me!

felinux
  • 11
  • 2
-3

Use the lib_mysqludf_preg library from the mysql UDF repository for PCRE regular expressions directly in mysql

Although MySQL's regular expression library does not support utf-8 the mysql UDF repository has the ability to use utf-8 compatible regex according PCRE regular expressions directly in mysql.

http://www.mysqludf.org/ https://github.com/mysqludf/lib_mysqludf_preg#readme

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
Ghanshyam
  • 1
  • 2
  • Whilst this may theoretically answer the question, [it would be preferable](//meta.stackoverflow.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – Bhargav Rao Feb 23 '16 at 14:15
  • Although MySQL's regular expression library does not support utf-8 the mysql UDF repository has the ability to use utf-8 compatible regex according PCRE regular expressions directly in mysql. – Ghanshyam Feb 24 '16 at 07:42
  • Wasted my time installing this library, turns out it does not support (meaning correctly handle) utf-8 at all. SELECT PREG_RLIKE('/aeiou/', 'àèìòù') returns 0, just like SELECT 'aeiou' RLIKE 'àèìòù' does. So, this library is no solution to the fact that MySQL does not handle utf8 correctly. – matteo Apr 10 '16 at 20:23
  • Confirming, lib_mysqludf_preg does not supports UTF8 (multi-byte character encoding) at all. – Aleksey Kuznetsov Apr 28 '19 at 01:18