1

I have a big issue with regexp in Mysql 5.5. I have an expression:

edukacj[aąeęi] zdrowotnej

And the regexp engine finds only edukacja zdrowotna, but when I define the regexp like:

edukacj(a|ą|e|ę|i) zdrowotnej

it find correctly also edukacji zdrowotna. Shouldn't the two above give th same result?

Marek
  • 39
  • 1
  • 3
    `zdrowotna` != `zdrowotnej` ... please include actual sample table data along with your current query or queries. – Tim Biegeleisen Jan 31 '22 at 09:38
  • Might be a dupe of [Does MySQL Regexp support Unicode matching](https://stackoverflow.com/questions/14356248/does-mysql-regexp-support-unicode-matching) – Wiktor Stribiżew Jan 31 '22 at 10:08
  • This is the query i'm using: ```SELECT nazwa FROM `kursy` where nazwa regexp '[eę]duk[aą][cć]j[aąeęi] [zżź]dr[oó]w[oó]';``` – Marek Jan 31 '22 at 10:10

1 Answers1

0

I'm pretty sure that REGEXP was unable to handle UTF-8 characters "correctly" before MySQL 8.0. With 8.0, collations are honored, and much of what you are doing can be simplified.

In particular, REGEXP simply looked at each byte, one at a time. However, your accented letters require two bytes, thereby they are not handled correctly.

Here's an example of it working 'correctly' in 8.0:

mysql> SELECT "edukacją zdrowotna" regexp 'edukacj[aąeęi] zdrowotna';
+-----------------------------------------------------------+
| "edukacją zdrowotna" regexp 'edukacj[aąeęi] zdrowotna'    |
+-----------------------------------------------------------+
|                                                         1 |
+-----------------------------------------------------------+

Or, to focus on the single character:

mysql> SELECT 'ą' REGEXP '[aąeęi]';
+-------------------------+
| 'ą' REGEXP '[aąeęi]'    |
+-------------------------+
|                       1 |   <--  1 == TRUE == it matched
+-------------------------+

I recommend you upgrade: 5.5 -> 5.6 -> 5.7 -> 8.0. Or dump the data and reload on 8.0. In either case, the upgrade will be quite time-consuming due to the large number of "little things" that have changed.

This chart shows that, with any collation (other than utf8_bin), ą = 'a': http://mysql.rjweb.org/utf8_collations.html

Rick James
  • 135,179
  • 13
  • 127
  • 222