I'm pretty sure that REGEXP was unable to handle UTF-8 characters "correctly" before MySQL 8.0. With 8.0, collations are honored, and much of what you are doing can be simplified.
In particular, before 8.0 REGEXP simply looked at each byte, one at a time. However, each of your accented letters takes two bytes in UTF-8, so they are not handled correctly.
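To make the byte-versus-character difference concrete, here is a rough sketch in Python (not MySQL, and not a claim about how MySQL implements REGEXP internally). It assumes only that a byte-wise matcher treats the character class as a set of single bytes, which is the behavior described above.

```python
import re

pattern = "[aąeęi]"
subject = "ą"  # U+0105, two bytes in UTF-8: C4 85

# Character-aware matching: the class consumes one whole character.
char_match = re.fullmatch(pattern, subject) is not None

# Byte-wise matching: the class is just a set of single bytes
# (a, C4, 85, e, C4, 99, i), so it cannot consume both bytes of 'ą'.
byte_pattern = pattern.encode("utf-8")
byte_match = re.fullmatch(byte_pattern, subject.encode("utf-8")) is not None

# Byte-wise matching can also give false positives: 'ć' is C4 87,
# and its first byte (C4) is in the byte set even though 'ć' is not
# in the character class.
false_positive = re.search(byte_pattern, "ć".encode("utf-8")) is not None
char_aware_on_c = re.search(pattern, "ć") is not None

print(char_match, byte_match, false_positive, char_aware_on_c)
```

The same pattern therefore both misses characters it should match and matches characters it should not, depending on how the pattern is anchored.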
Here's an example of it working 'correctly' in 8.0:
mysql> SELECT "edukacją zdrowotna" regexp 'edukacj[aąeęi] zdrowotna';
+-----------------------------------------------------------+
| "edukacją zdrowotna" regexp 'edukacj[aąeęi] zdrowotna' |
+-----------------------------------------------------------+
| 1 |
+-----------------------------------------------------------+
Or, to focus on the single character:
mysql> SELECT 'ą' REGEXP '[aąeęi]';
+-------------------------+
| 'ą' REGEXP '[aąeęi]' |
+-------------------------+
| 1 | <-- 1 == TRUE == it matched
+-------------------------+
I recommend you upgrade: 5.5 -> 5.6 -> 5.7 -> 8.0. Or dump the data and reload on 8.0. In either case, the upgrade will be quite time-consuming due to the large number of "little things" that have changed.
This chart shows that, with any collation (other than utf8_bin), ą = 'a': http://mysql.rjweb.org/utf8_collations.html
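As a loose analogy for what an accent-insensitive collation does, the Python sketch below compares strings after stripping combining marks. This is only an approximation for illustration: MySQL collations use their own weight tables, and the helper name `accent_insensitive_key` is made up here.

```python
import unicodedata

def accent_insensitive_key(text: str) -> str:
    """Hypothetical helper: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# 'ą' decomposes to 'a' + combining ogonek, so the keys compare equal.
same_insensitive = accent_insensitive_key("ą") == accent_insensitive_key("a")

# A binary comparison (like a _bin collation) sees different code points.
same_binary = "ą" == "a"

print(same_insensitive, same_binary)
```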