2

I'm quite confused. I have a source string in the database; some HTML:

"body": "<html><head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head><body><div dir=\"auto\">Nope no attachment</div><br><div class=\"gmail_quote\"><div dir=\"ltr\" class=\"gmail_attr\">

Here's the SELECT:

SELECT * FROM table1 WHERE column_details REGEXP '(nope.no).attach';

When I select this with REGEXP of '(nope.no).attach' it fails to match. When I uppercase the Nope, it matches. When I LCASE(column_details) and go back to lower-case nope, it matches. What's going on here? My understanding is REGEXP is case insensitive, and as far as I know I'm not enforcing binary mode... or is it binary by default? If so, how do I make it case insensitive and disable binary mode matching?

Thanks!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Floobinator
  • 388
  • 2
  • 11
  • what is the type of column_details? that's going to determine the case sensitivity, just like if the value was 'N' and you did column_details='n' – ysth May 02 '22 at 23:24
  • What is the collation of `column_details`? Please provide `SHOW CREATE TABLE table1` – Rick James May 05 '22 at 15:11

1 Answers1

1

It's not that REGEXP is case-insensitive. It depends on the collation of the column. REGEXP can be case-sensitive or insensitive.

mysql> set @h = '<html><head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head><body><div dir=\"auto\">Nope no attachment</div><br><div class=\"gmail_quote\"><div dir=\"ltr\" class=\"gmail_attr\">';
Query OK, 0 rows affected (0.00 sec)

mysql> select @h regexp '(nope.no).attach';
+------------------------------+
| @h regexp '(nope.no).attach' |
+------------------------------+
|                            1 |
+------------------------------+
1 row in set (0.00 sec)

mysql> select @h collate utf8mb4_bin regexp '(nope.no).attach';
+--------------------------------------------------+
| @h collate utf8mb4_bin regexp '(nope.no).attach' |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+

I'd check the collation of your column:

SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'table1' AND COLUMN_NAME = 'column_details';
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • note that a *blob or varbinary type will have a null collation, and match case-sensitively. – ysth May 03 '22 at 00:25
  • Ahh - it's a JSON object, so that would explain it. And you were correct! Collating to utf8mb4_unicode_ci fixed it. Thanks a ton! – Floobinator May 03 '22 at 00:28
  • BTW is there any way to have the JSON object default to utf8mb4_unicode_ci collation? My guess is defining the collation in the SELECT will slow things down quite a bit? – Floobinator May 03 '22 at 00:37
  • Regexp can't use an index anyway. In MySQL 8.0 you can define an index on the expression, but only for a fixed pattern. I assume you want something that can work with any other pattern, but that's not possible to optimize. – Bill Karwin May 03 '22 at 00:43
  • Understood. Thanks a ton @BillKarwin. Your answers were solid and very helpful! – Floobinator May 06 '22 at 15:13