0

I want to search my database for any string which contains the Butterfly Emoji - - using regexp.

For example

SELECT *
     FROM   `table`
     WHERE  `text`
     REGEXP ''

I'm using REGEXP because I might want to search for Hello[[:space:]]world or similar.

I get the error

Got error 'nothing to repeat at offset 0' from regexp

This works:

SELECT *
     FROM   `table`
     WHERE  `text`
     LIKE   '%%'

But then I lose the ability to search for, say, flying[[:space:]]

My Collation is utf8mb4_unicode_ci. The database is 10.0.36-MariaDB

Terence Eden
  • 14,034
  • 3
  • 48
  • 89
  • i advice you to read [Why should I provide an MCVE for what seems to me to be a very simple SQL query?](https://meta.stackoverflow.com/questions/333952/why-should-i-provide-an-mcve-for-what-seems-to-me-to-be-a-very-simple-sql-query) – Raymond Nijland Dec 08 '18 at 14:03
  • If it uses utf-16 you should use a surrogate pair `'\uD83E\uDD8B'`, otherwise use a codepoint `'\U0001F98B'` –  Dec 08 '18 at 18:18
  • @sln - The text provided in the Question is UTF-8 (hex `F09FA68B`), not utf-16. Do not use `CHARACTER SET utf16`, nor `\u` codes. – Rick James Dec 08 '18 at 21:02
  • It could be the encoding of the MySQL source code text which means you have to use codepoints in the regex, which is always a better idea. @RickJames - Regular expression source text are not represented with _byte code_. They represent _CODEPOINTS_ which disambiguates from the encoding source of both the regex source and target texts. This is typically processed as the _Character Unit_ where as target encoding to that unit is an opaque background task when the engine runs. `\uXXXX` `\u{XX}` `\x{XX}` are encoding independent _CODEPOINT_ constructs. –  Dec 09 '18 at 11:59
  • @sln - `\uD83E\uDD8B` is, to MySQL, 10 ascii bytes. `\u` turns into `u`. Note the absence of `\u` on https://dev.mysql.com/doc/refman/8.0/en/string-literals.html . It _may_ be that the _client_ is interpreting codepoints. `SELECT LENGTH('\uD83E\uDD8B');` --> 10. See also `HEX()` and `CHAR_LENGTH()`. – Rick James Dec 09 '18 at 18:13
  • @sln - The plot thickens. `SELECT HEX('\U+1F98B');` --> `F09FA68B`. (Note the added `+`.) But `\U+0034` acts like an ascii string. I gotta go scratch my head. – Rick James Dec 09 '18 at 18:21
  • 1
    @RickJames - Ok, did little research. Num 1: If using version MySQL 8.0.4 or greater it uses the ICU regex engine. Num 2: A back slash has to be escaped to get a backslash to the engine. Num 3: Codepoint notation in ICU uses construct `\x{}`. Num 4: Have to use the `REGEXP` keyword. Num 5: sometimes the target is in utf-16, otherwise its utff-8/32. So, if you have a known mbcs, convet it to 8/16/32 first. Putting that into place, try this. `REGEXP '\\x{D83E}\\x{DD8B}'` or `REGEXP '\\x{1F98B}'` –  Dec 10 '18 at 01:50
  • 1
    https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=7c3c55c88b81996677e6271c0a0f51f5 I selected MySQL version 8. I don't think mariadb uses ICU engine like MySQL does for it's regex. So, good luck with that stuff .. –  Dec 10 '18 at 02:00

2 Answers2

2

Honestly I don't know why, but escaping your butterfly will give the desired output. (At least in my version, MariaDB 10.3.10, which gave the same error without escaping).

SELECT * FROM `table` WHERE `text` REGEXP '\\'

(note the double backslash, the first one is to escape the backslash within the string, yielding in the regular expression \)

Peter van der Wal
  • 11,141
  • 2
  • 21
  • 29
  • Interesting. When I try that, it finds any text with the `?` character - as well as ones with the ``. Very strange! So your answer gets me closer - but itsn't quite right for me. – Terence Eden Dec 08 '18 at 14:17
  • @TerenceEden: I can't reproduce the problem, see [dbfiddle](https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=182b6a4183397886f20833bc3aa4a9f8). – wchiquito Dec 08 '18 at 14:31
  • That fiddle is using MariaDB 10.3.11. I'm using 10.0.36. I suspect that's why. – Terence Eden Dec 08 '18 at 14:42
  • I was using 10.3 as well, with the same error (although I got that error only on a regexp with a single butterfly, not with the `flying[[:space:]]` prefix). Maybe some server or connection setting? Or maybe PhpMyAdmin going wrong in my case... – Peter van der Wal Dec 08 '18 at 14:47
  • This helped me on MariaDB 10.1.44 searching for `(foo||baz)`. Changing to `(foo|\\|baz)` performed the search as expected without errors. All character sets (client, database, etc.) were already set properly in each case. – Christopher Schultz Jun 13 '20 at 14:33
0
SHOW VARIABLES LIKE 'char%';

It sounds like you have not told MySQL what encoding the client is using for characters. This is best done via the connection parameters, or via mysqli_charset (if using mysqli, not PDO).

Also, run this on your version:

SELECT 'ab' REGEXP '?';

I suspect it will give you the same error message.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Show Variables gives character_set_client utf8mb4 character_set_connection utf8mb4 character_set_database latin1 character_set_filesystem binary character_set_results utf8mb4 character_set_server latin1 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ So it looks correct? You're right about the error from the select statement. – Terence Eden Dec 08 '18 at 21:36
  • Yes, that looks correct. See "question mark" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Dec 09 '18 at 01:27