0

I have a column that contains following content:

+------------+
| name       |
+------------+
| 你好世界    |
| HelloWorld |
| Hello世界  |
+------------+

and I hope

SELECT `name` FROM `table`  WHERE `name` REGEXP '[u4e00-u9fa5]';

gives me only Chinese contained row like this:

+------------+
| name       |
+------------+
| 你好世界    |
+------------+

but it actually gives me contrary result:

+------------+
| name       |
+------------+
| HelloWorld |
| Hello世界  |
+------------+

I knew that:

SELECT `name` FROM `table`  WHERE `name` NOT REGEXP '[u4e00-u9fa5]';

can work as expected,but I want to know why mysql regexp gives contrary result?Is this the default setting?Or I made a mistake.Thanks in advance.

Alex Luya
  • 9,412
  • 15
  • 59
  • 91
  • 1
    The class `[u4e00-u9fa5]` may contain `u, 4, e, 0, 9, f, a, 5` and all the characters between `0` and `u` (`[0-u]`). In this case only latin alphanumeric words would be matched. Are you sure that `u4e00` means U+4E00? Without any escape? Does MariaDB supports it or you should use `\p{xx}`? – logi-kal Jul 29 '16 at 10:51
  • 1
    MySQL regexp is not Unicode aware, as far as I know. See http://stackoverflow.com/a/7606528/3832970. Perhaps, this post should be closed as a duplicate of that question. – Wiktor Stribiżew Jul 29 '16 at 10:54
  • No,I am not sure,but "NOT REGEXP " does give me the expected result. – Alex Luya Jul 29 '16 at 11:05
  • 2
    `NOT REGEXP '[u4e00-u9fa5]'` means your results should not match a record that contains `u`, `4`, `e`, `0`, `9`, `f`, `a`, `5` (case insensitively), you do not match code points at all. You might get expected results but only for the current test data. – Wiktor Stribiżew Jul 29 '16 at 11:19
  • you are right,thanks – Alex Luya Jul 29 '16 at 11:49
  • See also http://stackoverflow.com/questions/38658540/how-to-use-regular-expression-to-match-word-without-any-letter-digital-or-specia – Rick James Jul 30 '16 at 15:51

1 Answers1

0

If you are checking to see if a utf8 string has CJK characters in it:

WHERE HEX(name) REGEXP '^(..)*E[456789]'

That will not include the Chinese characters that are not in the "BMP" plane.

Rick James
  • 135,179
  • 13
  • 127
  • 222