0

I can find single chars using:

SELECT * FROM table WHERE column LIKE '%Â%' COLLATE utf8mb4_bin;  

-- 1 result: 246.8 ± 11.7

I have tried a dozen or more REGEXP with COLLATE utf8mb4_bin, but none return what I need. Like:

SELECT * FROM table WHERE AND HEX(column) REGEXP "^[..]*" COLLATE utf8mb4_bin;  

-- 24356, but ALL chars, not just utf8mb4_bin

SELECT * FROM table WHERE AND HEX(column) COLLATE utf8mb4_bin REGEXP "...";  

-- 24045, but ALL chars, not just utf8mb4_bin

Some of the chars I can find using the first query above, one at a time (But I need a REGEXP like query to find all, as I don't know what else exists, and I have 30M rows): ┬╜ †… ╬ô├⌐┬╝

Column collation: utf8mb4_unicode_ci

Thanks! Lou

  • Are you saying that you want to find all the characters in a column that would require the use of utf8mb4 to be properly represented? – RiggsFolly Dec 17 '21 at 16:38
  • utf8mb4_bin (I think utf8mb4 is different? Or is _bin just the method to find utf8mb4 chars?). I can find single utf8mb4_bin chars, but I do not know how many different ones are in the table, so I need to find all. Essentially a REGEXP to find all of them, then I can run replace queries for each. Thanks – Lou Radwan Dec 17 '21 at 17:15
  • `[..]*` looks for dots. You wanted `(..)*` to look for pairs of characters; see my answer. And the "collate" clause is probably irrelevant. – Rick James Dec 17 '21 at 23:58

1 Answers1

0

First, the "CHARACTER SET" (such as utf8mb4) is the encoding. You can search for such in the data.

On top of that, the "COLLATION" (such as utf8mb4_bin) is how the characters are compared. For that, you need to look in information_schema.COLUMNS.

To add to the confusion, ± looks like "Mojibake" for ±. That is where the encoding got messed up when inserting data. Not all "Mojibake" start with Â. This gives a list of Mojibake messes from "latin1" characters: http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings. In the middle of the second table is ±.

± is a rather unlikely case to home in on. Can you find another clue?

Here is a regexp to find rows with any 8-bit chars in the column colname:

SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[89A-F].';

If the table is huge, it will take a long time. Also you might want to tack a LIMIT on the end. (That was found in the same document.)

As for analyzing the cause for Mojibake, see Trouble with UTF-8 characters; what I see is not what I stored And for fixing: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Be sure to pick the case that fits your situation.

But, but, ... If Mojibake is involved, the data may be corrupted. Please provide SHOW CREATE TABLE table, SHOW VARIABLES LIKE 'char%'; I may need some more things to help you figure out what went wrong and how to fix it. (Unless my links suffice.)

An example of using that regexp:

mysql> SELECT city, country FROM `c2014`
    WHERE HEX(city) RLIKE '^(..)*[89A-F].' 
      and id%1237 = 1 and length(city) < 20
    LIMIT 7;
+------------------+---------+
| city             | country |
+------------------+---------+
| Çernoleva        | al      |
| Kashta e Bardhë  | al      |
| Sharrëdushk      | al      |
| São Tomé         | ao      |
| Teca diá Ndala   | ao      |
| Gömür            | az      |
| Balèmyouré       | bf      |
+------------------+---------+

Another note: The output would be the same for either utf8mb4, utf8, or latin1. It actually looks for "non-Ascii" characters.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks Rick James. I will investigate all you replied with and reply with results. Will be a few days at least, other squeaky wheels... – Lou Radwan Dec 20 '21 at 18:49
  • @LouRadwan - I fixed a typo in the regexp. – Rick James Dec 20 '21 at 18:51
  • Rick James, while SELECT * FROM tbl WHERE HEX(colname) RLIKE '^(..)*[8-F].'; does find chars I might want to change to entities, the chars show correct. Like degree sign, superscripts with no sup tags, greeks like mu, etc., I will read thru the links you provided, and see if anything is in there to tweak the query and find others. Thanks again. – Lou Radwan Dec 20 '21 at 19:23
  • @LouRadwan - Please provide a short list of things you want to keep and things you want to dislike. Include the `HEX(column)` where practical. Also `SHOW CREATE TABLE` and `SHOW VARIABLES LIKE 'char%';` – Rick James Dec 20 '21 at 19:47
  • Thanks @Rick James. I need to find chars like: ┬╜ †… ╬ô├⌐┬╝ – Lou Radwan Dec 21 '21 at 13:14
  • Thanks @Rick James. I need to find chars like: ┬ ╜ à ¢ â ‚¬  ⠀ ¦ ╬ ô├ ⌐ ┬ ╝ The edit you provided, is still finding superscripted chars with no sup tags, like: mm³, actual degree char, like: 100°C (not html entity °), actual greek chars like µm/km (not html code for mu, μ). (edit timed out so I pasted edit here) – Lou Radwan Dec 21 '21 at 13:21
  • Do you want to find `┬` but not `μ`? Do you want _all_ accented letters, like `Á`, or only certain ones? "Box drawing" characters are regexp "E29[45].." It will be hard to locate one group of non-ascii characters while accepting another group. I need to know the groupings to move forward. – Rick James Dec 21 '21 at 18:11