1

I'm getting errors when certain characters are being added to a table... even when the column is has utf8mb4 character set. For example:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE0' for column 'surname'

The data in question is: SEGAL‡ (note the double dagger)

Is this character beyond even 4 byte UTF8 or is the collation causing the issue? Or is it something else?

Screenshot showing character set and collation of the column:

screenshot showing utf8mb4 character set

It's a Laravel 8 app and the MySQL connection is configured to the following:

'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',

Looking at the CSV file in PHPStorm, non-ASCII characters are displayed as . I've tried explicitly setting the file encoding to UTF-8 in PHPStorm (with and without BOM).

If I open the CSV in Excel then the non-ASCII characters display correctly. Confused.

Update

Examining the CSV in a HEX editor shows that a character like ä is stored as a single byte (8A). When this CSV is opened in Excel it correctly shows ä, but in everything else it shows .

I don't know what character encoding Excel is using, as this character should be typically stored as E4 when using a single byte, or C3 A4 in UTF-8.

Chuck Le Butt
  • 47,570
  • 62
  • 203
  • 289
  • What is the `collation`? This should also be a `utf8mb4_` subset. You need to check both your connection character set and collation as well as the database/table/data. – Martin Jun 28 '22 at 14:32
  • @Martin The collation is visible in the screenshot: `utf8mb4_0900_ai_ci` – Chuck Le Butt Jun 28 '22 at 14:34
  • It may be visible but it was not identified `:-p` . So what's your database user connection character set? Have you [**read this**](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through)? – Martin Jun 28 '22 at 14:35
  • @Martin Updated with more information – Chuck Le Butt Jun 28 '22 at 14:54
  • Your CSV file may need a Byte Order Mark on it (BOM) [reference](https://stackoverflow.com/questions/4192186/setting-a-utf-8-in-java-and-csv-file) . Regenerate your CSV with one and see if that works? – Martin Jun 28 '22 at 14:55
  • @Martin Updated again. Confused. – Chuck Le Butt Jun 29 '22 at 00:26
  • Ok, well I've looked up the sequence you posted (`\xE0`) and online character tables can't find it. If the sequence you post is not a legitimate UTF-8 character then obviously it won't display. Can you change all your parts to UTF-16 perhaps to see if that extended set would correctly show it? Either in the source file (not via Excel) or on the database. – Martin Jun 29 '22 at 09:58
  • Further, the character you display appears to be Unicode character `U+2021` . I don't think you need UTF-16, I can find this character on UTF-8 catalogues. I can't edit my comment now. – Martin Jun 29 '22 at 10:07
  • @Martin Thanks for your help. I believe Excel is to blame during the export (although I can't fathom what encoding it's using). I've opened a new question about editing the VB script. – Chuck Le Butt Jun 29 '22 at 12:11

1 Answers1

1

Double dagger is hex E2 80 A1 (only 3 bytes) when encoded in CHARACTER SET utf8 or utf8mb4. It is also available in latin1 as hex 87.

C3A4 [ä]LATIN SMALL LETTER A WITH DIAERESIS (in utf8/utf8mb4) or E4 in latin1.

Please provide SHOW VARIABLES LIKE 'char%';

See "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored for the likely cause.

The CHARACTER SET is the "encoding" and is relevant; the COLLATION is how characters compare (eg case-folding) and is not relevant.

Nothing in the Question hints of a BOM (http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark) EF BB BF at the beginning of some UTF-8-encoded files.

Do not use ucs-2 (UTF-16) unless you happen to have a file encoded that way. Still, I would declare the tables as utf8, not ucs-2.

More

I plugged 8AE0E8 into a handy script (that does CONVERT(CONVERT(UNHEX('8ae0e8') USING %s) USING utf8mb4) and got:

ascii, big5, binary, cp932, eucjpms, euckr,
    gb18030, gb2312, gbk, sjis, ujis,
    utf8mb3, utf8mb4   ---invalid in these charsets
                                     swe7   3  3 '???'
                                   cp1257   5  3 '?ąč'
                                   cp1256   5  3 '?àè'
                                  geostd8   5  3 '?ჭ?'
                                   latin7   6  3 'ąč'
                             dec8, latin5   6  3 'àè'
                                      hp8   6  3 'ÁÒ'
                                    macce   6  3 'äŗŤ'
                                    cp850   6  3 'èÓÞ'
                                  keybcs2   6  3 'ĹαΦ'
                                    cp852   6  3 'ŐÓŔ'
                                   latin2   6  3 'ŕč'
                                   latin1   6  3 'Šàè'
                                   cp1250   6  3 'Šŕč'
                                    greek   6  3 'ΰθ'
                                    cp866   6  3 'Крш'
                                   cp1251   6  3 'Љаи'
                                 armscii8   6  3 'ՈՌ'
                                   hebrew   6  3 'אט'
                             koi8r, koi8u   7  3 '┼ЮХ'
                                 macroman   7  3 'ä‡Ë'
                                   tis620   8  3 'เ่'

It looks like macroman is the charset. I think that is specific to Apple.

Rick James
  • 135,179
  • 13
  • 127
  • 222