1

I have a CSV file containing tweets with emojis (eg. "Cool! ") and I need to import them into a MySQL table in such a way they will be saved/displayed correctly... What do I have to set up and how for a correct import (I mean collation, etc.)?

More details:

  • In the CSV file, the emoji are visible
  • The encoding of the CSV file is UTF-8
  • I am on Windows 11

I already tried:

  • To set the character set to utf8mb4 and collation to utf8mb4_unicode_ci in the table
  • To add " SET NAMES 'utf8mb4';" (also tried with Latin1) before the LOAD query
  • 1
    I may be wrong, but when you say _"In the CSV file, the emoji are visible"_, it will depend on how you _"view"_ the CSV. If you open it in an application which handles emojis in the display layer, then they will appear as little images. If you open it in notepad, you will see the actual text. _"Normally"_, the replacement with little images occurs in the display layer of an application. – user1191247 Feb 11 '23 at 11:42
  • In the case of your example ("Cool! "), it is the unicode character [U+1F495](https://www.compart.com/en/unicode/U+1F495). – user1191247 Feb 11 '23 at 12:18
  • 1
    Your question is too vague as it stands. Can you clarify what actually happens when you _"tried everything"_? What is your _specific problem_? – skomisa Feb 11 '23 at 16:23
  • 1
    1. How are you importing the file? 2. Share the full table structure (`SHOW CREATE TABLE`). 3. How are you verifying that it _didn't_ work? What reads this, where/how? 4. What do you see instead? – Evert Feb 11 '23 at 20:15
  • is 4 hex bytes: F09F9295 -- Is that what you see? Try `SELECT HEX(col)...`. – Rick James Feb 12 '23 at 00:04
  • See also https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Feb 12 '23 at 00:05

1 Answers1

1

The table must encode text in character set utf8mb4 to store emojis.

Demo:

mysql> create table no ( t text ) character set=utf8;

mysql> load data local infile 'm.csv' into table no;

mysql> select * from no;
+---------+
| t       |
+---------+
| Cool! ? |
+---------+

So utf8 does not support emojis.

mysql> create table yes ( t text ) character set=utf8mb4;

mysql> load data local infile 'm.csv' into table yes;

mysql> select * from yes;
+------------+
| t          |
+------------+
| Cool!      |
+------------+

But utf8mb4 does support emojis. The difference is that utf8mb4 supports 4-byte encodings, but utf8 doesn't. This is an unfortunate part of MySQL's history, that they didn't implement utf8 originally to support the Supplemental Multilingual Plane of the UTF-8 standard.

Let's see if altering the first table helps.

mysql> alter table no character set utf8mb4;

mysql> load data local infile 'm.csv' into table no;

mysql> select * from no;
+---------+
| t       |
+---------+
| Cool! ? |
| Cool! ? |
+---------+

Why didn't this work? Because alter table ... character set does not convert existing columns. It only changes the table's default character set, which will not be used until the next time we add a column to that table.

We can see that the existing column is still using the old character set:

mysql> show create table no\G
*************************** 1. row ***************************
       Table: no
Create Table: CREATE TABLE `no` (
  `t` text CHARACTER SET utf8mb3  
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

utf8mb3 is the character set that utf8 is an alias for in MySQL 8.0.

To convert existing columns, use:

mysql> alter table no convert to character set utf8mb4;

mysql> show create table no\G
*************************** 1. row ***************************
       Table: no
Create Table: CREATE TABLE `no` (
  `t` mediumtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

No try the load again:

mysql> load data local infile 'm.csv' into table no;

mysql> select * from no;
+------------+
| t          |
+------------+
| Cool! ?    |
| Cool! ?    |
| Cool!      |
+------------+

Note that someday, MySQL may change the 'utf8' alias to mean utf8mb4. This is shown in warnings on many of the above usages of 'utf8':

'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828