0

I am experiencing some issue with MySQL encoding parsing data and then storing data from the Twitter API.

The tweet that struggles to get stored into the db is:

   INSERT INTO `statuses` (`status_id`,`text`) VALUES('93332222111111','The beers are on me in this case!�')

The character is this one. whereas the following got stored successfully:

INSERT INTO `statuses` (`status_id`,`text`) VALUES('485072105225921','RT @someone:  Don\'t forget to index timestamp columns like \"created_at\" if you query against them.\n\ne.g.: ORDER BY created_at\ne.g.: WH')

Let's have a look into the character set:

SHOW VARIABLES LIKE 'character_set%'

which brings back

character_set_client    utf8mb4
character_set_connection    utf8mb4
character_set_database  utf8mb4
character_set_filesystem    binary
character_set_results   utf8mb4
character_set_server    utf8
character_set_system    utf8
character_sets_dir  /usr/local/Cellar/mysql/5.7.18_1/share/mysql/charsets/

Anything obvious that I am missing?

Update: Also the following if block:

if utf8.ValidString(strings.Join(values, ",")) == false {
            fmt.Println(strings.Join(values, ","))
        }

returns:

'The beers are on me in this case!�','943304851980963841' 
Dalton Cézane
  • 3,672
  • 2
  • 35
  • 60
thitami
  • 828
  • 2
  • 21
  • 44

2 Answers2

1

Two suggestions possible to solve your problem:

  • Use UTF16 charset;
  • Use utf8mb4 as char set and utf8mb4_unicode_ci as collation.

You can use the following code as an example, extracted from an online tutorial:

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

As your problem is not with the database, you have to use the right code representing the image. I suggest you use "emoji-java": a lightweight java library that helps you use Emojis in your java applications.

An example:

String str = "An :grinning:awesome :smiley:string 😄with a few :wink:emojis!";
String result = EmojiParser.parseToUnicode(str);
System.out.println(result);
// Prints:
// "An awesome string with a few emojis!"

Another edit: You just tell now the language you are using: Go. In this case, you can take a look at Go-emoji here. Or even this another emoji project.

Dalton Cézane
  • 3,672
  • 2
  • 35
  • 60
  • Thanks for the suggestions, @Dalton. The second one is what I currently have in place without any luck – thitami Feb 20 '18 at 18:08
  • You have char set and collation configured to the database and the table? When you print the text before save it to DB, the characters are correct? Are you sure that the problem is only when the character is being saved at DB? Perhaps the problem be when the data is decoded... before you send it to DB. – Dalton Cézane Feb 20 '18 at 18:18
  • See here: INSERT INTO `statuses` (`text`,`status_id`) VALUES('The beers are on me in this case!�','943304851980963841') is the db query that it's trying to execute. Notice how the "" char is converted to "�" – thitami Feb 20 '18 at 20:19
  • So your problem is not with the DB. I updated the answer. You need to specify better your questions descriptions to avoid unnecessary answers. – Dalton Cézane Feb 20 '18 at 22:54
  • Thanks for the provided example. I am experimenting with Golang in this mini-project. – thitami Feb 21 '18 at 08:20
  • I added two more suggestions as you just told now what language you are using. I hope you get success! – Dalton Cézane Feb 21 '18 at 13:11
  • 1
    Caution! Those two `ALTER TABLE` statements do different things. One may help, the other may make things worse. `SELECT HEX(text) ...` to see what the situation is. – Rick James Feb 22 '18 at 01:30
  • 1
    I have tried to change the charset and collation in workbench database and table setting, changing the variables start with `charset` and some other solutions but none of them worked. Thus I follow this answer and try to execute the first 2 statements and surprisingly I'm able to insert Chinese characters successfully. Btw, I'm using AWS RDS and I'm using the `utf8mb4` charset. – jet_choong Dec 17 '19 at 14:24
0

Do not use utf16 for anything.

Use MySQL's CHARACTER SET utf8mb4; it is equivalent to the outside world's UTF-8, and includes the characters that begin with hex F0. (MySQL's utf8 does not include them.)

, hex F09F94A5 needs utf8mb4. Ditto for , hex F09F8DBB.

indicates that something is already messed up. That is, the INSERT was given a bad character.

Please provide SELECT HEX(text) ... to see if you get one of those hex values.

Also see "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks for your input here, @Rick James. That indicates that something is wrong in the golang side then. – thitami Feb 22 '18 at 08:41