10

I'm attempting to import data (tweets and other twitter text information) into a database using Pandas and MySQL. I received the following error:

166: Warning: (1366, "Incorrect string value: '\xF0\x9F\x92\x9C\xF0\x9F...' for column 'text' at row 3") result = self._query(query)

166: Warning: (1366, "Incorrect string value: '\xF0\x9F\x98\x8D t...' for column 'text' at row 5") result = self._query(query)

After a thorough search it seems as if there's something wrong in the way my database columns are set up. I've tried setting the database charset to UTF8 and collating it to utf_unicode_ci but I still receive the same error.

The following is the code that imports the data to the database:

#To create connection and write table into MySQL

engine = create_engine("mysql+pymysql://{user}:{pw}@{lh}/{db}?charset=utf8"
                       .format(user="user",
                               pw="pass",
                               db="blahDB",
                               lh="bla.com/aald/"))

df.to_sql(con=engine, name='US_tweets', if_exists='replace')

The data I'm importing consist of the following data types: 'int64', 'object' and 'datetime64[ns]'. I found out these data types by printing the data to the console with

print(df['tweett']) >>> returns dtype 'object'

I'd appreciate any help, thanks!

Sterling King
  • 163
  • 1
  • 3
  • 13

2 Answers2

13

You need utf8mb4, not utf8, when connecting to MySQL and in the columns involved.

More python tips: http://mysql.rjweb.org/doc.php/charcoll#python (Except use utf8mb4 in place of utf8. UTF-8 should not be changed.)

A more detailed explanation to this can be found here.

Eswar
  • 1,201
  • 19
  • 45
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • God bless you Rick. – Sterling King Nov 22 '17 at 11:48
  • Thanks. works like magic. But how do you by seeing the error that changing to utf8mb4 will solve the problem? Any explanation? – Eswar Jan 25 '19 at 09:42
  • @Eswar - In some step (I don't know where) between the client and the database table, something was "utf8" when it needed to be "utf8mb4". Note: MySQL's "utf8" is not the same as the outside worlds "UTF-8", which does correspond to "utf8mb4". – Rick James Jan 25 '19 at 16:04
  • @RickJames https://stackoverflow.com/a/30074553/5658251. This link would throw more light on the answer.. – Eswar Jan 29 '19 at 06:37
  • 1
    @Eswar - Yes, that link is a good explanation of the difference between MySQL's pair of charsets. But it does not Sterling where to find his 'bug'. Non-BMP codes are hex F0xxyyzz. That is, the `F0` told me that utf8mb4 was involved. The rest of the error message implied that such a 4-byte 'character' was trying to be stuffed into something, presumably utf8, that could not handle it. Since Python was involved, I pulled out my cheat sheet on Python and utf8, hoping that the answer was there. (I don't know Python.) – Rick James Jan 29 '19 at 06:45
  • -- The first character in question. – Rick James Jan 29 '19 at 06:47
1

Change the character set and collation properties of the databases, tables, and columns to use utf8mb4 instead of utf8. docs

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)
  • I ran all those sql queries but still getting the "Incorrect string value" error. What could i be missing? – JayB Jul 02 '22 at 14:55