
I am using Pandas, MySQL, and SQLAlchemy.

I am currently using pandas.to_sql() to submit my pandas DataFrame -- which contains non-ASCII text -- to my MySQL database. Note that with other DataFrames, this process works perfectly.

However, with non-ASCII text, the following call raises an error:

df.to_sql(
        name='tableOne',
        con=engine,
        index=False,
        if_exists='replace')

Resulting Error (Last line):

UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 143: ordinal not in range(128)

After reading similar posts, I've found the answer is to specify 'utf8' in the engine, as below:

engine = create_engine(f'mysql+mysqldb://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST}:{MYSQL_PORT}/{MYSQL_DATABASE}?charset=utf8', encoding="utf8")

I have tried everything I could think of and everything I could find online. I changed my engine to include every combination of 'utf8', 'utf-8', etc. I've also tried manually encoding each text column of my DataFrame to UTF-8. None of this works.
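For reference, my column-encoding attempt looked roughly like the sketch below (illustrative only, not my exact code):

    # Illustrative sketch: force every text (object) column to UTF-8 encoded bytes.
    # This did not resolve the to_sql() error.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.encode('utf-8')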

I've worked off the following similar SO posts, which seem to describe the same issue and therefore should have the same solution, yet nothing has worked for me:

Another UnicodeEncodeError when using pandas method to_sql with MySQL

How to handle encoding in Python 2.7 and SQLAlchemy

How do I get SQLAlchemy to correctly insert a unicode ellipsis into a mySQL table?

convert pandas dataframe to utf8

Thanks for any help!

Edit: fixed brainfart: 'Unicode' -> 'ascii'

Jonathan
  • What do you mean by "non-Unicode text"? You'd have to be using *very* rare characters if they still haven't been considered by Unicode. (And you would have had to invent your own encoding for them.) – lenz Feb 06 '19 at 22:55
  • Yep, you're right -- that was a typo. I meant ASCII. Essentially, I need my Python connection to MySQL to encode in UTF-8 but haven't been able to have it do so – Jonathan Feb 07 '19 at 01:08

1 Answer


E5 is not ascii, nor is it valid UTF-8 by itself. With CHARACTER SET latin1 it is å; does that make sense? So I suggest you have two mismatches: one place is asking for ascii, another is providing latin1, and yet another is expecting utf8.

However, as @lenz points out, E5 could also be the first byte of a CJK character. The error message, since it is looking for ascii, only shows the first offending byte. Seeing the next two bytes would confirm this.
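To see the point concretely, here is a quick Python illustration (not from the question, just a demo of how the same byte reads under each encoding):

    raw = b'\xe5'
    print(raw.decode('latin-1'))   # 'å' -- 0xE5 is a complete character in latin1
    # raw.decode('ascii')          # UnicodeDecodeError: byte is outside the ASCII range
    # raw.decode('utf-8')          # UnicodeDecodeError: 0xE5 alone is an incomplete 3-byte sequence

    # With two continuation bytes after it, 0xE5 leads a CJK ideograph in UTF-8:
    print(b'\xe5\xad\x97'.decode('utf-8'))   # '字' (U+5B57)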

I think you are missing use_unicode=True.

See the Python and Pandas notes farther down that page.
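A minimal sketch of the connection setup with both options set (this assumes the mysqlclient/MySQLdb driver; utf8mb4 is used instead of utf8 since it covers all of Unicode):

    from sqlalchemy import create_engine

    engine = create_engine(
        f'mysql+mysqldb://{MYSQL_USER}:{MYSQL_PASSWORD}@{MYSQL_HOST}:{MYSQL_PORT}/'
        f'{MYSQL_DATABASE}?charset=utf8mb4&use_unicode=1'
    )

    df.to_sql(name='tableOne', con=engine, index=False, if_exists='replace')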

Rick James
  • Technically, `E5` is a valid UTF-8 start byte for code points U+5000 through U+5FFF. But given that these are CJK ideographs and that the first 142 characters of the OP's text are ASCII, I also think it's more likely that the data are encoded with an ISO-8859 encoding or some Windows Western code page. – lenz Feb 07 '19 at 08:19