I'm moving tens of millions of rows of text data from MySQL to a search engine and can't successfully handle a Unicode error in one of the retrieved strings. I've tried explicitly encoding and decoding the retrieved strings to make Python throw Unicode exceptions and learn where the problem lies.
This exception is thrown after running through tens of millions of rows on my laptop (sigh...), but I'm unable to catch it, skip that row, and move on, which is what I want. All text in the MySQL database is supposed to be UTF-8.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte
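To make the error concrete, here is a minimal sketch of how a message like this can arise. It assumes (this is only a guess on my part) that some row holds Latin-1 bytes in a column that is declared as UTF-8. The sample string is made up and does not come from my data. It also shows that encoding a `str` to UTF-8 and decoding it again round-trips cleanly, so that step on its own would not raise this particular error.

```python
# Hypothetical sample text containing U+00ED (i with acute accent).
text = "blogg \u00ed dag"

# Assumption: the bytes were stored as Latin-1, where that letter is the single byte 0xED.
raw = text.encode("latin-1")

message = None
try:
    raw.decode("utf-8")  # 0xED starts a 3-byte UTF-8 sequence, but a space follows
except UnicodeDecodeError as e:
    message = str(e)
    print(message)

# Encoding an already-decoded str and decoding it again is lossless.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

If that guess is right, the failure happens when raw bytes are first turned into a `str`, not when a `str` is re-encoded afterwards.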
Here's the connection I establish using MySQL Connector/Python:
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='bloggz',
                              charset='utf-8')
Here are the database character settings:
mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR
Variable_name LIKE 'collation%';
+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+
What's wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that's probably just proof that the except clauses are never reached.
last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:
    try:
        # to catch UnicodeErrors and see where the problem lies
        # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
        # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error
        # feeds.URL is varchar(255) in mysql
        enc_url = url.encode(encoding='UTF-8', errors='strict')
        dec_url = enc_url.decode(encoding='UTF-8', errors='strict')
        # texts.title is varchar(600) in mysql
        enc_title = title.encode(encoding='UTF-8', errors='strict')
        dec_title = enc_title.decode(encoding='UTF-8', errors='strict')
        # texts.html is text in mysql
        enc_html = html.encode(encoding='UTF-8', errors='strict')
        dec_html = enc_html.decode(encoding='UTF-8', errors='strict')
        data = {"timestamp": ts,
                "url": dec_url,
                "bid": bid,
                "title": dec_title,
                "html": dec_html}
        es.index(index="blogposts",
                 doc_type="blogpost",
                 body=data)
    except UnicodeDecodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeEncodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
    except UnicodeError as e:
        print("Last feeds id: {}".format(last_feeds_id))
        print(e)
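One thing I can test in isolation is whether the exception comes from fetching the row rather than from the loop body. The sketch below assumes that it does: if decoding happens inside the iterator's `__next__`, a `try` placed inside the `for` body never sees the error, whereas calling `next()` inside the `try` does. `FakeCursor` and `collect_rows` are hypothetical stand-ins I made up for illustration. They are not part of MySQL Connector/Python, and I have not verified that the real cursor can keep fetching after such an error.

```python
class FakeCursor:
    """Hypothetical stand-in for a cursor that decodes each row as it is fetched."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        raw = next(self._rows)       # StopIteration propagates when rows run out
        return raw.decode("utf-8")   # may raise UnicodeDecodeError during the fetch


def collect_rows(cursor):
    """Fetch rows one at a time, skipping any that fail to decode."""
    good, skipped = [], 0
    while True:
        try:
            row = next(cursor)       # the fetch itself is inside the try block
        except StopIteration:
            break
        except UnicodeDecodeError as e:
            skipped += 1
            print("Skipping undecodable row: {}".format(e))
            continue
        good.append(row)
    return good, skipped


# The middle row contains 0xED followed by a space, which is not valid UTF-8.
rows = [b"first", b"caf\xed row", b"third"]
good, skipped = collect_rows(FakeCursor(rows))
print(good, skipped)
```

With a plain `for row in FakeCursor(rows):` loop and the `try` inside the body, the same bad row would raise before the body ever runs, which would match the fact that none of my except clauses print anything.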