1

I'm moving tens of millions of rows of text data from mysql to a search engine and can't successfully handle a Unicode error in one of the retrieved strings. I've tried to explicitly encode and decode the retrieved strings to cause Python to throw Unicode exceptions and learn where the problem lies.

This exception is thrown after running through tens of millions of rows on my laptop (sigh...), but I'm unable to catch it, skip that row and move on which is what I want. All text in the mysql database is supposed to be utf-8.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte

Here's the connection I establish using Mysql Connector/Python

cnx = mysql.connector.connect(user='root', password='<redacted>',
                          host='127.0.0.1',
                          database='bloggz',
                          charset='utf-8') 

Heres the database character settings:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR 
Variable_name LIKE 'collation%';

+--------------------------+-----------------+

| Variable_name | Value |

+--------------------------+-----------------+

| character_set_client | utf8 |

| character_set_connection | utf8 |

| character_set_database | utf8 |

| character_set_filesystem | binary |

| character_set_results | utf8 |

| character_set_server | utf8 |

| character_set_system | utf8 |

| collation_connection | utf8_general_ci |

| collation_database | utf8_general_ci |

| collation_server | utf8_general_ci |

+--------------------------+-----------------+

What's wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed out either, but that's probably just a proof that the except clause doesn't work.

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:

  try:
    # to catch UnicodeErrors and see where the prolem lies
    # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
    # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

    # feeds.URL is varchar(255) in mysql
    enc_url = url.encode(encoding = 'UTF-8',errors = 'strict')
    dec_url = enc_url.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.title is varchar(600) in mysql
    enc_title = title.encode(encoding = 'UTF-8',errors = 'strict')
    dec_title = enc_title.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.html is text in mysql
    enc_html = html.encode(encoding = 'UTF-8',errors = 'strict')
    dec_html = enc_html.decode(encoding = 'UTF-8',errors = 'strict')

    data = {"timestamp":ts,
            "url":dec_url,
           "bid":bid,
           "title":dec_title,
           "html":dec_html}
    es.index(index="blogposts",
            doc_type="blogpost",
            body=data)
  except UnicodeDecodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeEncodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)
mattiasostmar
  • 2,869
  • 4
  • 17
  • 26

1 Answers1

-1

It complained about hex ED. Were you expecting acute-i: í? If so, then the text you have is not encoded UTF-8, but rather one of cp1250, dec8, latin1, latin2, latin5.

Does your Python source code start with

# -*- coding: utf-8 -*-

See more Python-utf8 tips

Also, review "Best Practice" here

You have charset='utf-8'; I'm not sure, but perhaps that should be charset='utf8'. Reference UTF-8 is what the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of the dash.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • The mysql database in many years old and I didn't create it, so perhaps it isn't UTF-8 after all. But why doesn't my code catch the exception and carry on? I use Jupyter Notebook, so no -*- coding: utf-8 -*- and I can't say if I'd expect the acute-i character in the texts, it's mainly Swedish, but it could be quotes in other languages. – mattiasostmar Jul 16 '18 at 09:36
  • @mattiasostmar - All accented Western European characters will have the same issue. Can you get a hex dump of that part of the input? See also, the last paragraph that I added. – Rick James Jul 16 '18 at 17:45
  • You're right. I altered to 'latin1' in mysql.connector.coonect(...charset='latin1') and also in the .encoding('latin1'...) .decoding('latin1'...) lines and it never thew an exception. Done! – mattiasostmar Jul 18 '18 at 08:52
  • @mattiasostmar - Good. What is the `.encoding('...')` call? I don't have it in my notes. (Maybe I get to learn something here?) – Rick James Jul 18 '18 at 14:07
  • I actually only added the explicit .encoding()/.decoding() lines together with the three except clauses to trigger an exception and learn about the problem. I picked it up from [this thread](https://mail.python.org/pipermail/python-list/2012-July/627441.html), but probably I misunderstood the procedure since I'm not very knowledgable about Unicode yet. – mattiasostmar Jul 18 '18 at 19:44
  • @mattiasostmar - I asked about `encoding` because in my attempt to figure out all the types of `CHARACTER SET` errors (most fall into 5 distinct scenarios), none need the help of any kind of application conversion routines. – Rick James Jul 18 '18 at 22:29