
I'm getting

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

when I pass text coming from a MySQL database, which I am accessing using SQLAlchemy, to this function:

re.compile(ur"<([^>]+)>", flags=re.UNICODE).sub(u" ", s)

The database encoding is utf-8 and I am even passing the encoding to the create_engine function of SQLAlchemy.

Edit: This is how I am querying the database:

doc = session.query(Document).get(doc_id)
s = doc.title

As suggested, I passed s.decode('utf-8') to sub. The error above disappeared, but I now get a different error for a different document:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xeb in position 449: invalid continuation byte

The database table is defined like this:

CREATE TABLE `articles` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `cdate` datetime DEFAULT NULL,
  `link` varchar(255) DEFAULT NULL,
  `content` text,
  UNIQUE KEY `id` (`id`),
  UNIQUE KEY `link_idx` (`link`)
) ENGINE=InnoDB AUTO_INCREMENT=4127834 DEFAULT CHARSET=utf8;

Any help would be greatly appreciated.

user1491915
  • Can we see some more code? Where does `s` come from? Would `s.decode('utf8')` fix things? – Martijn Pieters Aug 15 '12 at 15:35
  • @MartijnPieters Adding s.decode('utf-8') fixes the error for that particular document, but if I try to get a different document from the database I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xeb in position 449: invalid continuation byte. So, same error, different character. – user1491915 Aug 15 '12 at 15:41
  • No, that's a different error (one decodes from ascii, the other from utf-8). That means that the second document is not UTF-8 data *at all*. Which is why we want to see where `s` comes from. – Martijn Pieters Aug 15 '12 at 15:42
  • Next question: how is the `title` field defined in your schema? – Martijn Pieters Aug 15 '12 at 15:44
  • @MartijnPieters I've added the table info to the post. – user1491915 Aug 15 '12 at 15:47
  • The problem may not be SQLAlchemy. Check that MySQL's encoding is really UTF-8 at the table level. (It helps if you have a GUI tool like `mysql-admin`.) – Savir Aug 15 '12 at 15:49
  • @BorrajaX Sequel Pro shows the table encoding as utf-8 – user1491915 Aug 15 '12 at 16:04
  • I had to follow [these](https://stackoverflow.com/a/16737776/5272567) steps – Matthias Jan 24 '23 at 13:16
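
The distinction drawn in the comments above (an implicit ASCII decode versus bytes that are not UTF-8 at all) can be reproduced in a few lines. This is a minimal Python 3 sketch; the byte strings and the helper `try_decode` are made up for illustration, and treating the 0xeb byte as Latin-1 is an assumption, not something the question confirms.

```python
# Illustrative byte strings, not data from the question.
utf8_bytes = "\u00a0Title".encode("utf-8")    # begins with 0xc2 0xa0 (non-breaking space)
latin1_bytes = "No\u00ebl".encode("latin-1")  # contains 0xeb, which is 'ë' in Latin-1


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Valid UTF-8 cannot be decoded as ASCII (0xc2 is outside range(128)),
# but decodes cleanly once the right codec is named.
print(try_decode(utf8_bytes, "ascii"))
print(try_decode(utf8_bytes, "utf-8"))

# Latin-1 bytes are not valid UTF-8: 0xeb announces a multi-byte sequence
# that the following byte does not continue.
print(try_decode(latin1_bytes, "utf-8"))
print(try_decode(latin1_bytes, "latin-1"))
```

If the second case matches what is stored or transmitted, calling s.decode('utf-8') on every value cannot work, because the bytes reaching Python are in a different encoding.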

2 Answers


I have solved the issue. SQLAlchemy was returning the title column as a str rather than unicode. I thought that passing encoding='utf8' as an argument to create_engine would take care of this, but the right way to do it is to set the charset in the database URI: mysql://me@myserver/mydatabase?charset=utf8.
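
For anyone who builds the URI programmatically, here is a minimal Python 3 sketch of adding the charset parameter with the standard library. The helper name `with_charset` is hypothetical, and the commented-out `create_engine` call assumes SQLAlchemy and a MySQL driver are installed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_charset(db_url, charset="utf8"):
    """Return db_url with its charset query parameter set, replacing any existing one."""
    parts = urlsplit(db_url)
    query = parse_qs(parts.query)
    query["charset"] = [charset]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


url = with_charset("mysql://me@myserver/mydatabase")
print(url)  # mysql://me@myserver/mydatabase?charset=utf8

# The resulting URL would then be handed to SQLAlchemy, for example:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
```

Putting the charset in the URI hands it to the MySQL driver when the connection is opened, which is what made the column come back as unicode in my case.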

Thank you for all your answers!

user1491915

In my case, the database itself had the wrong encoding. I had to follow [these steps](https://stackoverflow.com/a/16737776/5272567) to change the PostgreSQL encoding to UTF-8.

Matthias