How to handle Unicode (non-ASCII) characters in Python?

Question

I'm programming in Python and fetching information from a web page with the urllib2 library. The problem is that the page can contain non-ASCII characters, like 'ñ', 'á', etc. As soon as urllib2 encounters one of these characters, it raises an exception like this:

File "c:\Python25\lib\httplib.py", line 711, in send
    self.sock.sendall(str)
File "<string>", line 1, in sendall:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 74: ordinal not in range(128)

I need to handle those characters. I mean, I don't want to just catch the exception; I want the program to keep working with the data. Is there any way to, for example (I don't know if this is a silly idea), use a codec other than ASCII? I have to work with those characters, insert them into a database, etc.

It would be useful if you could say, also, whether you're using Python 3+, or something earlier. — Sixten Otto, Oct 29 '09 at 17:04
Couldn't be Py3k since the urllib2 module has been removed (wrapped into urllib)... — Tim Pietzcker, Oct 29 '09 at 17:09
Duplicate: http://stackoverflow.com/questions/1020892/python-urllib2-read-to-unicode — S.Lott, Oct 29 '09 at 18:19

score 11 · Accepted Answer · edited Apr 25 '12 at 19:55

You just read a set of bytes from the socket. If you want a string you have to decode it:

yourstring = receivedbytes.decode("utf-8")

(replace utf-8 with whatever encoding the data actually uses)

Then you have to do the reverse to send it back out:

outbytes = yourstring.encode("utf-8")
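
As a minimal sketch of that round trip, assuming Python 3 (where bytes and str are separate types; in Python 2 the equivalents are str and unicode) and a hypothetical UTF-8 payload standing in for what the socket returned:

```python
# Assumption: Python 3. The payload below is made up for illustration;
# it stands in for the raw bytes read from the network.
received_bytes = b"Espa\xc3\xb1a"

# Decode once, right after reading, so the program works with text.
text = received_bytes.decode("utf-8")      # 'España'

# Do all processing on the decoded string.
shouted = text.upper()                     # 'ESPAÑA'

# Encode once, right before sending the data back out.
out_bytes = shouted.encode("utf-8")        # b'ESPA\xc3\x91A'
```

Keeping the decode and encode steps at the edges of the program means everything in between only ever sees text, never raw bytes.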

edited Apr 25 '12 at 19:55

agf

answered Oct 29 '09 at 16:58

dsimard

score 6 · Answer 2 · edited May 23 '17 at 12:02

You want to use Unicode for all your internal work if you can: decode bytes as soon as they come in, and encode only when you send data back out.

You probably will find this question/answer useful:

urllib2 read to Unicode (http://stackoverflow.com/questions/1020892/python-urllib2-read-to-unicode)
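
One way to get to Unicode early is to decode the response body with the charset the server declares in its Content-Type header. The following is only a sketch under assumptions: it targets Python 3, `decode_body` is a hypothetical helper name, and the header values are made-up examples. It uses the standard library's `email.message.Message` to parse the header rather than fetching a real page.

```python
from email.message import Message


def decode_body(body, content_type, default="utf-8"):
    """Decode raw bytes using the charset named in a Content-Type header.

    Hypothetical helper for illustration. Falls back to `default` when
    the header declares no charset. An unknown charset name would raise
    LookupError, and bytes that do not match the charset would raise
    UnicodeDecodeError.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset)


# Made-up sample data: the same word in two different encodings.
latin1_page = decode_body(b"ni\xf1o", "text/html; charset=ISO-8859-1")
utf8_page = decode_body(b"ni\xc3\xb1o", "text/html")
```

Both calls produce the same text, 'niño', even though the raw bytes differ. Note that a declared charset is not always truthful, so real pages may still need error handling.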

edited May 23 '17 at 12:02

Community

answered Oct 29 '09 at 15:45

Paul McMillan

score 0 · Answer 3 · answered Oct 29 '09 at 16:08

You might want to look into using an actual parsing library to extract this information. lxml, for instance, already handles Unicode encoding and decoding based on the document's declared character set.

answered Oct 29 '09 at 16:08

Hank Gay

Unfortunately, a lot of websites produce improperly encoded documents. Generally the encoding will be mostly correct, but there will be sporadic invalid byte sequences. Some applications won't have to worry about this, but if you are crawling random public websites, it will be a problem. – mikerobi Apr 25 '12 at 21:09
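
For documents like that, a strict decode fails on the first bad byte. A possible workaround, sketched here for Python 3 with a made-up byte string, is the `errors` argument of `bytes.decode`, which can substitute or drop invalid sequences instead of raising:

```python
# Assumption: Python 3. Made-up sample: valid UTF-8 with one stray
# invalid byte (0xff) in the middle.
raw = b"caf\xc3\xa9 \xff ok"

# Strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")   # 'café \ufffd ok'

# errors="ignore" silently drops the undecodable bytes.
ignored = raw.decode("utf-8", errors="ignore")     # 'café  ok'
```

Replacing keeps a visible marker where data was lost, which is usually easier to debug than silently ignoring it.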
