13

I'm programming in Python and retrieving information from a web page with the urllib2 library. The problem is that the page can contain non-ASCII characters such as 'ñ' and 'á'. As soon as urllib2 encounters one of these characters, it raises an exception like this:

File "c:\Python25\lib\httplib.py", line 711, in send
    self.sock.sendall(str) 
File "<string>", line 1, in sendall:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 74: ordinal not in range(128)

I need to handle those characters. I don't just want to catch the exception; I want the program to carry on and keep the data. Is there a way to use a codec other than ASCII, for example (I don't know whether that's a silly idea)? I have to work with those characters, insert them into a database, and so on.

agf

3 Answers

11

You just read a sequence of bytes from the socket. If you want a string, you have to decode it:

yourstring = receivedbytes.decode("utf-8") 

(substituting whatever encoding you're using for utf-8)

Then you have to do the reverse to send it back out:

outbytes = yourstring.encode("utf-8")
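
As a minimal sketch of that round trip, the snippet below decodes some sample bytes, encodes them again, and shows that encoding the same text as ASCII raises the error class seen in the question. The sample data and the variable names are made up for illustration, and UTF-8 is an assumption; substitute the encoding the page actually uses.

```python
# Sample data (assumed for illustration): the UTF-8 bytes for "mañana".
received_bytes = b"ma\xc3\xb1ana"

# Bytes -> text: decode with the encoding the data was written in.
text = received_bytes.decode("utf-8")

# Text -> bytes: encode again before writing to a socket or file.
out_bytes = text.encode("utf-8")

# Encoding non-ASCII text with the ASCII codec fails with the same
# exception class as the traceback in the question.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
```

Keeping the decode and encode steps explicit at the edges of the program means the ASCII codec is never applied implicitly.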
agf
dsimard
6

Use Unicode strings for all of your internal work if you can.

You probably will find this question/answer useful:

urllib2 read to Unicode
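
One way to get Unicode out of a response is to decode the body with the charset named in its Content-Type header. The sketch below works on plain values so it needs no network access; the helper names `charset_from_content_type` and `decode_body` are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption, not a rule.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption) when none is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type, default="utf-8"):
    """Decode a response body using the charset from its Content-Type."""
    return raw_bytes.decode(charset_from_content_type(content_type, default))


# Sample values (made up): a Latin-1 body containing 'ñ'.
page_text = decode_body(b"a\xf1o", "text/html; charset=ISO-8859-1")
```

With a real response, the bytes would come from reading the response object and the header value from its headers; how those are accessed depends on the Python version, so treat that part as something to check against the library documentation.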

Community
Paul McMillan
0

You might want to look into using an actual parsing library to find this information. lxml, for instance, already addresses Unicode encode/decode using the declared character set.

Hank Gay
  • Unfortunately, a lot of websites produce improperly encoded documents. Generally the encoding will be mostly correct, but there will be sporadic invalid byte sequences. Some applications won't have to worry about this, but if you are crawling random public websites, it will be a problem. – mikerobi Apr 25 '12 at 21:09
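
For the sporadic invalid byte sequences mentioned in the comment above, one possible approach is to decode with the "replace" error handler, which substitutes U+FFFD for anything that cannot be decoded instead of raising. This is only a sketch: the function name `decode_leniently` and the sample bytes are invented for illustration, and silently replacing bad bytes may not be acceptable for every application.

```python
def decode_leniently(raw_bytes, encoding="utf-8"):
    """Decode bytes, replacing undecodable sequences with U+FFFD."""
    return raw_bytes.decode(encoding, "replace")


# Sample data (made up): valid UTF-8 for "café", then a stray 0xFF byte.
damaged = b"caf\xc3\xa9 \xff ok"
lenient_text = decode_leniently(damaged)

# Strict decoding of the same bytes raises instead.
strict_error = None
try:
    damaged.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc
```

The "ignore" handler is the other built-in option; it drops the bad bytes entirely, which hides the damage rather than marking it.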