29

I'm learning about urllib2 and Beautiful Soup, and in my first tests I am getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but they all seem to come with catch-22s, e.g.:

I want to print post.text (where text is a Beautiful Soup attribute that just returns the text). Both str(post.text) and post.text produce the Unicode errors (on things like right apostrophes ’ and ellipses …).

So I add post = unicode(post) above str(post.text), then I get:

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produced the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

and then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

It would be great if I could just declare somewhere at the top of the document "make this content interpretable" and still have access to the required text attribute.


Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text) I get:

TypeError: unsupported operand type(s) for -: 'str' and 'int'  

If I add post = post.decode() above str(post.text) I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text) I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

And for the sake of trying things that might work, I installed lxml for Windows from here and implemented it with:

parsed_content = BeautifulSoup(original_content, "lxml")

according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps didn't seem to make a difference.

I'm using Python 2.7.4 and Beautiful Soup 4.


Solution:

After getting a deeper understanding of Unicode, UTF-8 and Beautiful Soup types, I found that the problem was in how I was printing. I removed all my str calls and concatenations, e.g. str(something) + post.text + str(something_else), and printed something, post.text, something_else instead. It now seems to print well, except that I have less control over the formatting at this stage (e.g. print inserts a space at each ,).
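One way to regain control over the formatting without going back to str and + is to build the whole line as text first and encode it only once, at the point of output. The following is a minimal sketch, not the code from the original program: `format_post`, `label`, `footer` and `post_text` are hypothetical names, and `post_text` stands in for `post.text`.

```python
# Sketch only: format_post, label, footer and post_text are hypothetical names.

def format_post(label, text, footer):
    # A unicode template keeps every piece as text, so nothing is
    # silently converted through the ASCII codec along the way.
    return u"{0}\n\n{1}\n\n{2}".format(label, text, footer)

# Stand-in for post.text: contains a right apostrophe and an ellipsis.
post_text = u"It\u2019s done\u2026"
line = format_post(1, post_text, u"end")

# Encode once, where bytes are actually needed (a file, a pipe, or print in Python 2).
encoded = line.encode("utf-8")
```

In Python 2.7, `print encoded` would then write those bytes unchanged.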

Georgy
user1063287
  • possible duplicate of [Easy Q: UnicodeEncodeError: 'ascii' codec can't encode character](http://stackoverflow.com/questions/1652904/easy-q-unicodeencodeerror-ascii-codec-cant-encode-character) – R. Martinho Fernandes Apr 27 '13 at 14:59

3 Answers

46

In Python 2, unicode objects can only be printed if they can be converted to ASCII. If it can't be encoded in ASCII, you'll get that error. You probably want to explicitly encode it and then print the resulting str:

print post.text.encode('utf-8')
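If encoding at every print call becomes tedious, the standard library's `codecs.getwriter` can wrap a byte stream so that unicode written to it is encoded automatically. The sketch below uses an in-memory buffer as a stand-in for the real stream; in Python 2 the wrapped stream would typically be `sys.stdout`, which is an assumption about the setup rather than something shown in the question.

```python
import codecs
import io

# Stand-in for a byte stream such as a file opened in binary mode.
buffer = io.BytesIO()

# The writer encodes any unicode text to UTF-8 before passing it to the stream.
writer = codecs.getwriter("utf-8")(buffer)
writer.write(u"It\u2019s done\u2026\n")

written = buffer.getvalue()  # the UTF-8 bytes that reached the stream
```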
icktoofay
  • 1
    `+ '\n\n' + post.text.encode("utf-8") + '\n\n' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)` – user1063287 Apr 28 '13 at 01:54
  • 1
    and fwiw, i am printing `type(post)` to see what i am working with and it is ``. – user1063287 Apr 28 '13 at 02:02
  • 1
    @user1063287: `encode` can't raise a `UnicodeDecodeError`. What's the traceback? – icktoofay Apr 28 '13 at 02:08
  • `Traceback (most recent call last): File "F:\path\to\program.py", line 101, in my_function(line) File "F:\path\to\program.py", line 86, in my_function + '\n\n' + post.text.encode("utf-8") + '\n\n' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)` – user1063287 Apr 28 '13 at 02:16
  • 1
    @user1063287: I guess what I'm trying to say is that I need some more context around it. I know that `post.text.encode('utf-8')` on its own should work fine; it's just that something else is then trying to decode it, and you haven't shown the code that's doing it. If you could edit your question to include a bit more context about where it's being used, that would be helpful. – icktoofay Apr 28 '13 at 04:04
  • i have tried to replicate the scenario in order to provide more detail, but the replication is working. type(post) is returning `` in both cases, so in that sense i know what i am working with. in the real implementation it is getting stuck on things like `’` and `…`. is there any way i can say, perhaps at the soup definition level, `'make all of this a certain format'`? the scenario is: use urllib2 to open a web page, change the content to a beautiful soup object, search for divs of a certain class and, for each div that it finds, print the text of the div. – user1063287 Apr 28 '13 at 09:14
  • ok, it had something to do with my printing methodology, i removed all my `str` methods and concatenations eg `str(something) + post.text + str(something_else)` so that it was `something, post.text, something_else` and it seems to be printing well except i have less control of the formatting at this stage (eg spaces inserted at `,`). thank you all for your assistance. – user1063287 Apr 28 '13 at 09:34
  • 2
    @user1063287: Basically, Python 2 has this weird `str` and `unicode` thing going on. If you concatenate them, then it will implicitly encode or decode (I forget which) as ASCII so that they're the same type. Of course, when you're dealing with non-ASCII things, you can't do that: you have to *explicitly* make sure everything is the same type. Python 3 fixes this by making it raise an error if you mix them rather than resorting to sometimes-works-sometimes-not behavior. – icktoofay Apr 28 '13 at 21:41
  • 1
    *In Python 2, unicode objects can only be printed if they can be converted to ASCII* This is incorrect. Python detects the locale when started, and configures `stdout` and `stderr` to automatically encode Unicode written to those file objects to be encoded. That means that for correctly configured consoles and terminals, printing non-ASCII unicode can work just fine. – Martijn Pieters Sep 15 '17 at 09:34
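The implicit step mentioned in the comments above is a decode: when a byte `str` is combined with a `unicode` object, Python 2 decodes the bytes with the default codec, which is ASCII. Doing the same decode explicitly reproduces the error from the traceback. This is a small sketch with illustrative variable names, written so that it also runs under Python 3.

```python
# UTF-8 bytes of the ellipsis character: b'\xe2\x80\xa6'
encoded = u"\u2026".encode("utf-8")

message = None
try:
    # This is the conversion Python 2 attempts on its own when a byte
    # string is concatenated with a unicode string.
    encoded.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
```

Keeping every piece as unicode until the final output step avoids this conversion entirely.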
2
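    import urllib.request  # Python 3 counterpart of urllib2
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen(THE_URL).read()  # THE_URL is a placeholder
    soup = BeautifulSoup(html)
    print("'" + str(soup.encode("ascii")) + "'")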

worked for me ;-)
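This probably works because Beautiful Soup's `encode` substitutes character references for anything the target encoding cannot represent (as far as I can tell, its default error handler is `xmlcharrefreplace`). The same standard-library handler can be applied to plain text. The sketch below uses a stand-in string rather than a real page.

```python
# Stand-in for text extracted from a page: right apostrophe and ellipsis.
text = u"It\u2019s done\u2026"

# Characters outside ASCII become numeric character references.
as_references = text.encode("ascii", "xmlcharrefreplace")

# Alternative: replace each unencodable character with a question mark.
as_placeholders = text.encode("ascii", "replace")
```

Both results are plain ASCII bytes, at the cost of no longer containing the original characters.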

Patpog
0

Did you try .decode() or .decode("utf-8")?

I also recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html

jeyraof
  • i tried these and have added the results to the original post. i have just learnt the basics of beautiful soup and urllib2 and it has taken me about two weeks, do i really need to learn two more programs? lxml looks very difficult to me, and that is why i chose beautiful soup, because i could understand it more easily. and just to reiterate, i am only trying to get 'simple' English language text and it is balking on common elements like right apostrophes `’` and ellipses `…`. – user1063287 Apr 28 '13 at 00:53