I want to scrape some content from a webpage. This is the code:

import requests
from bs4 import BeautifulSoup
import urllib2
url = "anUrl"
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
print soup.prettify()

This is the error description: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position : character maps to <undefined>

This kind of error can be triggered by different characters, not always the same one, so I need a generic solution.

Poggio

2 Answers

I think you have the same problem as in: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128)

So you can use u'\u2013'.encode('utf8') :) (to be more specific, use soup.prettify().encode('utf8'))

Or switch to Python 3 ;)
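
As a minimal sketch of where the encode call goes, the snippet below uses a hard-coded string as a stand-in for soup.prettify() (an assumption, so that it runs without a network request). The helper name to_utf8 is hypothetical, and the code is written to run on both Python 2 and Python 3.

```python
def to_utf8(text):
    """Return UTF-8 bytes for a unicode string (hypothetical helper)."""
    return text.encode('utf8')

# Stand-in for soup.prettify(); u'\u2013' is the en dash from the error message.
page = u'<p>2010\u20132015</p>'
encoded = to_utf8(page)

# The en dash becomes the three bytes E2 80 93, so the console codec
# no longer has to map the character itself.
print(repr(encoded))

# In the question's Python 2 script the last line would become:
#     print soup.prettify().encode('utf8')
```

Note that a console which does not use UTF-8 will likely show those bytes as garbled characters, but the print itself should no longer raise.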

Labo
  • I've already looked at that answer. I'm forced to use Python 2.*, but I don't know where to put u'\u2013'.encode('utf8') in my code. – Poggio Oct 15 '15 at 15:23
  • It should be `r.text.encode('utf8')` or `r.content.encode('utf8')`; I don't know exactly where you get the error. – EsseTi Oct 15 '15 at 15:25
  • You don't say exactly where you're getting your error, but from your description it sounds like you might need to properly encode the pretty soup going out to the terminal with: `print soup.prettify().encode('utf8')`. – xnx Oct 15 '15 at 15:36

To fix the print command, you can explicitly encode the output. You have many different choices depending on how you want to treat Unicode characters.

If you simply want to eliminate any characters that aren't supported by your console:

print soup.prettify().encode(sys.stdout.encoding, 'ignore')

If you want to replace characters that aren't supported with a placeholder character (typically a question mark):

print soup.prettify().encode(sys.stdout.encoding, 'replace')

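If you want to show unsupported characters as an escape sequence (with this codec, characters in the Latin-1 range are written as single bytes and everything else as \uXXXX):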

print soup.prettify().encode('raw_unicode_escape')
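
To compare the three options side by side, here is a small sketch that runs on Python 2 and 3. The helper encode_for_console is hypothetical, 'cp437' is only an assumed example of a console code page that lacks U+2013, and the UTF-8 fallback for a missing sys.stdout.encoding (which Python 2 reports as None when output is piped) is also an assumption.

```python
import sys

def encode_for_console(text, errors='replace', encoding=None):
    """Encode unicode text for the console (hypothetical helper).

    Falls back to UTF-8 when no encoding is given and
    sys.stdout.encoding is unset (an assumption of this sketch).
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors)

sample = u'2010\u20132015'  # contains an en dash

# 'cp437' stands in for a console code page without the en dash.
dropped = encode_for_console(sample, 'ignore', 'cp437')
replaced = encode_for_console(sample, 'replace', 'cp437')
escaped = sample.encode('raw_unicode_escape')

for name, value in [('ignore', dropped), ('replace', replaced), ('escape', escaped)]:
    print(name + ': ' + repr(value))
```

With 'ignore' the en dash disappears, with 'replace' it becomes a question mark, and with 'raw_unicode_escape' it is written out as the six ASCII characters \u2013.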

When you're ready to write to HTML output, you should encode it consistently to the encoding that your web page will use, preferably UTF-8.

f.write(soup.prettify().encode('utf-8'))
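
An alternative to encoding by hand, sketched below, is to open the file with io.open and an explicit encoding so that the file object does the encoding on write. This is a swapped-in technique, not what the line above does. The string is a stand-in for soup.prettify() and the temporary output path is hypothetical.

```python
import io
import os
import tempfile

html = u'<p>2010\u20132015</p>'  # stand-in for soup.prettify()

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'page.html')

# io.open encodes unicode text to UTF-8 on write (Python 2 and 3).
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(html)

# Read the raw bytes back to confirm the en dash was stored as UTF-8.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))
```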
Mark Ransom
  • Do you know how to show the Python script's output in the browser through JavaScript? In a previous Python script I used this: print "Content-type: text\n\n", but in that case I was not using BeautifulSoup, so now I'm not able to pass a useful object to the JS script. – Poggio Oct 16 '15 at 15:22
  • @Poggio sorry, I haven't yet used Python to output a web page so it's outside of my area of expertise. – Mark Ransom Oct 16 '15 at 15:34