I am trying to scrape text from the web, using BeautifulSoup 4 to parse it. I am running into an issue when printing bs4-processed text to the console. Whenever I hit a character that was originally an HTML entity, like ’, I get garbage characters on the console. I believe bs4 is converting these entities to Unicode correctly, because if I try using another encoding to print the text, it complains about the lack of a mapping for the appropriate character (like u'\u2019'). I'm not sure why the print function gets confused by these characters. I've tried changing fonts, which changes the garbage characters. I am on a Windows 7 machine with a US-English locale. Here is my code for reference; any help is appreciated. Thanks in advance!

#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)

totalpages = decoded['response']['meta']['hits']/10

for page in range(totalpages + 1):
    if page>0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
DaWisePug
  • related: [Python, Unicode, and the Windows console](http://stackoverflow.com/q/5419/4279) – jfs Dec 17 '13 at 03:05
  • btw, you are not alone. [Obama also got it](http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx) – jfs Dec 17 '13 at 03:07
  • @J.F.Sebastian: I've several times almost marked a question as a dup of that one, but it's full of answers that look right and aren't. We really need somewhere that gathers all the different clunky workarounds, explains the problems with each, and makes it clear that unless you stop using either Windows or Python 2.x those clunky workarounds are as good as you're going to get… – abarnert Dec 17 '13 at 20:56
  • @abarnert: usually I just print Unicode and set appropriate `PYTHONIOENCODING` e.g., utf-8 for files, pipes and `ascii:xmlcharrefreplace` to avoid garbage in a console. http://bugs.python.org/issue1602 is overwhelming (the link is from my comment to that question). – jfs Dec 17 '13 at 21:16
  • @J.F.Sebastian: Well, `xmlcharrefreplace` isn't exactly end-user readable/friendly. Anyway, some future version of Python will use something like the Windows console objects from that issue (but integrated properly with the `io` module) to do UTF-16 output on Windows, after which the problem will go away (except for people on some old and specialized Unix systems, who tend to know what they're doing or not care). But as long as people stick with Python 2.x, it won't matter that 3.5 solved the problem… – abarnert Dec 18 '13 at 00:50

2 Answers

The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:

print u'\u2019'.encode('UTF-8')

The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.

So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.

Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
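
The effect can be reproduced without a console at all. The following is a minimal sketch (the variable names are illustrative, and it should run on both Python 2.7 and 3): it encodes U+2019 as UTF-8 and then decodes the same bytes the way a CP1252 console would.

```python
# Encode the right single quotation mark as UTF-8, then misread the
# resulting bytes as CP1252, which is what the console is doing.
right_quote = u"\u2019"
utf8_bytes = right_quote.encode("utf-8")          # the bytes e2 80 99
as_console_sees_it = utf8_bytes.decode("cp1252")  # three separate characters

print(repr(utf8_bytes))
print(repr(as_console_sees_it))
```

One character goes in and three come out, which is why no choice of font can repair the output.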

If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.

Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
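
As a rough sketch of that approach: `encode_for_console` below is a hypothetical helper, not part of any library, and the fallback to ASCII when `sys.stdout.encoding` is unavailable is an assumption you may want to change.

```python
import sys

def encode_for_console(text, encoding=None, errors="replace"):
    """Encode text for the console, degrading characters it cannot show.

    Hypothetical helper for illustration only.
    """
    # sys.stdout.encoding can be None (for example when output is piped
    # under Python 2), so fall back to ASCII in that case.
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, errors)

sample = u"It\u2019s \u4e2d"  # U+4E2D has no mapping in CP1252
replaced = encode_for_console(sample, "cp1252")           # '?' for U+4E2D
ignored = encode_for_console(sample, "cp1252", "ignore")  # U+4E2D dropped
```

With CP1252 as the target, the quotation mark survives as the single byte \x92, while the unmappable character is replaced or dropped depending on the `errors` value.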

Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.


If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.

abarnert
  • Yes, thanks for the response. I tried setting the codepage of the console, but that didn't really work either. My choice of python version was determined by the administrator of the linux server my school provides, where I ultimately want to run my program. My choice of Windows is due to me using a work provided laptop to do my schoolwork. I think that I should be OK once I move my program to the server, but it makes it sort of a pain while I'm developing. Maybe I will look at alternative consoles and see if one plays nicer with unicode. Thanks again for the great explanation! – DaWisePug Dec 17 '13 at 14:31
  • @DaWisePug: What exactly do you mean by "I tried setting the codepage of the console"? If you actually switch your Windows OEM code page to UTF-8, or trick cmd.exe into using UTF-8 (either of which may break all kinds of other things), or use a third-party console emulator that's UTF-8 friendly (are there any, short of setting up an X server and running something like rxvt?), your code should work. But I suspect you didn't do any of those things. – abarnert Dec 17 '13 at 20:51
  • I tried to trick cmd.exe by issuing the chcp 65001 command. This did not work. – DaWisePug Dec 18 '13 at 22:56
  • @DaWisePug: Did you read all the comments on [issue 1602](http://bugs.python.org/issue1602) or whichever other place you found the info about cp65001? You need to create a new console window with `cmd /u` so it can handle Unicode output before you `chcp 65001` it (otherwise it ends up converting cp65001 to cp1252 and erring out when that doesn't work). You can do both things at once with `cmd /u /k chcp 65001`, which will give you a console where you can print CP65001 bytes. Which still may not work for non-BMP characters, but at least it gets you the BMP. – abarnert Dec 19 '13 at 00:37
  • @DaWisePug: Also, you will probably have to `SET PYTHONIOENCODING=utf-8` (because Python has no idea what "cp65001" is), and pick a Unicode font like Lucida Console, and your input (stdin and/or sys.argv) may still be screwed up because the console input driver may end up sending CP1252 characters to your CP65001 console, and… It's a big mess. – abarnert Dec 19 '13 at 00:43

@abarnert's answer contains a good explanation of the issue.

In your particular case, you could just pass an encoding parameter to prettify() instead of using the default, utf-8.

If you are printing to console, you could try to print Unicode directly:

print soup.prettify(encoding=None, formatter='html') # print Unicode

It may fail. If you pass ascii, then BeautifulSoup may use numeric character references instead of non-ASCII characters:

print soup.prettify('ascii', formatter='html')

This assumes that the current Windows code page is an ASCII-based encoding (most of them are). It should also work if the output is redirected to a file or another program via a pipe.
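
To see why the ASCII-based assumption matters, here is a small sketch using only the standard codecs rather than BeautifulSoup itself (the list of code pages is just a sample chosen for illustration). Once the non-ASCII character has been turned into a numeric character reference, the bytes decode to the same text under every ASCII-compatible code page.

```python
text = u"It\u2019s"

# The xmlcharrefreplace error handler substitutes a numeric character
# reference for anything that plain ASCII cannot represent.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# Decode the same bytes under several ASCII-compatible encodings.
decoded = dict(
    (codepage, ascii_bytes.decode(codepage))
    for codepage in ("cp1252", "cp437", "cp850", "utf-8")
)
print(repr(ascii_bytes))
```

Every entry in `decoded` is the same string, so the console shows the reference literally instead of garbage.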

For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to get the appropriate character encoding, e.g., utf-8 for files and pipes, and ascii:xmlcharrefreplace to avoid garbage in a console.
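
A quick way to check what PYTHONIOENCODING does is to run a child interpreter with the variable set and capture its output. This is only a sketch: `run_with_io_encoding` is a hypothetical helper name, and it assumes `sys.executable` points at a usable Python.

```python
import os
import subprocess
import sys

def run_with_io_encoding(io_encoding):
    """Run a child Python that prints U+2019 and return its raw stdout bytes."""
    # Copy the current environment and override only PYTHONIOENCODING.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    child_code = "print(u'\\u2019')"
    return subprocess.check_output([sys.executable, "-c", child_code], env=env)

as_charref = run_with_io_encoding("ascii:xmlcharrefreplace").strip()
as_utf8 = run_with_io_encoding("utf-8").strip()
```

The child program is identical in both runs; only the environment variable decides whether it emits a numeric character reference or UTF-8 bytes.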

jfs
  • The result of this was the plain text printing of the HTML entities to the console, which could lead to a useful workaround. Thanks! – DaWisePug Dec 17 '13 at 14:19
  • @DaWisePug: *named* character entities are due to `formatter='html'`. Remove it if it is undesirable. – jfs Dec 17 '13 at 21:05
  • @DaWisePug: I've updated the answer to mention a more general solution: print Unicode, and use `PYTHONIOENCODING` for each specific case. – jfs Dec 17 '13 at 21:19