I am trying to scrape the NBA game predictions on FiveThirtyEight. I usually use urllib2 and BeautifulSoup to scrape data from the web. However, the HTML returned for this page is very strange: it is a string of bytes such as "\x82\xdf\x97S\x99\xc7\x9d", and I cannot decode it into regular text. Here is my code:

from urllib2 import urlopen
html = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/').read()

This method works on other websites and other pages on 538, but not this one.

Edit: I tried to decode the string using

html.decode('utf-8')

and the method located here, but I got the following error message:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

Michael Frasco
  • 1
    this might help http://stackoverflow.com/questions/4267019/double-decoding-unicode-in-python – Keatinge Apr 14 '16 at 02:22
  • 1
    Welcome to life working with Unicode characters in Python :) – Akshat Mahajan Apr 14 '16 at 02:27
  • I tried to decode the string using html.encode('raw_unicode_escape').decode('utf-8') but I get the following error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte – Michael Frasco Apr 14 '16 at 02:30

1 Answer

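That page appears to return gzip-compressed data by default, which is why the raw bytes look like garbage: the 0x8b in your error message is the second byte of the gzip magic number (0x1f 0x8b). Checking the Content-Encoding header and decompressing when it says gzip should do the trick: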

from urllib2 import urlopen
import zlib

response = urlopen('http://projects.fivethirtyeight.com/2016-nba-picks/')
body = response.read()
# .info() returns the HTTP response headers
if 'gzip' in response.info().get('Content-Encoding', ''):
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
    html = zlib.decompress(body, 16 + zlib.MAX_WBITS)
else:
    html = body

The result went into BeautifulSoup with no issues.

The HTTP headers (returned by .info() above) are often helpful when diagnosing problems with Python's URL libraries.
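For readers on Python 3, where urllib2 became urllib.request, the same check can be wrapped in a small helper. This is only a sketch: the name decode_body and the fallback to sniffing the gzip magic bytes are my own additions, not part of the code above, and the sample page is made up so the example runs without a network connection.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(raw, content_encoding=""):
    """Hypothetical helper: gunzip raw if it looks gzipped, else return it unchanged."""
    if "gzip" in content_encoding.lower() or raw[:2] == GZIP_MAGIC:
        # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw


# Made-up page content, compressed locally to stand in for the server response
page = b"<html><body>picks</body></html>"
compressed = gzip.compress(page)

html = decode_body(compressed, "gzip")
text = html.decode("utf-8")  # decoding works once the body is decompressed
```

With a real response from urllib.request.urlopen, the call would presumably look like decode_body(resp.read(), resp.headers.get('Content-Encoding', '')).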

Steve Cohen