I'm scraping news articles from various sites, using GAE and Python.
The code that fetches one article URL at a time raises the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
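I assume this is the usual Python 2 problem of raw bytes meeting a unicode string and getting implicitly decoded as ASCII; this made-up two-liner (the literals are mine, not from my app) reproduces the same message:

raw = 'It\xe2\x80\x99s'     # UTF-8 bytes, like result.content below
page = u'<h1>' + raw        # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 ...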
Here's my code in its simplest form:
from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content
Here is another variant I have tried, with the same result:
def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
Here's the ugly, brittle one, which also doesn't work:
def fetch(url):
    headers = {'User-Agent': 'Chrome/11.0.696.16'}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except:
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s
The last variant returns s as the string "did not work" from the last except.
So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?
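If expanding it is the way forward, I assume it would end up as a loop over candidate codecs, roughly like this sketch (the function name and codec list are my own guesses); note that iso-8859-1 accepts every byte sequence, so it effectively acts as the catch-all at the end:

def decode_best_effort(raw):
    # Return the first clean decode from a list of candidate codecs.
    for codec in ('utf-8', 'gb2312', 'windows-1251', 'windows-1252', 'iso-8859-1'):
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue
    # iso-8859-1 never fails, so this guard is effectively unreachable.
    return raw.decode('utf-8', 'replace')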
Why do I fetch the entire HTML instead of running it through BeautifulSoup right away? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.
Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.
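What I was hoping already exists is something encoding-agnostic along these lines (just a sketch with names of my own; I'm assuming chardet can be bundled with a GAE app, which I haven't verified): honour the charset the server declares in Content-Type, and fall back to a statistical guess.

import chardet  # assumption: pure-Python library uploaded with the app

def to_unicode(result):
    # Prefer the charset declared in the Content-Type response header.
    ctype = result.headers.get('Content-Type', result.headers.get('content-type', ''))
    if 'charset=' in ctype:
        charset = ctype.split('charset=')[-1].split(';')[0].strip().strip('"\'')
        try:
            return result.content.decode(charset)
        except (UnicodeDecodeError, LookupError):
            pass
    # Otherwise guess from the bytes themselves.
    guess = chardet.detect(result.content)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    return result.content.decode(guess['encoding'] or 'utf-8', 'replace')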