While scraping web data, I keep running into 'ascii' encoding errors such as:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128)

I came across this controversial solution, which some say is dangerous:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Please see here: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
I am using Beautiful Soup, and my app indexes text collected in different languages, such as German and French, besides English.
This is the snippet that generates the occasional errors:
import urllib2
from BeautifulSoup import BeautifulSoup

for page in pages:
    try:
        c = urllib2.urlopen(page)
    except:
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())
Traceback (excerpt):

    soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)
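One direction I have been considering, instead of touching sys.setdefaultencoding, is to read the raw bytes and hand BeautifulSoup the charset declared in the Content-Type header. This is only a rough sketch (it assumes BeautifulSoup 3's fromEncoding argument and that the servers actually declare a charset), and I am not sure it is robust:

import urllib2
from BeautifulSoup import BeautifulSoup

for page in pages:
    try:
        c = urllib2.urlopen(page)
    except:
        print "Could not open %s" % page
        continue
    raw = c.read()
    # charset declared by the server, e.g. "utf-8"; None if not declared
    charset = c.info().getparam('charset')
    # give BeautifulSoup the hint instead of letting it guess
    soup = BeautifulSoup(raw, fromEncoding=charset)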
What is the safest way of scraping my data here without encoding issues?