Python - set encoding for scraping in many languages

Question

While scraping web data, I am having 'ascii' encoding issues such as:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128),

and I came across this controversial solution which some say is dangerous:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Please see: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

I am using Beautiful Soup, and my app indexes text collected in several languages, such as German and French, in addition to English.

This is the snippet that generates occasional errors:

for page in pages:
    try:
        c = urllib2.urlopen(page)
    except:
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())

Traceback:

    soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)

What is the safest way to scrape my data here without encoding issues?

Maybe decode the data from `utf-8` before you pass it to `BS`: `c.read().decode('utf-8')` (if the page uses `utf-8`), because the problem is in BS. As far as I know, `sys.setdefaultencoding` helps with `print()` problems. - furas, Nov 17 '16 at 00:16
Or use the `requests` module, which can detect the encoding used on the page and gives you the decoded string in `r.text` (or the undecoded bytes in `r.content`). - furas, Nov 17 '16 at 00:18
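
A minimal sketch of the decode-first idea from the comments above: turn the raw bytes into unicode text before handing them to the parser, so nothing downstream has to guess. The helper name `decode_page`, its `declared` and `fallbacks` parameters, and the sample strings are all hypothetical, chosen for illustration. The fallback order (UTF-8, then cp1252) is an assumption that happens to suit many German and French pages, not a general rule.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Return unicode text for the raw bytes of a page.

    `declared` is the charset the server announced, if any (hypothetical
    parameter). Each candidate is tried in order; the first one that
    decodes without error wins.
    """
    candidates = [declared] if declared else []
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", "replace")


# Sample data, written with escapes so the file itself stays ASCII.
german_text = u"Gr\u00fc\u00dfe aus Z\u00fcrich"   # "Gruesse aus Zuerich" with umlauts
french_text = u"d\u00e9j\u00e0 vu \u00e0 No\u00ebl"  # "deja vu a Noel" with accents

german_bytes = german_text.encode("utf-8")    # page served as UTF-8
french_bytes = french_text.encode("latin-1")  # page served as ISO-8859-1

decoded_german = decode_page(german_bytes)
decoded_french = decode_page(french_bytes, declared="iso-8859-1")
decoded_guess = decode_page(french_bytes)  # no header: falls through to cp1252
```

In the loop from the question, this would look roughly like `soup = BeautifulSoup(decode_page(c.read(), c.headers.getparam('charset')))`, assuming the Python 2 `urllib2` response exposes the charset that way. Guessing can still choose the wrong codec silently, so a charset declared by the page should be preferred whenever one is available.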
