
While scraping web data, I keep running into 'ascii' encoding errors such as:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 20-22: ordinal not in range(128),

and I came across this controversial solution which some say is dangerous:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Please see: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
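
For reference, the error can be reproduced in isolation. This is only a minimal sketch: the sample string and the helper name `try_ascii` are hypothetical, and the explicit `.encode("ascii")` stands in for the implicit ASCII conversion that Python 2 performs when unicode text meets a byte string. It also shows the usual alternative to changing the default encoding, which is to encode explicitly at the point of output.

```python
# Hypothetical sample: a German word containing non-ASCII characters.
text = u"Gr\u00f6\u00dfe"


def try_ascii(value):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)


# The ASCII codec cannot represent the two non-ASCII characters.
ascii_result = try_ascii(text)

# Encoding explicitly as UTF-8 succeeds without touching the default encoding.
utf8_bytes = text.encode("utf-8")
```

The point of the sketch is that the failure comes from an ASCII encode step, so naming the codec at the output boundary avoids it without `sys.setdefaultencoding`.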

I am using Beautiful Soup, and my app indexes text collected in several languages, such as German and French, in addition to English.

This is the snippet generating occasional errors:

for page in pages:
    try:
        c = urllib2.urlopen(page)
    except:
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())

Traceback:

soup = BeautifulSoup(c.read())
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1147, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 1189, in _feed
    SGMLParser.feed(self, markup)

What is the safest way to scrape this data without encoding issues?

8-Bit Borges
  • Maybe decode the data before passing it to `BS`: `c.read().decode('utf-8')` (if the page uses `utf-8`), because the problem is in BS. As far as I know, `sys.setdefaultencoding` addresses problems with `print()`. – furas Nov 17 '16 at 00:16
  • Or use the `requests` module, which can detect the encoding used on the page and gives a decoded string in `r.text` (or undecoded bytes in `r.content`). – furas Nov 17 '16 at 00:18
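
The decode-first suggestion in the comments could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested fix for the page in question: `decode_html`, its parameters, and the order of fallbacks are all hypothetical choices, and the charset sniffing is deliberately crude. It tries a charset from the HTTP header (if the caller supplies one), then a charset declared near the top of the markup, then UTF-8, and finally falls back to Latin-1 so that decoding never raises.

```python
import re

# Crude match for charset=... in a <meta> tag (hypothetical helper pattern).
_META_CHARSET = re.compile(br'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
                           re.IGNORECASE)


def decode_html(raw, header_charset=None, fallback="latin-1"):
    """Decode raw page bytes to unicode text before parsing.

    Candidate encodings are tried in order: the HTTP header charset
    (if given), a charset declared in the first 2048 bytes, then UTF-8.
    If all fail, the fallback codec is used with replacement characters.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = _META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown codec name: try the next candidate.
            continue
    return raw.decode(fallback, "replace")
```

With the loop from the question, this would be used as `soup = BeautifulSoup(decode_html(c.read()))`, so that BeautifulSoup receives unicode text rather than raw bytes. On Python 2, the header charset can usually be read with `c.headers.getparam('charset')`, which returns `None` when the server does not declare one.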
