Encode error scraping

Question

I am scraping a site that contains Chinese characters. How do I scrape Chinese characters?

from urllib.request import urlopen
from urllib.parse import urljoin

from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)


def main():
    parse_items()

if __name__ == '__main__':
    main()

The error looks like this:

http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
  File "parser.py", line 27, in <module>
    main()
  File "parser.py", line 24, in main
    parse_items()
  File "parser.py", line 20, in parse_items
    print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

please provide the complete error stack for the code you gave us in the question — DomTomCat, Jun 14 '16 at 13:07
Maybe this answer [http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) can help you. — gzc, Jun 14 '16 at 13:20

score 0 · Answer 1 · answered Jun 14 '16 at 14:34

Judging by the print syntax and the imports, I assume you are using Python 3, which matters for how Unicode is handled.

So we can expect href, title and title2 to all be Unicode strings (that is, ordinary Python 3 str objects). The print function, however, tries to convert those strings to an encoding accepted by the output system. For a reason I cannot know, your system defaults to ASCII, hence the error.
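The failure can be reproduced without touching the real console. The sketch below is only an illustration: ascii_stdout and try_print are hypothetical names, and the stream is an in-memory stand-in for a terminal whose encoding is ASCII.

```python
import io

# Hypothetical stand-in for a console that only accepts ASCII,
# as the traceback in the question suggests.
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding='ascii')

def try_print(text, stream):
    """Print text to stream; return the UnicodeEncodeError if one is raised."""
    try:
        print(text, file=stream)
        stream.flush()
        return None
    except UnicodeEncodeError as exc:
        return exc

# Chinese characters cannot be encoded as ASCII, so print fails.
error = try_print('苏宁', ascii_stdout)
# Plain ASCII text goes through without an error.
ok = try_print('suning', ascii_stdout)
```

Here error holds the same kind of exception as in the question, while ok is None. On the real system, sys.stdout.encoding shows which encoding print is targeting.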

How to fix:

The best way would be to make your system accept Unicode. On Linux or other Unix systems, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8). On Windows you can try chcp 65001, but that is far from certain to work.
If that does not work, or does not meet your needs, you can force an explicit encoding, or more exactly filter out the offending characters, because Python 3 natively uses Unicode strings.
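If changing the environment is not possible, the first option can also be approximated from inside the script by wrapping the output stream in a UTF-8 text layer. This is a sketch under assumptions, not part of the original answer: utf8_stream is a hypothetical helper, and an in-memory buffer stands in for the console so the result can be inspected.

```python
import io

def utf8_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8',
                            newline='\n', line_buffering=True)

# On a real console this would be roughly:
#     sys.stdout = utf8_stream(sys.stdout.buffer)
# Python 3.7+ also offers sys.stdout.reconfigure(encoding='utf-8').
buffer = io.BytesIO()
out = utf8_stream(buffer)
print('苏宁', file=out)   # no UnicodeEncodeError: UTF-8 can encode any str
raw = buffer.getvalue()  # the bytes that would reach the terminal
```

Whether the characters then display correctly still depends on the terminal itself understanding UTF-8. Setting the PYTHONIOENCODING=utf-8 environment variable before starting Python has a similar effect without code changes.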

For the second option, I would use:

import sys

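def u_filter(s, encoding=sys.stdout.encoding):
    # Round-trip str values through the target encoding; characters that
    # cannot be encoded are replaced by '?'. Non-str values pass through.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)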

That means: if s is a Unicode string, encode it in the encoding used for stdout, replacing any non-convertible character with a replacement character, then decode it back into a now-clean string.

and next:

def fprint(*args, **kwargs):
    fargs = [ u_filter(arg) for arg in args ]
    print(*fargs, **kwargs)

This means: filter out any offending characters from Unicode strings and print everything else unchanged.

With that, you can safely replace the print call that raises the exception with:

fprint(href, title, title2)
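To see what the filter does to a scraped title, here is a small self-contained illustration. It repeats u_filter with the encoding passed explicitly (an assumption made so the result does not depend on the local console), and the sample string is made up.

```python
def u_filter(s, encoding):
    # Same round-trip as above, but the encoding is an explicit argument.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)

# Hypothetical scraped title mixing Chinese and ASCII characters.
filtered = u_filter('苏宁 TV', 'ascii')    # Chinese characters become '?'
unchanged = u_filter('苏宁 TV', 'utf-8')   # UTF-8 can encode everything
number = u_filter(42, 'ascii')            # non-str values pass through
```

With an ASCII target, filtered is '?? TV', so the exception disappears but the Chinese text is lost. That is why fixing the console encoding is the better option when the characters themselves matter.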
