Scraping a site with Chinese characters. How do I scrape Chinese characters?

from urllib.request import urlopen
from urllib.parse import urljoin

from lxml.html import fromstring

URL = 'http://list.suning.com/0-258003-0.html'
ITEM_PATH = '.clearfix .product .border-out .border-in .wrap .res-info .sell-point'

def parse_items():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        title = a.text
        em = elem.cssselect('em')[0]
        title2 = em.text
        print(href, title, title2)


def main():
    parse_items()

if __name__ == '__main__':
    main()

The error looks like this:

http://product.suning.com/0000000000/146422477.html Traceback (most recent call last):
  File "parser.py", line 27, in <module>
    main()
  File "parser.py", line 24, in main
    parse_items()
  File "parser.py", line 20, in parse_items
    print(href, title, title2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Andrew Gowa
  • Please provide the complete error stack for the code you gave us in the question. – DomTomCat Jun 14 '16 at 13:07
  • I have a problem with UTF-8. Added. – Andrew Gowa Jun 14 '16 at 13:13
  • Maybe this answer [http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) can help you. – gzc Jun 14 '16 at 13:20

1 Answer

Judging by the print syntax and the imports, I assume you are using Python 3, which matters for how Unicode is handled.

So we can expect href, title and title2 to all be Unicode strings (plain str in Python 3). The print function, however, has to convert them to the encoding of the output stream. For a reason I cannot determine from here, your system defaults to ASCII, hence the error.
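To check this on your machine without touching the network, here is a minimal sketch that reproduces the behaviour with in-memory streams. The helper name print_to_encoding and the sample string are made up for illustration; they are not part of your script.

```python
import io
import sys

def print_to_encoding(text, encoding):
    """Print `text` to an in-memory stream using `encoding`; return the bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()

title = "\u82cf\u5b81"  # two Chinese characters, written as escapes

# The encoding Python picked for the real stdout (depends on your environment).
print("stdout encoding:", sys.stdout.encoding)

# A UTF-8 stream accepts the text.
utf8_bytes = print_to_encoding(title, "utf-8")

# An ASCII stream raises the same exception as in the question.
try:
    print_to_encoding(title, "ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ASCII stream failed:", exc.reason)
```

If the first line prints something like ANSI_X3.4-1968 or ascii, your environment is the culprit rather than the scraping code.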

How to fix:

  • The best way is to make your system accept Unicode. On Linux or other Unix systems, you can declare a UTF-8 charset in the LANG environment variable (export LANG=en_US.UTF-8). On Windows you can try chcp 65001, although that is far from guaranteed to work.
  • If that does not work, or does not meet your needs, you can force an explicit encoding or, more precisely, filter out the offending characters, because Python 3 natively uses Unicode strings.
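If you cannot change the environment, another way to force an explicit encoding is to wrap the byte stream underneath stdout in a UTF-8 text stream. This is only a sketch: the helper name utf8_stdout is hypothetical, and whether the characters are then displayed correctly still depends on your terminal.

```python
import io
import sys

def utf8_stdout(binary_stream=None):
    """Return a text stream that always encodes as UTF-8.

    `binary_stream` defaults to the byte stream underneath sys.stdout.
    """
    if binary_stream is None:
        binary_stream = sys.stdout.buffer
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)

# Demonstration on an in-memory byte stream, so it runs anywhere.
raw = io.BytesIO()
out = utf8_stdout(raw)
print("\u82cf\u5b81", file=out)
out.flush()
written = raw.getvalue()
```

In a real script you would assign sys.stdout = utf8_stdout() once at startup. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") does the same job, and setting the PYTHONIOENCODING=utf-8 environment variable before launching Python avoids code changes altogether.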

I would use:

import sys

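def u_filter(s, encoding=sys.stdout.encoding):
    # Round-trip str values through the output encoding so that characters
    # it cannot represent become '?'; other values pass through unchanged.
    return (s.encode(encoding, errors='replace').decode(encoding)
            if isinstance(s, str) else s)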

That means: if s is a Unicode string, encode it with the encoding used for stdout, replacing every character that cannot be converted with a replacement character, and decode it back into a string that is now safe to print.
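To see what that encode/decode round trip does, here is a small sketch comparing the standard error handlers on an ASCII target. The helper roundtrip and the sample title are made up for illustration; 'replace' is the handler used above, the others are alternatives you may prefer.

```python
title = "\u82cf\u5b81 TV"  # two Chinese characters followed by ASCII text

def roundtrip(s, encoding, errors):
    """Encode and decode again, so only what the encoding supports survives."""
    return s.encode(encoding, errors=errors).decode(encoding)

replaced = roundtrip(title, "ascii", "replace")           # -> '?? TV'
escaped = roundtrip(title, "ascii", "backslashreplace")   # -> '\\u82cf\\u5b81 TV'
dropped = roundtrip(title, "ascii", "ignore")             # -> ' TV'
unchanged = roundtrip(title, "utf-8", "replace")          # -> same as title

print(replaced)
print(escaped)
```

With 'replace' the information is lost, whereas 'backslashreplace' keeps the code points visible in the output, which can be handy when debugging a scraper.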

and next:

def fprint(*args, **kwargs):
    # Filter every positional argument, then delegate to the built-in print.
    fargs = [u_filter(arg) for arg in args]
    print(*fargs, **kwargs)

This means: replace any offending characters in the string arguments and print everything else unchanged.

With that, you can safely replace the print call that raises the exception with:

fprint(href, title, title2)
Serge Ballesta