
I am using lxml 4.5.0 to scrape data from a website.

It works well in the following example:

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://www.yahoo.co.jp')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="tabTopics1"]/a')[0]

result.text

result.text gives me the correct text 'ニュース'.

But when I try another site, it fails to parse the Japanese properly:

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text

Here result.text gives me 'å\x9b½å\x86\x85æ\x97\x85è¡\x8c', but it should be '国内旅行'.

I tried using parser = etree.HTMLParser(encoding='utf-8'), but it still does not work.

How can I make lxml parse Japanese properly in this case?

kyang922
  • You may want to use BeautifulSoup instead - it's geared for messy HTML and messy Unicode. – AKX Mar 03 '20 at 10:22
  • @AKX yeah, I tried BeautifulSoup at first, but I found it does not support XPath, so I turned to lxml – kyang922 Mar 03 '20 at 10:24
  • I also tried Selenium, and it works well, but I don't want to use it since it needs a browser running in the background and causes a lot of overhead. I don't need to run JavaScript or anything else, just parse the HTML. – kyang922 Mar 03 '20 at 10:30
  • Somewhere the `utf-8` in the `meta` of the second website gets lost; the other website has the same, but there it works. Forcing recoding works: `print(bytes(result.text, encoding='latin-1').decode('utf8'))`, but you cannot know in advance when that is necessary, so it's no proper "solution". – Jongware Mar 03 '20 at 10:31
  • `StringIO(resp.content.decode('utf-8'))` – furas Mar 03 '20 at 11:50
  • You can try this; you don't have to worry about the encoding: `from simplified_scrapy import req; html = req.get('https://travel.rakuten.co.jp/')` – dabingsou Mar 04 '20 at 00:03
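
Expanding on AKX's BeautifulSoup suggestion above: when given raw bytes, BeautifulSoup detects the document's encoding itself, so the mojibake does not appear. This is only a sketch, and since BeautifulSoup has no XPath support, the CSS selector standing in for the OP's XPath is an assumption:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://travel.rakuten.co.jp/')
# pass the raw bytes, not resp.text, so BeautifulSoup sniffs the encoding itself
soup = BeautifulSoup(resp.content, 'html.parser')
# hypothetical CSS selector approximating '//*[@id="rt-nav-box"]/li[1]/a'
result = soup.select('#rt-nav-box li a')[0]
print(result.text)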

1 Answer


Using

print(resp.encoding)

you can see that requests used ISO-8859-1 to convert resp.content to resp.text,

but you can take resp.content directly and decode it with a different encoding:

StringIO( resp.content.decode('utf-8') )

Using the chardet module, you can try to detect which encoding you should use:

print( chardet.detect(resp.content) )

Result:

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

import requests
from lxml import etree
from io import StringIO
import chardet

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')

    print(resp.encoding)                 # ISO-8859-1
    print(chardet.detect(resp.content))  # {'encoding': 'utf-8', ...}
    detected_encoding = chardet.detect(resp.content)['encoding']

    parser = etree.HTMLParser()
    # decode the raw bytes with the detected encoding instead of using resp.text
    #tree = etree.parse(StringIO(resp.content.decode('utf-8')), parser)
    tree = etree.parse(StringIO(resp.content.decode(detected_encoding)), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text

EDIT: as @usr2564301 found in the answer to

python requests.get() returns improperly decoded text instead of UTF-8?

it can be resolved with

 resp.encoding = resp.apparent_encoding 

which uses chardet to recognize the encoding.
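
For completeness, here is the OP's second snippet with only that one line added (a sketch; the UA string is the same as in the question):

import requests
from lxml import etree
from io import StringIO

chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 " \
            "(KHTML, like Gecko) Chrome/77.0.3864.0 Safari/537.36"

with requests.Session() as s:
    s.headers.update({'User-Agent': chrome_ua})
    resp = s.get('https://travel.rakuten.co.jp/')
    # let requests re-decode the body with the encoding chardet detected
    resp.encoding = resp.apparent_encoding
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(resp.text), parser)
    result = tree.xpath('//*[@id="rt-nav-box"]/li[1]/a')[0]

result.text  # '国内旅行'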

furas
  • Is it unrelated to the fact that `meta charSet="utf-8"` in the first site seems parsed, but `meta http-equiv="Content-Type" content="text/html; charset=UTF-8"` in the second test site is skipped? Is that something requests should have seen? – Jongware Mar 03 '20 at 12:28
  • I don't know how `requests` checks the encoding, but before the HTML it gets the headers, and it may use the `'content-type'` header to recognize the encoding. If you check `resp.headers['content-type']`, you will see that the first page sends `text/html; charset=UTF-8` but the second sends only `text/html`, without an encoding. – furas Mar 03 '20 at 12:40
  • see [python requests.get() returns improperly decoded text instead of UTF-8?](https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8) - it seems it uses only the `'content-type'` header to recognize the encoding, and for `text/html` it uses `ISO-8859-1` as the default value. – furas Mar 03 '20 at 12:43
  • A-ha! Excellent find -- want to close this question as a duplicate? This one merely says "in some case"; the dup correctly names the underlying issue. – Jongware Mar 03 '20 at 13:24
  • Note that in this case, all the OP's code needs is this additional line: `resp.encoding = resp.apparent_encoding`. As explained in the duplicate, requests already runs a `chardet` detection in the background. – Jongware Mar 03 '20 at 13:31
  • @usr2564301 good point - I missed `apparent_encoding` in that answer. – furas Mar 03 '20 at 13:41
  • @furas oh, I did not notice that the problem was caused by requests. Thanks for pointing out what the real problem is. Thanks a lot! – kyang922 Mar 04 '20 at 02:01
  • Maybe we can advise the requests developers to change their default value to 'UTF-8' when the website doesn't specify it in the header, since this is the decade of HTML5. I think that would fit real-world cases better and confuse fewer people. – kyang922 Mar 04 '20 at 02:23
  • But what if a page sends only `text/html` and actually uses `ISO-8859-1` in the HTML? I think this problem is rare enough that there is no need to change it. – furas Mar 04 '20 at 02:36
  • I think it is more likely that the page uses `UTF-8` in that case, since HTML5 has a higher market share now and the default for HTML5 is `UTF-8`. I don't know, I am not an expert on the HTML standard. Anyway, thank you! – kyang922 Mar 04 '20 at 02:53
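
To illustrate what these comments discuss, here is a small sketch inspecting what the server declares and what requests falls back to when no charset is sent (the printed values reflect what was observed above):

import requests

resp = requests.get('https://travel.rakuten.co.jp/')
print(resp.headers.get('content-type'))  # 'text/html' - no charset declared
print(resp.encoding)                     # 'ISO-8859-1', the default requests assumes for text/html
print(resp.apparent_encoding)            # 'utf-8', detected by chardet from the body bytes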