I am trying to scrape a Japanese website (a trimmed-down sample is below):

<html>
<head>
<meta charset="euc-jp">
</head>
<body>
<h3>不審者の出没</h3>
</body>
</html>

I am trying to fetch this HTML with the requests package using:

import requests
response = requests.get(url)

The text I get from the h3 element is '¡ÊÂçʬ', and its escaped representation looks like this:

'\xa4\xaa\xa4\xaa\xa4\xa4\xa4\xbf\'

But when I load this HTML from a file or from a local WSGI server (I tried serving a static HTML page with Django), I get:

不審者の出没, which is the actual data.

How can I resolve this issue?

eyllanesc
ROHIT BANSAL
  • Possible duplicate of [python requests.get() returns improperly decoded text instead of UTF-8?](https://stackoverflow.com/questions/44203397/python-requests-get-returns-improperly-decoded-text-instead-of-utf-8) – Ami Hollander Jun 18 '18 at 04:58
  • Can you show us the code? How are you accessing the text of the response? How are you parsing it to find the h3 field? – abarnert Jun 18 '18 at 04:58
  • Also, is that really the same data? Because those bytes are definitely not euc-jp for 不審者の出没. For one thing, the trailing backslash is illegal, but even without that, those bytes decode as おおいた. – abarnert Jun 18 '18 at 05:01
  • @abarnert It's not the same data. I only wrote sample data. – ROHIT BANSAL Jun 19 '18 at 13:17
  • @AmiHollander After getting the response, its encoding is 'ISO-8859-1'. Because of this, I am not getting the actual data. – ROHIT BANSAL Jun 19 '18 at 13:21
  • @abarnert If I print the response, I do not get the actual HTML. – ROHIT BANSAL Jun 19 '18 at 13:26
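
The comments above point at a decoding mismatch, which can be sketched roughly as follows. This assumes the server's Content-Type header carries no charset (the question does not show the headers), in which case requests falls back to ISO-8859-1 for text responses and ignores the `<meta charset>` tag. The helper `fetch_text` and its `encoding` default are hypothetical names for illustration, not part of the original post.

```python
# The <h3> text as bytes, the way a server would send a page declared as EUC-JP.
raw = "不審者の出没".encode("euc_jp")

# What you see if those bytes are decoded with the ISO-8859-1 fallback.
mojibake = raw.decode("iso-8859-1")

# Decoding with the charset from the page's <meta> tag gives the real text.
decoded = raw.decode("euc_jp")

# A string that was already mis-decoded can be repaired, because ISO-8859-1
# maps every byte to exactly one character and back.
repaired = mojibake.encode("iso-8859-1").decode("euc_jp")

# The bytes quoted in the question (without the stray trailing backslash).
sample = b"\xa4\xaa\xa4\xaa\xa4\xa4\xa4\xbf".decode("euc_jp")


def fetch_text(url, encoding="euc_jp"):
    """Hypothetical helper: fetch a page and decode it with a known charset."""
    # Third-party package; imported lazily so the demo above needs only the stdlib.
    import requests

    response = requests.get(url)
    response.encoding = encoding  # override the encoding guessed from the headers
    return response.text
```

With a real response, decoding `response.content` directly (`response.content.decode("euc_jp")`) should give the same result as setting `response.encoding`. If the charset is not known in advance, `response.apparent_encoding` offers a guess based on the body, though a guess can be wrong for short pages.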

0 Answers