Requests to Handle Response Encoding

Question

I am using requests to fetch a page. The task is very simple, but I have a problem with encoding. The page contains non-ASCII Turkish characters, but in the HTML source they appear as numeric character references, as shown below:

ÇINARTEPE # What it looks like
&#199;INARTEPE # What it is like in HTML source

So, the operations below do not return what I expected:

# What I have tried as encoding
req.encoding = "utf-8"
req.encoding = "iso-8859-9"
req.encoding = "iso-8859-1"

# The operations
"ÇINARTEPE" in req.text # False, it must return True
bytes("ÇINARTEPE", "utf-8") in req.content # False
bytes("ÇINARTEPE", "iso-8859-9") in req.content # False
bytes("ÇINARTEPE", "iso-8859-1") in req.content # False

All I want is to find out whether the string "ÇINARTEPE" appears in the HTML source.

Further Information

An example:

req = requests.get("http://www.eshot.gov.tr/tr/OtobusumNerede/290")
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-1"
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-9"
"ÇINARTEPE" in req.text # False
# Supposed to return True

Environment

python 3.5.1
requests 2.10.0

Isn't it just `html.unescape("ÇINARTEPE")`? *^checks^* yep I think that is it. – Tadhg McDonald-Jensen May 19 '16 at 16:26
@TadhgMcDonald-Jensen, waiting for you to write the answer so I can mark it as accepted. – Eray Erdin May 19 '16 at 16:32
JeanPaul beat me to it, I'd rather miss out on some rep than post a duplicate answer. – Tadhg McDonald-Jensen May 19 '16 at 16:33

score 3 · Accepted Answer · edited May 23 '17 at 10:28

What you need to do is unescape the HTML character references (such as &#199;) in the page source before searching it. There are already some answers about this on Stack Overflow; check this post.

Basically, one method (Python 2 only, since `HTMLParser` is the Python 2 module name) is:

from HTMLParser import HTMLParser
parser = HTMLParser()
html_decoded_string = parser.unescape(html_encoded_string)

UPDATE

I found a better answer in the Python 3 docs and tested it:

>>> import html
>>> html.unescape("&#199;INARTEPE")
'ÇINARTEPE'
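
Applied to the question, a minimal sketch might look like the following. It assumes the page source carries the text as numeric character references like `&#199;`. The `page_source` string is a made-up stand-in for `req.text`, and `contains_text` is a hypothetical helper name, not part of requests or the standard library.

```python
import html


def contains_text(needle, page_source):
    """Return True if needle occurs in page_source once character references are decoded."""
    return needle in html.unescape(page_source)


# Stand-in for req.text; in the real case this would come from requests.get(url).text
page_source = "<td>&#199;INARTEPE</td>"

# The raw source only contains the ASCII reference, so a direct search fails
found_raw = "ÇINARTEPE" in page_source

# After unescaping, &#199; becomes Ç and the search succeeds
found_decoded = contains_text("ÇINARTEPE", page_source)

print(found_raw, found_decoded)
```

Because the reference `&#199;` consists only of ASCII characters, changing `req.encoding` has no effect on it, which would explain why none of the encodings tried in the question helped.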

answered May 19 '16 at 16:24

JeanPaulDepraz

The OP is using "python 3.5.1", and `HTMLParser` is the module name for Python 2. The Python 3 equivalent is `html.parser`. – Tadhg McDonald-Jensen May 19 '16 at 16:26
also note that the `unescape` method was made accessible from just the `html` module so in python 3 you could really just use `import html ; html_decoded_string = html.unescape(html_encoded_string)` – Tadhg McDonald-Jensen May 19 '16 at 16:29
I just installed it and got an `ImportError` for the `markupbase` module, which only exists in 2.x versions. @TadhgMcDonald-Jensen is right. – Eray Erdin May 19 '16 at 16:30
the odd thing is that `html.escape("ÇINARTEPE")` doesn't change it, I wonder why? – Tadhg McDonald-Jensen May 19 '16 at 16:32
Tadhg, the answer is in `'ç' in html.entities.codepoint2name`, and I guess it is related to HTML's markup symbols: since it is not one of them, it doesn't need to be escaped. – JeanPaulDepraz May 19 '16 at 16:42
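
The asymmetry discussed in these comments can be checked with a short sketch. It relies on the documented behaviour of `html.escape`, which only replaces `&`, `<`, `>` and (by default) the two quote characters. The sample strings are made up for illustration. Note that `html.entities.codepoint2name` is keyed by integer code points, so the lookup below uses `ord("Ç")` rather than the character itself.

```python
import html
import html.entities

# A string with no markup-significant characters passes through unchanged
escaped_plain = html.escape("ÇINARTEPE")

# Only &, <, > and quotes are replaced; Ç is left as it is
sample = '<a href="x">Ç & co</a>'
escaped_markup = html.escape(sample)

# unescape reverses the replacement
round_trip = html.unescape(escaped_markup)

# A named entity for Ç does exist, but html.escape does not use this table
entity_name = html.entities.codepoint2name[ord("Ç")]

print(escaped_plain)
print(escaped_markup)
print(entity_name)
```

So `html.unescape` decodes any character reference it recognises, while `html.escape` only produces the handful needed to keep text from being read as markup.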
