I am using requests to fetch a page. The task is very simple, but I have a problem with encoding. The page contains non-ASCII Turkish characters, but in the HTML source they appear as shown below:

ÇINARTEPE # What it looks like
&#199;INARTEPE # What it is like in HTML source

So, the operations below do not return what I expected:

# What I have tried as encoding
req.encoding = "utf-8"
req.encoding = "iso-8859-9"
req.encoding = "iso-8859-1"

# The operations
"ÇINARTEPE" in req.text # False, it must return True
bytes("ÇINARTEPE", "utf-8") in req.content # False
bytes("ÇINARTEPE", "iso-8859-9") in req.content # False
bytes("ÇINARTEPE", "iso-8859-1") in req.content # False

All I want is to find out whether the string "ÇINARTEPE" is in the HTML source.

Further Information

An example:

import requests

req = requests.get("http://www.eshot.gov.tr/tr/OtobusumNerede/290")
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-1"
"ÇINARTEPE" in req.text # False
req.encoding = "iso-8859-9"
"ÇINARTEPE" in req.text # False
# Supposed to return True

Environment

  • python 3.5.1
  • requests 2.10.0
Eray Erdin

1 Answer

What you need to do is unescape the HTML character references in your HTML. There are already some answers on Stack Overflow covering this; check this post.

Basically, one method (on Python 2) is:

from HTMLParser import HTMLParser
parser = HTMLParser()
html_decoded_string = parser.unescape(html_encoded_string)

UPDATE

Got a better answer from the Python 3 docs and tested it:

>>> import html
>>> html.unescape("&#199;INARTEPE")
'ÇINARTEPE'
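
Applied to the original problem, the idea can be sketched as follows. This is only an illustration: `source_contains` is a hypothetical helper name, and `sample` is a made-up fragment that assumes the page writes Ç as a character reference such as `&#199;`.

```python
import html


def source_contains(html_source: str, needle: str) -> bool:
    """Return True if needle occurs in html_source once character
    references (e.g. &#199; or &Ccedil;) have been decoded."""
    return needle in html.unescape(html_source)


# Hypothetical fragment standing in for req.text (assumed, not taken from the real page)
sample = '<option value="1">&#199;INARTEPE</option>'

print("ÇINARTEPE" in sample)                 # raw search misses the escaped form
print(source_contains(sample, "ÇINARTEPE"))  # search after unescaping
```

With requests, the same check would presumably be `source_contains(req.text, "ÇINARTEPE")`, provided `req.encoding` already decodes the rest of the page correctly.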
JeanPaulDepraz
  • The OP is using "python 3.5.1" and this is the module name for Python 2. The Python 3 equivalent is `html.parser` – Tadhg McDonald-Jensen May 19 '16 at 16:26
  • Also note that the `unescape` method was made accessible from just the `html` module, so in Python 3 you could really just use `import html ; html_decoded_string = html.unescape(html_encoded_string)` – Tadhg McDonald-Jensen May 19 '16 at 16:29
  • I just installed it, and it gave me an `ImportError` for the `markupbase` module, which is in 2.x versions. @TadhgMcDonald-Jensen is right. – Eray Erdin May 19 '16 at 16:30
  • The odd thing is that `html.escape("ÇINARTEPE")` doesn't change it, I wonder why? – Tadhg McDonald-Jensen May 19 '16 at 16:32
  • Tadhg, the answer is in `'ç' in html.entities.codepoint2name`, and I guess it is related to HTML's markup symbols: since it is not one of them, it doesn't need to be escaped. – JeanPaulDepraz May 19 '16 at 16:42
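
The asymmetry discussed in the comments above can be checked with a short sketch. It relies only on the standard-library `html` module; the markup string is an invented example.

```python
import html
from html.entities import codepoint2name

# html.escape only rewrites markup-significant characters
# (&, <, > and, by default, the two quote characters).
print(html.escape("ÇINARTEPE"))          # letters such as Ç pass through unchanged
print(html.escape("<b>Ç & co</b>"))      # only <, > and & are replaced

# Ç does have a named entity, so unescape can decode it when a page uses it.
# Note that codepoint2name is keyed by integer code points, not by characters.
print(codepoint2name[ord("Ç")])
print(html.unescape("&Ccedil;INARTEPE"))
```

So `html.escape` and `html.unescape` are not exact inverses: unescaping decodes every character reference it recognises, while escaping produces only the handful needed to keep markup safe.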