
I wanted to validate some HTML produced by a template-rendering function in my Python code.

I went to the GitHub page for validator.w3.org to consult the API.

Following my interpretation of what I read, I tried the following code:

import urllib.parse

import requests

index_html = '<!DOCTYPE html>\n<html lang="en">\n<head>\n  '\
    '<meta charset="UTF-8">\n  '\
    '<title></title>\n</head>\n<body>\n  \n</body>\n</html>\n'

# Build https://validator.w3.org/nu/?out=json from its parts.
query = urllib.parse.urlencode({'out': 'json'})
url = urllib.parse.urlunsplit(['https', 'validator.w3.org', 'nu/', query, ''])

# Send the document as UTF-8 bytes, matching the declared charset.
headers = {'Content-type': 'text/html; charset=utf-8'}
response = requests.post(url, headers=headers, data=index_html.encode('utf-8'))

response.json() returns:

*** UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 48: ordinal not in range(128)

response.content is this:

b'{"messages":[{"type":"info","message":"The Content-Type was \xe2\x80\x9ctext/html\xe2\x80\x9d. Using the HTML parser."},{"type":"info","message":"Using the schema for HTML with SVG 1.1, MathML 3.0, RDFa 1.1, and ITS 2.0 support."},{"type":"error","lastLine":5,"lastColumn":17,"firstColumn":10,"message":"Element \xe2\x80\x9ctitle\xe2\x80\x9d must not be empty.","extract":"\n \n

type(response.content) is <class 'bytes'>. I know that json.loads requires a string, so I guessed that response.json() was throwing an exception because the bytes failed to decode into a string:

import json
json.loads(response.content.decode('utf-8'))

Indeed, same exception:

*** UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 48: ordinal not in range(128)

I'm stuck: what part of this code do I need to change in order to get the JSON from the requests.post response?

Thanks in advance for your help.

dmmfll
  • This seems like an issue when using Python 2.x rather than Python 3.x; can anyone else confirm? – Richard Kenneth Niescior Dec 06 '15 at 13:52
  • Doh! Your comment made me recall I had been fiddling with Python versions in my virtualenv. I just tried my code in python3.4 and achieved the expected results. Thanks. :-) – dmmfll Dec 06 '15 at 14:02

1 Answer

T̶h̶e̶ ̶a̶n̶s̶w̶e̶r̶ ̶i̶s̶ ̶t̶o̶ ̶c̶h̶e̶c̶k̶ ̶t̶h̶a̶t̶ ̶i̶n̶d̶e̶e̶d̶ ̶o̶n̶e̶ ̶i̶s̶ ̶u̶s̶i̶n̶g̶ ̶P̶y̶t̶h̶o̶n̶3̶.̶x̶ ̶a̶n̶d̶ ̶n̶o̶t̶ ̶P̶y̶t̶h̶o̶n̶2̶.̶x̶ ̶w̶h̶e̶n̶ ̶o̶n̶e̶ ̶e̶x̶p̶e̶c̶t̶s̶ ̶t̶o̶ ̶b̶e̶ ̶u̶s̶i̶n̶g̶ ̶P̶y̶t̶h̶o̶n̶3̶.̶x̶!̶

See update below.

Thank you.

{'messages': [{'message': 'The Content-Type was “text/html”. Using the HTML parser.', 'type': 'info'}, {'message': 'Using the schema for HTML with SVG 1.1, MathML 3.0, RDFa 1.1, and ITS 2.0 support.', 'type': 'info'}, {'extract': '\n <title></title>\n</hea', 'firstColumn': 10, 'hiliteLength': 8, 'hiliteStart': 10, 'lastColumn': 17, 'lastLine': 5, 'message': 'Element “title” must not be empty.', 'type': 'error'}]}

Update:

There is more to this story. I was, in fact, using Python3. I just omitted the part about using py.test and the --pdb option.

How do I know I was using Python3?

Output from python3 test_mytest.py, where test_mytest.py contains:

if __name__ == '__main__':
    import sys
    import pytest
    sys.exit(pytest.main(['-s', '--pdb']))

is this:

platform linux -- Python 3.4.3, pytest-2.8.3, py-1.4.31, pluggy-0.3.1
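
The same check can be made from inside the running session, together with the encoding Python is using for stdout. This is only a minimal sketch; the helper name describe_runtime is hypothetical and not part of pytest or pdb:

```python
import sys


def describe_runtime():
    """Return the interpreter version and the encoding used for stdout."""
    return {
        'python': '.'.join(str(part) for part in sys.version_info[:3]),
        # Some replacement streams have no encoding attribute, hence getattr.
        'stdout_encoding': getattr(sys.stdout, 'encoding', None),
    }
```

Calling describe_runtime() at the pdb prompt shows both values at once. An ASCII-style value for stdout_encoding would be consistent with the encoding errors described here.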

I was still getting encoding errors after dropping into the pdb. I found the solution in the answer by @daveagp in this post.

He has written a page on his ordeal with this problem. Thank you @daveagp.

Once I executed export PYTHONIOENCODING='utf_8' I no longer had any encoding errors.

I was mistaken about my mistake!
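
Note that the exception is a UnicodeEncodeError, not a decode error. That is consistent with the parsed result failing when it is written to an ASCII-encoded output stream, rather than with the JSON parsing itself failing. Here is a minimal sketch of that reading, using an in-memory stream in place of the real stdout. The sample bytes are shortened from the response above, and try_write is a hypothetical helper:

```python
import io
import json

# Bytes shaped like the validator response in the question (shortened).
raw = (b'{"messages":[{"type":"info","message":'
       b'"The Content-Type was \xe2\x80\x9ctext/html\xe2\x80\x9d."}]}')

# Decoding and parsing succeed regardless of any terminal encoding.
parsed = json.loads(raw.decode('utf-8'))


def try_write(encoding):
    """Write repr(parsed) to an in-memory text stream that uses `encoding`.

    Returns the exception class name if encoding fails, otherwise None.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(repr(parsed))
        stream.flush()
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return None
```

Under these assumptions, try_write('ascii') fails on the curly quote '\u201c' while try_write('utf-8') succeeds. PYTHONIOENCODING sets the encoding of the standard streams, which would explain why exporting it made the errors disappear.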

dmmfll