I want to validate some HTML produced by a template-rendering function in my Python code.
I went to the GitHub page for validator.w3.org to consult the API.
Following my interpretation of what I read, I tried the following code:
import requests
import urllib.parse
index_html = (
    '<!DOCTYPE html>\n<html lang="en">\n<head>\n '
    '<meta charset="UTF-8">\n '
    '<title></title>\n</head>\n<body>\n \n</body>\n</html>\n'
)
FRAGMENT = ''
query = {}
QUERY = 3
tokens = ['https', 'validator.w3.org', 'nu/', query, FRAGMENT]
headers = {'Content-type': 'text/html; charset=utf-8'}
query = {'out': 'json'}
query = urllib.parse.urlencode(query)
tokens[QUERY] = query
url = urllib.parse.urlunsplit(tokens)
kwargs = dict(
    headers=headers,
    data=index_html,
)
response = requests.post(url, **kwargs)
response.json()
returns:
*** UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 48: ordinal not in range(128)
response.content
is this:
b'{"messages":[{"type":"info","message":"The Content-Type was \xe2\x80\x9ctext/html\xe2\x80\x9d. Using the HTML parser."},{"type":"info","message":"Using the schema for HTML with SVG 1.1, MathML 3.0, RDFa 1.1, and ITS 2.0 support."},{"type":"error","lastLine":5,"lastColumn":17,"firstColumn":10,"message":"Element \xe2\x80\x9ctitle\xe2\x80\x9d must not be empty.","extract":"\n \n
The type(response.content) is <class 'bytes'>.
As far as I know, json.loads requires a string, so I postulated that response.json() was throwing an exception because the content was bytes that failed to decode into a string:
import json
json.loads(response.content.decode('utf-8'))
Indeed, same exception:
*** UnicodeEncodeError: 'ascii' codec can't encode character '\u201c' in position 48: ordinal not in range(128)
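One way to narrow this down is to separate parsing from display. The sketch below is a check under stated assumptions, not a confirmed diagnosis: `raw` is a shortened, hypothetical sample shaped like the response bytes rather than the real reply. It decodes and parses the sample, then prints the message through `ascii()` so that no non-ASCII character reaches the terminal. The exception is a UnicodeEncodeError rather than a UnicodeDecodeError, which suggests the failure may occur when the parsed text is written to an ASCII-only console, not when the bytes are decoded.

```python
import json

# Hypothetical, shortened sample shaped like the validator's reply
# (an assumption for illustration, not the full response).
raw = b'{"messages":[{"type":"info","message":"The Content-Type was \xe2\x80\x9ctext/html\xe2\x80\x9d."}]}'

# Decoding UTF-8 bytes and parsing them does not touch the terminal.
parsed = json.loads(raw.decode('utf-8'))
message = parsed['messages'][0]['message']

# ascii() escapes non-ASCII characters, so this prints on any console.
print(ascii(message))

# Encoding the same text as ASCII reproduces the kind of error reported above.
error_text = None
try:
    message.encode('ascii')
except UnicodeEncodeError as exc:
    error_text = str(exc)
print(error_text)
```

If the parse succeeds here and only the ASCII encoding step fails, the terminal or debugger output encoding is a likely place to look.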
My knowledge has run out, and I am stuck wondering what part of this code to change in order to get the JSON from the requests.post response.
Thanks in advance for your help.