What's the best way to go about validating that a document follows some version of HTML (preferably one that I can specify)? I'd like to be able to know where the failures occur, as in a web-based validator, except in a native Python app.
Please note that validation is different from tidying! Some of the answers that people are posting are about automatically correcting HTML, instead of merely verifying whether the HTML is valid or not. – Flimm May 26 '17 at 12:00
9 Answers
PyTidyLib is a nice Python binding for HTML Tidy. Their example:
from tidylib import tidy_document
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''',
                                 options={'numeric-entities': 1})
print(document)
print(errors)
Moreover, it's compatible with both the legacy HTML Tidy and the new tidy-html5.
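If you only need a verdict rather than a tidied document, the errors string can be scanned for the reported problems. A minimal sketch, assuming HTML Tidy's usual "line L column C - Severity: message" report format (treat the exact prefixes as an assumption):
from tidylib import tidy_document

document, errors = tidy_document('<p>fõo <img src="bar.jpg">',
                                 options={'numeric-entities': 1})
# Each non-empty line of `errors` is one diagnostic, e.g.
# "line 1 column 1 - Warning: missing <!DOCTYPE> declaration"
problems = [line for line in errors.splitlines() if line.strip()]
is_valid = not any('Error:' in line for line in problems)
print(is_valid)
for problem in problems:
    print(problem)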

XHTML is easy, use lxml.
from lxml import etree
from io import StringIO
etree.parse(StringIO(html), etree.HTMLParser(recover=False))
HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
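For the external-application route, here is a minimal sketch that runs nsgmls via the standard library's subprocess module and collects its diagnostics; the exact flags and message format depend on your nsgmls/OpenSP build, so treat them as assumptions:
import subprocess

def validate_with_nsgmls(path):
    """Run nsgmls on an SGML/HTML file and return its diagnostic lines.
    Assumes nsgmls is on the PATH and the document declares a DOCTYPE
    so the matching DTD can be located."""
    # -s suppresses normal output; diagnostics are written to stderr,
    # typically in the form "nsgmls:FILE:LINE:COLUMN:E: message"
    result = subprocess.run(["nsgmls", "-s", path],
                            capture_output=True, text=True)
    return result.stderr.splitlines()

for diagnostic in validate_with_nsgmls("page.html"):
    print(diagnostic)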

I think the most elegant way is to invoke the W3C Validation Service at
http://validator.w3.org/
programmatically. Few people know that you do not have to screen-scrape the results page, because the service returns non-standard HTTP header parameters
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid (or Valid)
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
that indicate the validity and the number of errors and warnings.
For instance, the command line
curl -I "http://validator.w3.org/check?uri=http%3A%2F%2Fwww.stalsoft.com"
returns
HTTP/1.1 200 OK
Date: Wed, 09 May 2012 15:23:58 GMT
Server: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2
Content-Language: en
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Connection: close
Thus, you can elegantly invoke the W3C Validation Service and extract the results from the HTTP header:
# Programmatic XHTML Validations in Python
# Martin Hepp and Alex Stolz
# mhepp@computer.org / alex.stolz@ebusiness-unibw.org

import urllib
import urllib2

URL = "http://validator.w3.org/check?uri=%s"
SITE_URL = "http://www.heppnetz.de"

# pattern for HEAD request taken from
# http://stackoverflow.com/questions/4421170/python-head-request-with-urllib2
request = urllib2.Request(URL % urllib.quote(SITE_URL))
request.get_method = lambda: 'HEAD'
response = urllib2.urlopen(request)

valid = response.info().getheader('X-W3C-Validator-Status')
if valid == "Valid":
    valid = True
else:
    valid = False
errors = int(response.info().getheader('X-W3C-Validator-Errors'))
warnings = int(response.info().getheader('X-W3C-Validator-Warnings'))

print "Valid markup: %s (Errors: %i, Warnings: %i) " % (valid, errors, warnings)

If security is an issue, you can easily install a local copy of the W3C validator, as described here: https://validator.w3.org/nu/about.html – Martin Hepp Mar 30 '21 at 18:04
You can decide to install the HTML validator locally and create a client to request the validation.
Here is a program I made to validate a list of URLs from a text file. I was just checking the HEAD to get the validation status, but if you do a GET you get the full results (see the sketch after the code below). Look at the validator's API; there are plenty of options for it.
import httplib2
import time

h = httplib2.Http(".cache")

f = open("urllistfile.txt", "r")
urllist = f.readlines()
f.close()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    resp = {}
    url = url.strip()
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print url, "FAIL"
        else:
            print url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings']
    except:
        pass
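For example, swapping HEAD for GET in the same request returns the full validation report in the response body. A small sketch for a single, hypothetical URL:
import httplib2

h = httplib2.Http(".cache")
url = "http://example.com/"  # hypothetical URL to check
urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
resp, content = h.request(urlrequest, "GET")
print(resp.get('x-w3c-validator-status'))
print(content[:500])  # beginning of the full report page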

The html5lib module can be used to validate an HTML5 document:
>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parse('<html></html>')
Traceback (most recent call last):
...
html5lib.html5parser.ParseError: Unexpected start tag (html). Expected DOCTYPE.
Sadly, `html5lib` [doesn't validate](http://stackoverflow.com/a/29992363/593047). – ron rothman Jan 25 '17 at 01:15

Try tidylib. You can get some really basic bindings as part of the elementtidy module (which builds ElementTrees from HTML documents): http://effbot.org/downloads/#elementtidy
>>> import _elementtidy
>>> xhtml, log = _elementtidy.fixup("<html></html>")
>>> print log
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element
Parsing the log should give you pretty much everything you need.
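A minimal sketch of turning that log into structured (line, column, severity, message) records, assuming the "line L column C - Severity: message" format shown above:
import re

# Matches lines such as:
#   line 1 column 1 - Warning: missing <!DOCTYPE> declaration
LOG_LINE = re.compile(r"line (\d+) column (\d+) - (\w+): (.*)")

def parse_tidy_log(log):
    issues = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            line_no, column, severity, message = match.groups()
            issues.append((int(line_no), int(column), severity, message))
    return issues

issues = parse_tidy_log(log)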

Here is an HTML validator based on lxml's HTMLParser. It is not a complete HTML validator, but it (1) does many of the most important checks, (2) does not require an internet connection, and (3) does not require a large library.
_html_parser = None

def validate_html(html):
    '''If lxml can properly parse the html, return the lxml representation.
    Otherwise raise.'''
    global _html_parser
    from lxml import etree
    from io import StringIO
    if not _html_parser:
        _html_parser = etree.HTMLParser(recover=False)
    return etree.parse(StringIO(html), _html_parser)
Note that this will not check for closing tags, so for example, the following will pass:
validate_html("<a href='example.com'>foo")
> <lxml.etree._ElementTree at 0xb2fd888>
However, the following won't:
validate_html("<a href='example.com'>foo</a")
> XMLSyntaxError: End tag : expected '>', line 1, column 29
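To report where a document fails, you can catch the exception and read its position attributes. A sketch, assuming lxml's XMLSyntaxError exposes the usual SyntaxError attributes (lineno, offset, msg):
from lxml import etree

try:
    validate_html("<a href='example.com'>foo</a")
except etree.XMLSyntaxError as exc:
    # lineno/offset/msg are assumed to be populated by lxml
    print("Invalid HTML: %s (line %s, column %s)" % (exc.msg, exc.lineno, exc.offset))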

When I found this answer, the score was -1. But this is the only one that works for me without installing anything else. Thank you. – HuongOrchid May 09 '19 at 13:39
In my case the Python W3C/HTML CLI validation packages did not work (as of Sept 2016), so I did it manually using requests, like so.
Code:
r = requests.post('https://validator.w3.org/nu/',
                  data=open('FILE.html', 'rb').read(),
                  params={'out': 'json'},
                  headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
                           'Content-Type': 'text/html; charset=UTF-8'})
print r.json()
in the console:
$ echo '<!doctype html><html lang=en><head><title>blah</title></head><body></body></html>' | tee FILE.html
$ pip install requests
$ python
Python 2.7.12 (default, Jun 29 2016, 12:46:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.post('https://validator.w3.org/nu/',
... data=open('FILE.html', 'rb').read(),
... params={'out': 'json'},
... headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
... 'Content-Type': 'text/html; charset=UTF-8'})
>>> r.text
u'{"messages":[]}\n'
>>> r.json()
{u'messages': []}
More documentation here: python requests, W3C Validator API.
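Each entry in messages describes one problem; the Nu checker's JSON output includes fields such as type, lastLine, lastColumn and message (treat the exact field names as an assumption and read them defensively):
report = r.json()
for message in report.get('messages', []):
    # print one line per reported problem; missing fields come back as None
    print("%s: line %s, column %s: %s" % (message.get('type'),
                                          message.get('lastLine'),
                                          message.get('lastColumn'),
                                          message.get('message')))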
