What's the best way to go about validating that a document follows some version of HTML (preferably one that I can specify)? I'd like to be able to know where the failures occur, as in a web-based validator, except in a native Python app.
Please note that validation is different from tidying! Some of the answers that people are posting are about automatically correcting HTML, instead of merely verifying whether the HTML is valid or not. – Flimm May 26 '17 at 12:00
9 Answers
PyTidyLib is a nice Python binding for HTML Tidy. Their example:
from tidylib import tidy_document
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''',
                                 options={'numeric-entities': 1})
print(document)
print(errors)
Moreover, it's compatible with both the legacy HTML Tidy and the new tidy-html5.
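If you only need a verdict rather than a tidied document, the errors string can be scanned for the reported problems. A minimal sketch, assuming HTML Tidy's usual "line L column C - Severity: message" report format (treat the exact prefixes as an assumption):
from tidylib import tidy_document

document, errors = tidy_document('<p>fõo <img src="bar.jpg">',
                                 options={'numeric-entities': 1})
# Each non-empty line of `errors` is one diagnostic, e.g.
# "line 1 column 1 - Warning: missing <!DOCTYPE> declaration"
problems = [line for line in errors.splitlines() if line.strip()]
is_valid = not any('Error:' in line for line in problems)
print(is_valid)
for problem in problems:
    print(problem)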

XHTML is easy, use lxml.
from lxml import etree
from io import StringIO
etree.parse(StringIO(html), etree.HTMLParser(recover=False))
HTML is harder, since there's traditionally not been as much interest in validation among the HTML crowd (run StackOverflow itself through a validator, yikes). The easiest solution would be to execute external applications such as nsgmls or OpenJade, and then parse their output.
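For the external-application route, here is a minimal sketch that runs nsgmls via the standard library's subprocess module and collects its diagnostics; the exact flags and message format depend on your nsgmls/OpenSP build, so treat them as assumptions:
import subprocess

def validate_with_nsgmls(path):
    """Run nsgmls on an SGML/HTML file and return its diagnostic lines.
    Assumes nsgmls is on the PATH and the document declares a DOCTYPE
    so the matching DTD can be located."""
    # -s suppresses normal output; diagnostics are written to stderr,
    # typically in the form "nsgmls:FILE:LINE:COLUMN:E: message"
    result = subprocess.run(["nsgmls", "-s", path],
                            capture_output=True, text=True)
    return result.stderr.splitlines()

for diagnostic in validate_with_nsgmls("page.html"):
    print(diagnostic)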

I think the most elegant way is to invoke the W3C Validation Service at
http://validator.w3.org/
programmatically. Few people know that you do not have to screen-scrape the results page, because the service returns non-standard HTTP header parameters
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid (or Valid)
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
that indicate the validity and the number of errors and warnings.
For instance, the command line
curl -I "http://validator.w3.org/check?uri=http%3A%2F%2Fwww.stalsoft.com"
returns
HTTP/1.1 200 OK
Date: Wed, 09 May 2012 15:23:58 GMT
Server: Apache/2.2.9 (Debian) mod_python/3.3.1 Python/2.5.2
Content-Language: en
X-W3C-Validator-Recursion: 1
X-W3C-Validator-Status: Invalid
X-W3C-Validator-Errors: 6
X-W3C-Validator-Warnings: 0
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Connection: close
Thus, you can elegantly invoke the W3C Validation Service and extract the results from the HTTP header:
# Programmatic XHTML Validations in Python
# Martin Hepp and Alex Stolz
# mhepp@computer.org / alex.stolz@ebusiness-unibw.org

import urllib
import urllib2

URL = "http://validator.w3.org/check?uri=%s"
SITE_URL = "http://www.heppnetz.de"

# pattern for HEAD request taken from
# http://stackoverflow.com/questions/4421170/python-head-request-with-urllib2
request = urllib2.Request(URL % urllib.quote(SITE_URL))
request.get_method = lambda: 'HEAD'
response = urllib2.urlopen(request)

valid = response.info().getheader('X-W3C-Validator-Status')
if valid == "Valid":
    valid = True
else:
    valid = False
errors = int(response.info().getheader('X-W3C-Validator-Errors'))
warnings = int(response.info().getheader('X-W3C-Validator-Warnings'))

print "Valid markup: %s (Errors: %i, Warnings: %i) " % (valid, errors, warnings)

If security is an issue, you can easily install a local copy of the W3C validator, as described here: https://validator.w3.org/nu/about.html – Martin Hepp Mar 30 '21 at 18:04
You can decide to install the HTML validator locally and create a client to request the validation.
Here is a program I made to validate a list of URLs from a text file. I was just checking the HEAD to get the validation status, but if you do a GET you get the full results (see the sketch after the code below). Look at the validator's API; there are plenty of options for it.
import httplib2
import time

h = httplib2.Http(".cache")

f = open("urllistfile.txt", "r")
urllist = f.readlines()
f.close()

for url in urllist:
    # wait 10 seconds before the next request - be nice with the validator
    time.sleep(10)
    resp = {}
    url = url.strip()
    urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
    try:
        resp, content = h.request(urlrequest, "HEAD")
        if resp['x-w3c-validator-status'] == "Abort":
            print url, "FAIL"
        else:
            print url, resp['x-w3c-validator-status'], resp['x-w3c-validator-errors'], resp['x-w3c-validator-warnings']
    except:
        pass
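For example, swapping HEAD for GET in the same request returns the full validation report in the response body. A small sketch for a single, hypothetical URL:
import httplib2

h = httplib2.Http(".cache")
url = "http://example.com/"  # hypothetical URL to check
urlrequest = "http://qa-dev.w3.org/wmvs/HEAD/check?doctype=HTML5&uri=" + url
resp, content = h.request(urlrequest, "GET")
print(resp.get('x-w3c-validator-status'))
print(content[:500])  # beginning of the full report page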

The html5lib module can be used to validate an HTML5 document:
>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parse('<html></html>')
Traceback (most recent call last):
...
html5lib.html5parser.ParseError: Unexpected start tag (html). Expected DOCTYPE.
Sadly, `html5lib` [doesn't validate](http://stackoverflow.com/a/29992363/593047). – ron rothman Jan 25 '17 at 01:15

Try tidylib. You can get some really basic bindings as part of the elementtidy module (which builds ElementTrees from HTML documents): http://effbot.org/downloads/#elementtidy
>>> import _elementtidy
>>> xhtml, log = _elementtidy.fixup("<html></html>")
>>> print log
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 7 - Warning: discarding unexpected </html>
line 1 column 14 - Warning: inserting missing 'title' element
Parsing the log should give you pretty much everything you need.
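A minimal sketch of turning that log into structured (line, column, severity, message) records, assuming the "line L column C - Severity: message" format shown above:
import re

# Matches lines such as:
#   line 1 column 1 - Warning: missing <!DOCTYPE> declaration
LOG_LINE = re.compile(r"line (\d+) column (\d+) - (\w+): (.*)")

def parse_tidy_log(log):
    issues = []
    for raw in log.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            line_no, column, severity, message = match.groups()
            issues.append((int(line_no), int(column), severity, message))
    return issues

issues = parse_tidy_log(log)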

Here is an HTML validator based on lxml's HTMLParser. It is not a complete HTML validator, but it (1) does many of the most important checks, (2) does not require an internet connection, and (3) does not require a large library.
_html_parser = None

def validate_html(html):
    '''If lxml can properly parse the html, return the lxml representation.
    Otherwise raise.'''
    global _html_parser
    from lxml import etree
    from io import StringIO
    if not _html_parser:
        _html_parser = etree.HTMLParser(recover=False)
    return etree.parse(StringIO(html), _html_parser)
Note that this will not check for closing tags, so for example, the following will pass:
validate_html("<a href='example.com'>foo")
> <lxml.etree._ElementTree at 0xb2fd888>
However, the following won't:
validate_html("<a href='example.com'>foo</a")
> XMLSyntaxError: End tag : expected '>', line 1, column 29
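To report where a document fails, you can catch the exception and read its position attributes. A sketch, assuming lxml's XMLSyntaxError exposes the usual SyntaxError attributes (lineno, offset, msg):
from lxml import etree

try:
    validate_html("<a href='example.com'>foo</a")
except etree.XMLSyntaxError as exc:
    # lineno/offset/msg are assumed to be populated by lxml
    print("Invalid HTML: %s (line %s, column %s)" % (exc.msg, exc.lineno, exc.offset))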

When I found this answer, the score was -1. But this is the only one that works for me without installing anything else. Thank you. – HuongOrchid May 09 '19 at 13:39
In my case the Python W3C/HTML CLI validation packages did not work (as of Sept 2016), so I did it manually using requests, like so.
Code:
r = requests.post('https://validator.w3.org/nu/',
                  data=open('FILE.html', 'rb').read(),
                  params={'out': 'json'},
                  headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
                           'Content-Type': 'text/html; charset=UTF-8'})
print r.json()
in the console:
$ echo '<!doctype html><html lang=en><head><title>blah</title></head><body></body></html>' | tee FILE.html
$ pip install requests
$ python
Python 2.7.12 (default, Jun 29 2016, 12:46:54)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> r = requests.post('https://validator.w3.org/nu/',
... data=open('FILE.html', 'rb').read(),
... params={'out': 'json'},
... headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
... 'Content-Type': 'text/html; charset=UTF-8'})
>>> r.text
u'{"messages":[]}\n'
>>> r.json()
{u'messages': []}
More documentation here: python requests, W3C Validator API.
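Each entry in messages describes one problem; the Nu checker's JSON output includes fields such as type, lastLine, lastColumn and message (treat the exact field names as an assumption and read them defensively):
report = r.json()
for message in report.get('messages', []):
    # print one line per reported problem; missing fields come back as None
    print("%s: line %s, column %s: %s" % (message.get('type'),
                                          message.get('lastLine'),
                                          message.get('lastColumn'),
                                          message.get('message')))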
