Validating whether a string is valid HTML in Python?

Question

What is the best way to check whether a string contains valid HTML with correct syntax?

I tried HTMLParser from the html.parser module, assuming that if parsing raised no error, the string was valid HTML. However, that didn't help, because it parses even invalid strings without raising any errors.

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to raise an exception, since the closing tag is missing, but it didn't.

Which Python version do you use? https://docs.python.org/3/library/html.parser.html - quote: "This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.". You can also read this answer: https://stackoverflow.com/questions/24749103/check-if-html-tag-is-self-closing-htmlparser-python — s3n0, Jul 04 '19 at 11:39
@s3n0 I used Python 3. I hadn't seen that documentation. Is there another library that is recommended in such cases? — Sumit, Jul 04 '19 at 11:44
Of course... as I mentioned before... read this answer please: https://stackoverflow.com/a/27174001/9808870 ...it might help if you need it. — s3n0, Jul 04 '19 at 11:46
The answer in the above link only checks whether the tag is self-closing. What I want is to find out whether a string is valid HTML text. @s3n0 — Sumit, Jul 04 '19 at 11:52
Yes, the same module may be used :) (`from bs4 import BeautifulSoup`). Read this question and answer: https://stackoverflow.com/questions/24856035/how-to-detect-with-python-if-the-string-contains-html-code — s3n0, Jul 04 '19 at 12:14

score 4 · Accepted Answer · answered Jul 04 '19 at 11:57

    from bs4 import BeautifulSoup

    st = """<html>
    <head><title>I'm title</title></head>
    </html>"""
    st1 = "who are you"

    # find() returns the first tag in the parsed document, or None if there is none
    print(bool(BeautifulSoup(st, "html.parser").find()))   # True
    print(bool(BeautifulSoup(st1, "html.parser").find()))  # False

Rahul Verma

This doesn't work. It returns `True` for invalid HTML like `div>` and `div<` – Guilherme Garnier, Jun 02 '22 at 18:35

score 3 · Answer 2 · answered Jul 04 '19 at 12:41

The standard HTMLParser from html.parser doesn't validate HTML. It only tokenizes the string into start tags, end tags, and data, without checking that they match.
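
Since HTMLParser only reports tokens, a rough offline check can be layered on top of it by tracking open tags on a stack. The following is a minimal sketch, not a real validator: `TagBalanceChecker`, `find_tag_problems`, and `VOID_ELEMENTS` are hypothetical names made up for this example, and the check only looks at tag balance. It is also stricter than HTML itself, which allows some end tags (such as `</p>` or `</li>`) to be omitted.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML, so they are not tracked.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records unbalanced tags while HTMLParser tokenizes."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of start tags still waiting for an end tag
        self.problems = []   # human-readable descriptions of what went wrong

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected end tag </{tag}>")


def find_tag_problems(html):
    """Return a list of tag-balance problems; an empty list means none were found."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Anything left on the stack was opened but never closed.
    checker.problems.extend(f"unclosed element <{tag}>" for tag in checker.open_tags)
    return checker.problems


print(find_tag_problems("<h1> hi"))       # the string from the question
print(find_tag_problems("<h1> hi</h1>"))  # balanced, so no problems are reported
```

An empty list here only means the tags are balanced. It says nothing about doctype, required child elements, or attribute validity, which is what a full validator checks.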

You might want to take a look at py_w3c. The module doesn't appear to be actively maintained, but it is effective at identifying errors:

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))

$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.

This solution (py_w3c) sends the HTML to the W3C server, so it is not usable offline and generates unnecessary traffic — Jan Wilmans, Jan 23 '20 at 21:12
