
What is the best way to check whether a string contains valid HTML with correct syntax?

I tried HTMLParser from the html.parser module, on the assumption that if parsing produces no error, the string is valid HTML. However, this didn't help, because it parses even invalid strings without raising any errors.

from html.parser import HTMLParser

parser = HTMLParser()

parser.feed('<h1> hi')
parser.close()

I expected it to raise an exception or error, since the closing tag is missing, but it didn't.

Sumit
  • Which Python version do you use? https://docs.python.org/3/library/html.parser.html - quote: "This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.". You can also read this answer: https://stackoverflow.com/questions/24749103/check-if-html-tag-is-self-closing-htmlparser-python – s3n0 Jul 04 '19 at 11:39
  • @s3n0 I used Python 3. I didn't see that documentation. Is there some other library that is recommended in such cases? – Sumit Jul 04 '19 at 11:44
  • Of course... as I mentioned before... read this answer please: https://stackoverflow.com/a/27174001/9808870 ...it might help if you need it. – s3n0 Jul 04 '19 at 11:46
  • The answer in above link only checks if the tag is self closing or not. What I want is to find out if a string is valid html text. @s3n0 – Sumit Jul 04 '19 at 11:52
  • Yes, the same module may be used :) (`from bs4 import BeautifulSoup`). Read this question and answer: https://stackoverflow.com/questions/24856035/how-to-detect-with-python-if-the-string-contains-html-code – s3n0 Jul 04 '19 at 12:14

2 Answers

    from bs4 import BeautifulSoup

    st = """<html>
    <head><title>I'm title</title></head>
    </html>"""
    st1 = "who are you"

    # find() returns the first tag in the parsed document, or None if there is none
    print(bool(BeautifulSoup(st, "html.parser").find()))   # True
    print(bool(BeautifulSoup(st1, "html.parser").find()))  # False
Rahul Verma

The HTMLParser from html.parser doesn't report HTML tagging errors; it only tokenizes the contents of the string.
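Because HTMLParser only emits start-tag and end-tag events, any balance checking has to be layered on top of it. The following is a minimal sketch of that idea, not a full validator: `TagBalanceChecker`, `find_tag_problems`, and the `VOID_ELEMENTS` set are hypothetical names introduced here for illustration. It assumes every non-void element must be closed explicitly, which is stricter than the HTML specification (end tags for elements such as `<p>` and `<li>` are optional there).

```python
from html.parser import HTMLParser

# Assumption: elements that never take an end tag (HTML void elements).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records unbalanced tags while parsing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of start tags that are still open
        self.problems = []   # descriptions of mismatched end tags

    def handle_starttag(self, tag, attrs):
        # Void elements are never pushed because they have no end tag.
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # An end tag is accepted only if it closes the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected end tag </{tag}>")


def find_tag_problems(markup):
    """Return a list of tag-balance problems; an empty list means balanced."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    problems = list(checker.problems)
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed element <{tag}>" for tag in checker.open_tags)
    return problems


if __name__ == "__main__":
    for problem in find_tag_problems("<h1> hi"):
        print(problem)
```

This only checks tag nesting. It says nothing about doctype, required child elements, or attribute rules, so it is not a substitute for a real validator.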

You might want to take a look at py_w3c. It doesn't look like anybody maintains this module, but it is effective at identifying errors:

from py_w3c.validators.html.validator import HTMLValidator


val = HTMLValidator()
val.validate_fragment("<h1> hey yo")

for error in val.errors:
    print(error.get("message"))
$ python3.7 html-parser.py
Start tag seen without seeing a doctype first. Expected “<!DOCTYPE html>”.
Element “head” is missing a required instance of child element “title”.
End of file seen and there were open elements.
Unclosed element “h1”.
Carlos Damázio
  • This solution (py_w3c) sends the HTML to the W3C server, so it is not usable offline and generates unnecessary traffic – Jan Wilmans Jan 23 '20 at 21:12