36

How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.

Update:

example inputs:

html:

<head><title>I'm title</title></head>
Hello, <b>world</b>

non-html:

<ht fldf d><
<html><head> head <body></body> html
static

6 Answers

55

You can use an HTML parser such as BeautifulSoup. Note that it tries its best to parse any input, even broken HTML, and how lenient it is depends on the underlying parser:

>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False

This tries to find any HTML element inside the string. If one is found, the result is True.

Another example with an HTML fragment:

>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True

Alternatively, you can use lxml.html:

>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False
alecxe
  • what about `non_html = "<html><head> head <body></body> html"`? `bool(BeautifulSoup(non_html, "html.parser").find())` gives `True`. It is not an HTML snippet – static Jul 21 '14 at 00:33
  • even `non_html = "<html><head> head <body></body> html dkslfjglangaiowmgiowe"` will pass the test :( – static Jul 21 '14 at 00:37
  • @static well, this would be `True`, because `BeautifulSoup` tries its best to parse the HTML and be lenient. It would transform it to ` head html`. – alecxe Jul 21 '14 at 00:37
  • it is nice that it lets many problematic cases through, but it looks like it lets too much through: `non_html = "<ht fldf d><"` will also work – static Jul 21 '14 at 00:40
  • @static yeah, in this case it thinks `fldf` and `d` are attributes and `ht` tag is just not closed. Nice examples, thanks :) – alecxe Jul 21 '14 at 00:49
  • @static I've added an alternative solution, please check if it works for you. – alecxe Jul 21 '14 at 00:52
  • Note that `BeautifulSoup('This is not an html', 'lxml').find()` returns `<html><body><p>This is not an html</p></body></html>`, so use `html.parser`. – Vitaly Zdanevich Nov 15 '17 at 07:51
  • `lxml.html.fromstring('').find('.//*') is not None` returns false. Shouldn't it be true? – cointreau Sep 07 '22 at 22:59
8

One way I thought of was to collect the start and end tags found while attempting to parse the text as HTML, then intersect that set with a known set of acceptable HTML elements.

Example:

#!/usr/bin/env python

from __future__ import print_function

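# Python 3 location of the parser (the module was named HTMLParser in Python 2).
from html.parser import HTMLParser

# Note: html5lib.sanitizer is only available in older html5lib releases;
# later releases moved the sanitizer to html5lib.filters.sanitizer.
from html5lib.sanitizer import HTMLSanitizerMixin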


class TestHTMLParser(HTMLParser):

    def __init__(self, *args, **kwargs):
        HTMLParser.__init__(self, *args, **kwargs)

        self.elements = set()

    def handle_starttag(self, tag, attrs):
        self.elements.add(tag)

    def handle_endtag(self, tag):
        self.elements.add(tag)


def is_html(text):
    elements = set(HTMLSanitizerMixin.acceptable_elements)

    parser = TestHTMLParser()
    parser.feed(text)

    return bool(parser.elements.intersection(elements))


print(is_html("foo bar"))
print(is_html("<p>Hello World!</p>"))
print(is_html("<html><head><title>Title</title></head><body><p>Hello!</p></body></html>"))  # noqa

Output:

$ python foo.py
False
True
True

This works for partial text that contains a subset of HTML elements.

NB: This relies on html5lib, so it may not work for other document types, but the technique is easy to adapt.
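The same intersection idea can be sketched with the standard library alone. This is a minimal sketch rather than the code above: `KNOWN_TAGS` is a hypothetical, deliberately incomplete allow-list standing in for html5lib's element list, and `TagCollector` and `contains_known_html` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete allow-list; extend it for real use.
KNOWN_TAGS = frozenset({
    "html", "head", "title", "body", "div", "span", "p", "a", "b", "i",
    "em", "strong", "ul", "ol", "li", "table", "tr", "td", "th", "br",
    "img", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "code",
})


class TagCollector(HTMLParser):
    """Records the name of every start and end tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        self.elements.add(tag)

    def handle_endtag(self, tag):
        self.elements.add(tag)


def contains_known_html(text):
    """Return True if the text contains at least one tag from KNOWN_TAGS."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()  # flush any input still buffered by the parser
    return bool(collector.elements & KNOWN_TAGS)


print(contains_known_html("foo bar"))
print(contains_known_html("<p>Hello World!</p>"))
print(contains_known_html("<ht fldf d><"))
```

Because a single known tag is enough, a string such as `<html><head> head <body></body> html` is still reported as HTML; this only filters out made-up tag names like `ht`.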

James Mills
1

You can easily extend the built-in HTMLParser, which already handles the parsing, and collect the start and end tags, attributes, and data. To check whether the document is valid, the number of start tags should match the number of end tags:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Collects start tags, end tags, attributes, and text data while parsing."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.attributes = []
        self.data = []

    def is_text_html(self):
        # Require at least one tag, otherwise plain text (0 == 0) would pass.
        return bool(self.start_tags) and len(self.start_tags) == len(self.end_tags)

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        self.attributes.append(attrs)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        # Store the text instead of printing it, so the check stays silent.
        self.data.append(data)

Then

>>> parser = MyHTMLParser()
>>> parser.feed("<head><title>I'm title</title></head>"
...             "Hello, <b>world</b>")
>>> parser.is_text_html()
True

>>> parser = MyHTMLParser()  # fresh parser, so earlier counts do not carry over
>>> parser.feed("<ht fldf d><"
...             "<html><head> head <body></body> html")
>>> parser.is_text_html()
False
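Counting alone cannot tell `<b><i>x</b></i>` from properly nested markup, and it rejects void elements such as `<br>` that never have an end tag. A stricter variant can be sketched with a stack. This is an illustrative sketch, not part of the answer above: `BalanceChecker`, `has_balanced_tags`, and the `VOID_ELEMENTS` set are names introduced here, and the check is stricter than real HTML, which allows some end tags (for example `</p>` or `</li>`) to be omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumed list of HTML void elements).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and flags any mismatched end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.seen_tag = False
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.seen_tag = True
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        self.seen_tag = True
        if tag in VOID_ELEMENTS:
            return  # tolerate a stray end tag for a void element
        # The end tag must close the most recently opened element.
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def has_balanced_tags(text):
    """Return True if the text has at least one tag and all tags nest correctly."""
    checker = BalanceChecker()
    checker.feed(text)
    checker.close()
    return checker.seen_tag and checker.balanced and not checker.stack


print(has_balanced_tags("<head><title>I'm title</title></head>Hello, <b>world</b>"))
print(has_balanced_tags("<ht fldf d><<html><head> head <body></body> html"))
```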
Stefano Messina
1

If all you need to know is whether or not a string contains HTML, another solution not listed here is to use a regular expression like the following:

</?\s*[a-z-][^>]*\s*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)

Bear in mind that although this is much faster than using an HTML parser, it can be inaccurate depending on the complexity of the HTML markup you're expecting.

Here is a test of the above regex for a general idea of its coverage.
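In Python, the pattern above could be applied roughly as follows. This is a minimal sketch: the pattern is copied unchanged, while the `re.IGNORECASE` flag and the names `HTML_PATTERN` and `looks_like_html` are assumptions added here.

```python
import re

# Pattern copied from the answer; IGNORECASE is an added assumption so that
# upper-case tags such as <B> are matched as well.
HTML_PATTERN = re.compile(
    r"</?\s*[a-z-][^>]*\s*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)",
    re.IGNORECASE,
)


def looks_like_html(text):
    """Return True if the text contains something shaped like a tag or entity."""
    return HTML_PATTERN.search(text) is not None


print(looks_like_html("Hello, <b>world</b>"))
print(looks_like_html("This is not an html"))
print(looks_like_html("a < b and c > d"))
```

The last call illustrates the inaccuracy mentioned above: `< b and c >` has the shape of a tag, so plain text with comparison operators is reported as HTML.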

Cabrera
-4

Check for ending tags. I believe this is the simplest and most robust approach.

"</html>" in possibly_html

If there is an ending html tag, then it looks like html, otherwise not so much.

Andrew Johnson
  • This is a good answer, assuming the input is a complete HTML page (has ` – okoboko Jul 21 '14 at 00:09
  • This method can be expanded to search for any html ending tag such as b. A regular expression might make it faster, but the underlying principle remains the same. – Andrew Johnson Jul 21 '14 at 00:13
  • As you don't know what potential HTML tags are in the text up-front this technique won't work so well :/ – James Mills Jul 21 '14 at 00:32
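The expansion to any ending tag suggested in the comments could look roughly like this. It is a sketch under assumptions: `CLOSING_TAG` and `has_closing_tag` are illustrative names, and the pattern accepts any closing tag whose name is made of ASCII letters and digits, so the tags do not need to be known up front.

```python
import re

# Any closing tag: "</", a letter, optional letters or digits, optional
# whitespace, then ">". Tag names containing hyphens are not covered.
CLOSING_TAG = re.compile(r"</[a-zA-Z][a-zA-Z0-9]*\s*>")


def has_closing_tag(text):
    """Return True if the text contains at least one closing tag."""
    return CLOSING_TAG.search(text) is not None


print(has_closing_tag("Hello, <b>world</b>"))
print(has_closing_tag("<ht fldf d><"))
print(has_closing_tag("line<br>break"))
```

Markup that consists only of elements without an end tag, such as `<br>`, is missed by this check.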
-6

Expanding on the previous post, I would do something like this for something quick and simple:

import os

if os.path.exists("file.html"):
    ishtml = False
    # A context manager closes the file even if reading fails.
    with open("file.html", mode="r", encoding="utf-8") as checkfile:
        for line in checkfile:
            # Only a line consisting solely of the closing tag counts.
            if line.strip() == "</html>":
                ishtml = True
                break
    if ishtml:
        print("This is an html file")
    else:
        print("This is not an html file")
Donkyhotay
  • what about partials, and what about non-styled HTML (i.e. `</html>` is not the whole line)? (OK, here one can use "contains" instead of "==") – static Jul 21 '14 at 00:42
  • You're right didn't think of that, in that case I would probably modify *if line == " – Donkyhotay Jul 21 '14 at 01:05