I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. Good examples would be HTML with missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?
Were any of these answers what you were looking for? If you need more info we can certainly help. – JudoWill Jun 20 '10 at 21:17
@JudoWill: Yeah, I was able to get BeautifulSoup and Tidy set up. Unfortunately they weren't catching a lot of the issues I was having. I ended up building my own function to cycle through the DOM and fix the issues. Thanks for the help! – Joel Jun 21 '10 at 02:55
Could you post your own function as an answer? This is an issue that I have a lot of the time and I'm always looking for new solutions. :) – JudoWill Jun 21 '10 at 14:38
5 Answers
I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree you can just output the result.
from bs4 import BeautifulSoup
# bad_html is the malformed markup as a string; naming the parser avoids bs4's "no parser specified" warning
tree = BeautifulSoup(bad_html, "html.parser")
good_html = tree.prettify()
I've used this many times and it works wonders. BeautifulSoup really shines if all you need is to pull the data out of bad HTML.
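To make the repair step concrete, here is a minimal sketch, assuming bs4 is installed and using the stdlib-backed "html.parser" backend. The function name repair_html and the sample markup are made up for illustration. Other backends (lxml, html5lib) may repair the same input differently, so treat the result as one possible fix rather than the canonical one.

```python
from bs4 import BeautifulSoup


def repair_html(bad_html: str) -> str:
    """Parse possibly malformed HTML and serialize the repaired tree.

    repair_html is a hypothetical helper name, not part of bs4.
    """
    # Tags still open at the end of the input are closed by the tree builder.
    soup = BeautifulSoup(bad_html, "html.parser")
    # str(soup) keeps the original whitespace; prettify() would re-indent it.
    return str(soup)


# Sample input with two tags that are never closed.
broken = "<p>unclosed <b>bold"
print(repair_html(broken))
```

Using str(soup) instead of prettify() is a design choice: prettify() inserts newlines and indentation, which can change how inline content renders.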
@Tarantula: I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there. – JudoWill Jun 19 '10 at 01:44
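An example of cleaning up HTML using the Cleaner class from lxml.html.clean. Requires the lxml module: pip install lxml (it's a native module written in C, so it might be faster than pure-Python solutions). As far as I know, lxml 5.2 and later ship the cleaner as a separate package, so there you would also need pip install lxml_html_clean.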
import sys
from lxml.html.clean import Cleaner

def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                      meta=True,
                      embedded=True,
                      links=True,
                      style=True,
                      processing_instructions=True,
                      inline_style=True,
                      scripts=True,
                      javascript=True,
                      comments=True,
                      frames=True,
                      forms=True,
                      annoying_tags=True,
                      remove_unknown_tags=True,
                      safe_attrs_only=True,
                      safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                      remove_tags=('span', 'font', 'div')
                      )
    return cleaner.clean_html(dirty_html)

if __name__ == '__main__':
    with open(sys.argv[1]) as fin:
        print(sanitize(fin.read()))
Check out the docs for a full list of options you can pass to the Cleaner.

How can it completely remove tags (e.g. div) with a specific 'id' or 'class', including their text? – Lexx Luxx Sep 02 '21 at 12:44
@triwo: this is not supported out of the box, but you can parse the markup and remove the nodes by class or id with lxml; e.g. see https://stackoverflow.com/questions/8226490 – ccpizza Sep 02 '21 at 22:04
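Following up on the comment above, one way to remove elements by id or class is to parse the markup with lxml.html, select the nodes with XPath, and call drop_tree() on each match. This is only a sketch under the assumption that lxml is installed: drop_elements, CLASS_XPATH and the sample markup are hypothetical names for illustration, not part of lxml.

```python
import lxml.html

# Matches one class token inside a space-separated class attribute.
CLASS_XPATH = (
    '//*[contains(concat(" ", normalize-space(@class), " "), '
    'concat(" ", $cls, " "))]'
)


def drop_elements(markup, element_id=None, css_class=None):
    """Remove elements, including their children and text, by id and/or class."""
    root = lxml.html.fromstring(markup)
    matches = []
    if element_id is not None:
        # XPath variables ($wanted) avoid quoting problems in the id value.
        matches += root.xpath("//*[@id=$wanted]", wanted=element_id)
    if css_class is not None:
        matches += root.xpath(CLASS_XPATH, cls=css_class)
    for node in matches:
        # The root element (or a node already removed) has no parent; skip it.
        if node.getparent() is not None:
            # drop_tree() removes the element and its content but keeps
            # the text that followed it (its tail) in the document.
            node.drop_tree()
    return lxml.html.tostring(root, encoding="unicode")


sample = '<div><div id="ad" class="box promo">buy now</div><p class="box">keep me</p> tail</div>'
print(drop_elements(sample, element_id="ad"))
```

You could run this before or after Cleaner.clean_html(), depending on whether the cleaner is configured to strip the id and class attributes you want to match on.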
There are Python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code -- there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.
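Since some review by hand is usually still needed, it can help to know where the markup is unbalanced before deciding how to fix it. Below is a rough sketch using only the standard library's html.parser. TagBalanceChecker and find_tag_problems are made-up names, and the check is deliberately naive: it also reports tags whose end tag is optional in HTML (such as li or p), so expect some false positives.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Collects stray end tags and start tags that are never closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line, column)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, *self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _, _ in self.open_tags]:
            line, col = self.getpos()
            self.problems.append(f"line {line}, col {col}: stray </{tag}>")
            return
        # Anything opened after the matching start tag was never closed.
        while self.open_tags:
            name, line, col = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(f"line {line}, col {col}: <{name}> never closed")


def find_tag_problems(markup):
    """Return a list of human-readable problem descriptions."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Whatever is still on the stack at end of input was never closed.
    for name, line, col in checker.open_tags:
        checker.problems.append(f"line {line}, col {col}: <{name}> never closed")
    return checker.problems


for problem in find_tag_problems("<div><b>bold</div></span>"):
    print(problem)
```

This does not repair anything; it only produces a list to review before or after running Tidy or another cleaner.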

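I am using lxml to convert HTML to proper (well-formed) XML:
from lxml import etree
# etree.HTML() uses lxml's forgiving HTML parser; input_text is the raw HTML string
tree = etree.HTML(input_text.replace('\r', ''))
# encoding="unicode" makes tostring() return str rather than bytes, so join() works on Python 3
output_text = '\n'.join(
    etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
    for stree in tree
)
... and I do a lot of removing of 'dangerous elements' in between.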

This can be done using the tidy_document function in the tidylib module (installed as the pytidylib package, which wraps the HTML Tidy C library).
import tidylib
html = '<html>...</html>'
input_encoding = 'utf8'
options = {
    "output-xhtml": True,  # or "output-xml": True
    "quiet": True,
    "show-errors": 0,
    "force-output": True,
    "numeric-entities": True,
    "show-warnings": False,
    "input-encoding": input_encoding,
    "output-encoding": "utf8",
    "indent": False,
    "tidy-mark": False,
    "wrap": 0,
}
# tidy_document returns the cleaned markup and Tidy's error report as a string
document, errors = tidylib.tidy_document(html, options=options)
