I would like to use bleach to format some potentially unclean HTML. In the following sample, ideally bleach should remove:
- the extra spaces in the first opening <p >
- the attribute in the closing link tag </a attr="test">
- the extra spaces in the last closing </p >
My code looks like this:
import bleach
html = """<p >This <a href="book"> book </a attr="test"> will help you</p >"""
html_cleaned = bleach.clean(html)
# html_cleaned is:
# '&lt;p &gt;This <a href="book"> book </a> will help you&lt;/p&gt;'
As you can see, bleach is very inconsistent:
- the < and > of the opening and closing p tag are escaped to &lt; and &gt;. For the link tag, this doesn't happen
- the spaces in </p > are removed, in the opening <p > they are not
- additionally, if I add an attribute to the closing p tag, </p attr="test">, it is not removed, while for the closing </a attr="test"> the illegal attribute is removed.
What is happening here?
<p> is not allowed by default?
– Drublic Dec 15 '19 at 12:00
<p> seems to be one of the most basic building blocks of html.
– lhk Dec 15 '19 at 12:31
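Following up on the comment above: p is not in bleach's default ALLOWED_TAGS, so with the defaults both p tags are treated as disallowed markup and escaped as text instead of being parsed and re-serialized like the a tag. A minimal sketch of passing an extended tag list, assuming that all you want is to allow p on top of the defaults (the names allowed_tags, html_default and html_allowed are made up for illustration, and the exact output may vary between bleach versions):

```python
import bleach
from bleach.sanitizer import ALLOWED_TAGS

html = """<p >This <a href="book"> book </a attr="test"> will help you</p >"""

# Default behaviour: p is not an allowed tag, so it is escaped rather than cleaned.
html_default = bleach.clean(html)

# Assumption: allowing p in addition to the defaults is acceptable for this input.
# A set is used so this works whether ALLOWED_TAGS is a list or a frozenset.
allowed_tags = set(ALLOWED_TAGS) | {"p"}
html_allowed = bleach.clean(html, tags=allowed_tags)

print(html_default)
print(html_allowed)
```

With p allowed, the p tags should go through the same parse-and-serialize path as the link tag, which is where the stray spaces and the attribute on the closing tag get dropped.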