Bleach strips non-whitelisted tags from HTML, but leaves child nodes, e.g.
>>> import bleach
>>> bleach.clean('<a href="">stays</a>', strip=True, tags=[])
'stays'
>>>
How can the entire element along with its children be removed?
You should use lxml. Bleach is simply for cleaning data and ensuring the safety of the markup you store.
You can use lxml to parse structured data like HTML or XML.
Consider a simple HTML file, saved as hello_world.html:
<html>
<body>
<p>Hello, World!</p>
</body>
</html>
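from lxml import html

# Parse the file and get the root <html> element.
root = html.parse("hello_world.html").getroot()
# encoding="unicode" returns a str rather than bytes on Python 3.
print(html.tostring(root, encoding="unicode"))
# <html><body><p>Hello, World!</p></body></html>  (whitespace between tags omitted)

# drop_tree() removes the element together with all of its children.
p = root.find("body/p")
p.drop_tree()
print(html.tostring(root, encoding="unicode"))
# <html><body></body></html>  (whitespace between tags omitted)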
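The question works on a string rather than a file, so here is a minimal sketch of the same idea applied to an HTML fragment. The helper name drop_elements and the temporary wrapper div are my own choices for illustration, not part of lxml or Bleach; the lxml calls used are fragment_fromstring, drop_tree and tostring.

```python
from lxml import html


def drop_elements(markup, tags):
    """Return `markup` with every element named in `tags` removed,
    children included. Hypothetical helper, not an lxml API."""
    unwanted = set(tags)
    # Wrap the fragment in a temporary <div> so that leading text and
    # multiple top-level elements are accepted by the parser.
    root = html.fragment_fromstring(markup, create_parent="div")
    # Build the list first, because the tree is modified while looping.
    for el in list(root.iter()):
        if el is not root and el.tag in unwanted:
            # Removes the element and its children; the text that
            # follows the element (its tail) is kept.
            el.drop_tree()
    # Serialize the wrapper's contents without the wrapper itself.
    parts = [root.text or ""]
    parts.extend(html.tostring(child, encoding="unicode") for child in root)
    return "".join(parts)


print(repr(drop_elements('<a href="">stays</a>', ["a"])))
# ''
print(drop_elements('<p>Hello, <a href="">gone</a>World!</p>', ["a"]))
# <p>Hello, World!</p>
```

Note that this only removes elements; it does not sanitize attributes or anything else, so you would still run the result through Bleach if the input is untrusted.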
On a related note, if you want to look into some more advanced parsing with lxml, one of my oldest questions on here was about getting Python to parse XML and write Python code from it: "Writing a Python tool to convert XML to Python?"