How to remove links from HTML completely with Bleach?

Question

Bleach strips non-whitelisted tags from HTML, but leaves child nodes, e.g.

>>> import bleach
>>> bleach.clean('<a href="">stays</a>', strip=True, tags=[])
'stays'
>>>

How can the entire element along with its children be removed?

score 0 · Answer 1 · answered Sep 01 '20 at 21:26

You should use lxml for this. Bleach is meant for sanitizing markup so that what you store is safe; with strip=True it removes the disallowed tags but keeps their content.

You can use lxml to parse structured data like HTML or XML.

Consider a simple HTML file, hello_world.html:

<html>
<body>
<p>Hello, World!</p>
</body>
</html>

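from lxml import html

# Parse the file and get the root <html> element.
root = html.parse("hello_world.html").getroot()

# encoding="unicode" makes tostring() return str instead of bytes.
print(html.tostring(root, encoding="unicode"))
# <html><body><p>Hello, World!</p></body></html>   (ignoring whitespace)

# Locate the <p> element and drop it together with its children.
p = root.find("body/p")
p.drop_tree()

print(html.tostring(root, encoding="unicode"))
# <html><body></body></html>   (ignoring whitespace)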
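
Applied to the links from the question, the same idea might look like the sketch below. It works on an HTML string rather than a file; the helper name remove_links and the temporary <div> wrapper are assumptions made for illustration and are not part of Bleach or lxml.

```python
from lxml import html


def remove_links(fragment: str) -> str:
    """Return the fragment with every <a> element and its children removed."""
    # Wrap the input in a <div> so bare text and several top-level
    # elements can be parsed as a single tree.
    root = html.fragment_fromstring(fragment, create_parent="div")

    # Collect the links first so the tree is not modified while iterating.
    for link in list(root.iter("a")):
        # drop_tree() removes the element and its children,
        # but keeps the text that follows the element (its tail).
        link.drop_tree()

    # Serialize, then strip the temporary <div> wrapper again.
    wrapped = html.tostring(root, encoding="unicode")
    return wrapped[len("<div>"):-len("</div>")]


print(remove_links('before <a href="">goes</a> after'))
```

If the input is untrusted, the result can still be passed through bleach.clean afterwards, since removing links does not sanitize the rest of the markup.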

On a related note, if you want to look into more advanced parsing with lxml, one of my oldest questions here was about getting Python to parse XML and generate Python code from it: "Writing a Python tool to convert XML to Python?"
