
Is there a way to remove/escape HTML tags using lxml.html rather than BeautifulSoup (which has had some XSS issues)? I tried using Cleaner, but I want to remove all HTML.

Junior Mayhé
Timmy

3 Answers


I believe this code can help you:

from lxml.html.clean import Cleaner

html_text = "<html><head><title>Hello</title><body>Text</body></html>"
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)
cleaned_text = cleaner.clean_html(html_text)
dni
  • After a quick experiment this solution seems to be doing a much better job than this one for instance http://stackoverflow.com/a/5332984/787842, but what I'd like to know more about is the way to properly parametrize the `Cleaner` object (as there are many, many options); for instance in this case, having an empty `allow_tags` list and `remove_unknown_tags` set to `False` looks to me a bit weird, logically. – cjauvin May 11 '15 at 14:40
  • @cjauvin: Of course, you are right! It's a kind of hack. But I'm sure no one wants to specify every tag they need removed in the `remove_tags` argument if they want to remove all of them. Unfortunately, in this case the implementation of `Cleaner` encourages users to use `allow_tags` with `remove_unknown_tags` for this purpose https://github.com/lxml/lxml/blob/54a8bfedcd0f32274a4ebf9e2d8e391fe759aba5/src/lxml/html/clean.py#L387 – dni May 13 '15 at 12:31
  • This wraps the result in a div – cmc Jan 16 '19 at 07:55
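As the last comment notes, `clean_html` wraps the stripped result in a container element. A minimal sketch of that behavior and one way to drop the wrapper (the sample HTML is mine; the import fallback covers lxml ≥ 5.2, where the cleaner moved to the separate lxml-html-clean package):

```python
try:
    from lxml.html.clean import Cleaner
except ImportError:  # lxml >= 5.2 ships the cleaner as lxml-html-clean
    from lxml_html_clean import Cleaner
import lxml.html

html_text = "<html><body><p>Some</p> text</body></html>"
cleaner = Cleaner(allow_tags=[''], remove_unknown_tags=False)

# Tags are stripped, but the string result comes back wrapped in a container
wrapped = cleaner.clean_html(html_text)
print(wrapped)

# Re-parsing the result and taking text_content() drops the wrapper element
plain = lxml.html.fromstring(wrapped).text_content()
print(plain)
```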

Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags, etc.). For example:

from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')
tree = clean_html(tree)

text = tree.getroot().text_content()
Steven
  • I want to get rid of everything, not just unsafe tags – Timmy Oct 20 '10 at 13:26
  • 1
    If you want to get rid of everything, why not just `text=''`? ;-) Seriously, `text_content()` WILL get rid of all markup, but cleaning will also get rid of eg. css stylesheet rules and javascript, which are also encoded as text *inside* the element (but I assumed you were only interested in the "real" text, hence the cleanup first) – Steven Oct 20 '10 at 14:09
  • I was using clean_html(string), which does different things – Timmy Oct 20 '10 at 20:18
  • When I use html.fromstring instead of html.parse, I get an error: "AttributeError: 'HtmlElement' object has no attribute 'getroot'" – kommradHomer Jul 22 '14 at 08:00
  • 1
    @kommradHomer: that is because `parse()` returns an elementtree, but `fromstring()` returns an element (so you don't need the `getroot()` in your case) – Steven Jul 22 '14 at 09:16
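A small illustration of the distinction in that last comment, parsing from a string here rather than a URL:

```python
from io import StringIO
from lxml import html

snippet = "<html><body><p>Hello</p></body></html>"

# fromstring() returns an HtmlElement directly, so there is no getroot()
element = html.fromstring(snippet)
print(element.text_content())

# parse() returns an ElementTree; fetch its root element with getroot()
tree = html.parse(StringIO(snippet))
print(tree.getroot().text_content())
```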

This uses lxml's cleaning functions, but avoids the result being wrapped in an HTML element.

import lxml.html
import lxml.html.clean

doc = lxml.html.document_fromstring(html_string)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()

or as a one-liner:

lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False).clean_html(lxml.html.document_fromstring(html_string)).text_content()

It works by parsing the HTML into a document object manually and handing that to the Cleaner. That way clean_html also returns an object rather than a string, and the text can then be recovered without a wrapper element using the text_content() method.
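For instance, a self-contained run of this approach (the sample HTML is mine; on lxml ≥ 5.2 this additionally requires the lxml-html-clean package, which backs the lxml.html.clean module there):

```python
import lxml.html
import lxml.html.clean

source = "<html><body><b>Hello</b> <i>world</i>!</body></html>"

# Parse into a document element, clean it, then extract plain text
doc = lxml.html.document_fromstring(source)
cleaner = lxml.html.clean.Cleaner(allow_tags=[''], remove_unknown_tags=False)
text = cleaner.clean_html(doc).text_content()
print(text)  # -> Hello world!
```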

cmc