15
import sys
from lxml.html.clean import Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                          links=True, style=True, remove_tags=['a', 'li', 'td'])
        cleaned = cleaner.clean_html(text)
        print (len(cleaned) - len(text))
        return cleaned
    except Exception:
        print 'Error in clean_html'
        print sys.exc_info()
        return text

I put together the above (ugly) code in my initial forays into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so that in the end I'm left with just the text and nothing else. Try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

sadhu_
  • I'm not able to replicate the problem using http://stackoverflow.com/questions/2950131/python-lxml-cleaning-out-html-tags/2950223#2950223 as input. Could you provide a sample of the html and the desired output? – unutbu Jun 01 '10 at 16:45
  • @unutbu: this is most strange - I have a whole database where that code did not work, and yet it seems to be working just fine now (did you do something? :)). But while I'm at it, any idea how you could also take the link text out when removing the link? At the moment it leaves the text of the links in. – sadhu_ Jun 01 '10 at 18:05
  • @sadhu_: `remove_tags` removes only the tags themselves; it leaves their children and text. Use `kill_tags` to remove the whole subtree. – jfs Oct 31 '11 at 15:44
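As a minimal sketch of that difference (the HTML fragment here is made up; Python 3 syntax):

```python
# remove_tags strips only the tags, keeping the link text;
# kill_tags drops the element together with its text.
from lxml.html.clean import Cleaner

html = '<div>Hello <a href="http://example.com">world</a>!</div>'

print(Cleaner(remove_tags=['a']).clean_html(html))  # the text "world" survives
print(Cleaner(kill_tags=['a']).clean_html(html))    # "world" is gone as well
```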

3 Answers

15

The solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print document.text_content()

but this one helped me, concatenating the way I needed:

   from lxml import etree
   print "\n".join(etree.XPath("//text()")(document))
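A concrete comparison of the two approaches (Python 3 syntax; the two-paragraph fragment is made up for illustration):

```python
import lxml.html
from lxml import etree

doc = lxml.html.document_fromstring('<p>one</p><p>two</p>')

# text_content() runs the pieces together with no separator
print(doc.text_content())                       # onetwo
# joining the individual text nodes keeps them separable
print("\n".join(etree.XPath("//text()")(doc)))  # one / two on separate lines
```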
Robert Lujo
13

Not sure if this method existed around the time you asked your question, but if you go through

document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()

That should return all the text content of the HTML document, minus all the markup.

David
  • Check out Robert's answer below - a link for the lazy: http://stackoverflow.com/a/23929354/9908 – David Sep 14 '14 at 22:18
5

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(page).findAll(text=True))

Where `page` is your string of HTML.
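With the newer `bs4` package (an assumption on my part; the answer above uses the older BeautifulSoup 3 import), the same idea is a one-liner via `get_text()`:

```python
from bs4 import BeautifulSoup

# made-up fragment standing in for your `page` string
page = '<div>Hello <a href="#">world</a>!</div>'
print(BeautifulSoup(page, 'html.parser').get_text())  # Hello world!
```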

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

Bill the Lizard
KushalP
    It seems BS is deprecated (and googling seems to suggest lxml is the way forward), so ideally I wanted to learn some lxml [as the documentation is mildly bewildering]. – sadhu_ Jun 01 '10 at 18:07
  • BS rocks! With 4.0 rc out (a few months ago) you can use the parser from `lxml` or `html5lib` and wrap them in the nice BS api. – Sergio May 17 '11 at 00:07