15
import sys
from lxml.html.clean import Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                          links=True, style=True, remove_tags=['a', 'li', 'td'])
        cleaned = cleaner.clean_html(text)
        print (len(cleaned) - len(text))
        return cleaned
    except Exception:
        print 'Error in clean_html'
        print sys.exc_info()
        return text

I put together the above (ugly) code in my initial forays into Python land. I'm trying to use the lxml Cleaner to clean out a couple of HTML pages, so that in the end I'm left with just the text and nothing else. Try as I might, the above doesn't appear to work: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and particularly links, which aren't getting removed despite the arguments I pass in remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml? I thought this was the way to go for HTML parsing in Python.

sadhu_
  • I'm not able to replicate the problem using http://stackoverflow.com/questions/2950131/python-lxml-cleaning-out-html-tags/2950223#2950223 as input. Could you provide a sample of the html and the desired output? – unutbu Jun 01 '10 at 16:45
  • @unutbu: this is most strange - I have a whole database where that code did not work, and yet it seems to be working just fine now (did you do something? :)). But while I'm at it, any idea how you could also take the link text out when removing the link? At the moment it leaves the text of the links in. – sadhu_ Jun 01 '10 at 18:05
  • @sadhu_: `remove_tags` removes only the tags themselves; it leaves their children and text. Use `kill_tags` to remove the whole subtree. – jfs Oct 31 '11 at 15:44
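As a minimal sketch of that difference (the HTML fragment here is made up; Python 3 syntax):

```python
# remove_tags strips only the tags, keeping the link text;
# kill_tags drops the element together with its text.
from lxml.html.clean import Cleaner

html = '<div>Hello <a href="http://example.com">world</a>!</div>'

print(Cleaner(remove_tags=['a']).clean_html(html))  # the text "world" survives
print(Cleaner(kill_tags=['a']).clean_html(html))    # "world" is gone as well
```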

3 Answers

15

The solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print document.text_content()

but this one helped me, concatenating the way I needed:

   from lxml import etree
   print "\n".join(etree.XPath("//text()")(document))
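A concrete comparison of the two approaches (Python 3 syntax; the two-paragraph fragment is made up for illustration):

```python
import lxml.html
from lxml import etree

doc = lxml.html.document_fromstring('<p>one</p><p>two</p>')

# text_content() runs the pieces together with no separator
print(doc.text_content())                       # onetwo
# joining the individual text nodes keeps them separable
print("\n".join(etree.XPath("//text()")(doc)))  # one / two on separate lines
```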
Robert Lujo
13

Not sure if this method existed around the time you asked your question, but if you go through

document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()

That should return all the text content of the HTML document, minus all the markup.

David
  • Check out Robert's answer below - a link for the lazy: http://stackoverflow.com/a/23929354/9908 – David Sep 14 '14 at 22:18
5

I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(page).findAll(text=True))

Where `page` is your string of HTML.
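With the newer `bs4` package (an assumption on my part; the answer above uses the older BeautifulSoup 3 import), the same idea is a one-liner via `get_text()`:

```python
from bs4 import BeautifulSoup

# made-up fragment standing in for your `page` string
page = '<div>Hello <a href="#">world</a>!</div>'
print(BeautifulSoup(page, 'html.parser').get_text())  # Hello world!
```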

Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.

Bill the Lizard
KushalP
    It seems BS is deprecated (and googling seems to suggest lxml is the way forward), so ideally I wanted to learn some lxml [as the documentation is mildly bewildering]. – sadhu_ Jun 01 '10 at 18:07
  • BS rocks! With 4.0 rc out (a few months ago) you can use the parser from `lxml` or `html5lib` and wrap them in the nice BS api. – Sergio May 17 '11 at 00:07