
I'm looking to take an HTML page and extract just the pure text from it. Does anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with only the text of the articles and whatever other text sits between tags. JS, CSS, etc. gone.

thanks!

James

6 Answers


The first answer here doesn't remove the contents of style or script tags if they are embedded in the page (rather than linked). This might get closer:

import re

def stripTags(text):
    # DOTALL lets the non-greedy matches span multi-line scripts and styles
    scripts = re.compile(r'<script.*?/script>', re.DOTALL)
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)

    return text
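For example, applied to a small page (restating the function with the DOTALL flag so the snippet is self-contained; the sample HTML string is my own):

```python
import re

def stripTags(text):
    # DOTALL so the non-greedy matches can span multi-line scripts and styles
    scripts = re.compile(r'<script.*?/script>', re.DOTALL)
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)
    return text

page = ('<html><head><style>p { color: red; }</style></head>'
        '<body><p>Hello</p><script>var x = 1;</script></body></html>')
print(stripTags(page))  # -> Hello
```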
g.d.d.c

You could try the rather excellent Beautiful Soup:

import BeautifulSoup  # BeautifulSoup 3; in bs4 the import is: from bs4 import BeautifulSoup

with open("my_source.html", "r") as f:
    s = f.read()

soup = BeautifulSoup.BeautifulSoup(s)
txt = soup.body.getText()
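For reference, the same idea with the current BeautifulSoup 4 package (note the changed import path and method name; the inline HTML here is just a stand-in for the file contents):

```python
from bs4 import BeautifulSoup

s = '<html><body><h1>Title</h1><p>Body text</p></body></html>'
soup = BeautifulSoup(s, 'html.parser')
# separator adds whitespace between text nodes; strip trims each one
txt = soup.body.get_text(separator=' ', strip=True)
print(txt)  # -> Title Body text
```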

But be warned: what you get back from any parsing attempt will be subject to 'mistakes': bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's just general stuff found "out on the internet", then expect all kinds of weird and wonderful outliers.

pycruft
  • I tried to use Beautiful Soup, but a high percentage of the time it exception'd out due to bad HTML, which is no bueno – James Jun 05 '10 at 14:29

As per here:

import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, the re module needs to be imported in order to use regular expressions.
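A quick usage sketch (the sample strings are mine) also shows the limitation the other answer points out: the regex removes the tags themselves but not the script body between them:

```python
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

print(remove_html_tags('<p>Hello <b>world</b></p>'))  # -> Hello world
print(remove_html_tags('<script>var x;</script>'))    # -> var x;
```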

Oren Hizkiya

The lxml.html module is worth considering. However, it takes a bit of massaging to remove the CSS and JavaScript:

from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop style, script, and comment nodes before extracting text
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)

    for line in source.itertext():
        if line.strip():
            yield line

The yielded lines can simply be concatenated, but that can lose significant word boundaries if there isn't any whitespace around whitespace-generating tags.

You might also want to iterate over just the <body> tag, depending on your requirements.
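For instance, joining the yielded chunks with a space (a sketch with an inline HTML string of my own) keeps adjacent words from running together across tag boundaries:

```python
from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop style, script, and comment nodes before extracting text
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)
    for line in source.itertext():
        if line.strip():
            yield line

page = '<html><body><p>one</p><p>two</p><script>var x;</script></body></html>'
print(' '.join(stripsource(page)))  # -> one two
```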

eswald

I would also recommend BeautifulSoup, but I would use something like the answer to this question, which I'll copy here for those who don't want to look there:

import re
import BeautifulSoup  # BeautifulSoup 3; in bs4: from bs4 import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible(element):
    # Skip text that lives inside non-content containers
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

I tried it on this page for instance and it worked quite well.
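A self-contained version against BeautifulSoup 4 might look like the following (my own sample HTML; note that in bs4 findAll(text=True) becomes find_all(string=True), and comments are more reliably detected via the Comment class than with the regex):

```python
from bs4 import BeautifulSoup, Comment

def visible(element):
    # Skip text inside non-content containers
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments (bs4 exposes them as Comment strings)
    if isinstance(element, Comment):
        return False
    return True

html = ('<html><head><title>t</title><style>p {}</style></head>'
        '<body><p>Visible text</p><!-- hidden --><script>var x;</script></body></html>')
soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)
visible_texts = [t.strip() for t in texts if visible(t) and t.strip()]
print(visible_texts)  # -> ['Visible text']
```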

Justin Peel

This was the cleanest and simplest solution I found to strip CSS and JavaScript:

''.join(BeautifulSoup(content).findAll(text=lambda text:
    text.parent.name != "script" and
    text.parent.name != "style"))

https://stackoverflow.com/a/3002599/1203188 by Matthew Flaschen
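Ported to BeautifulSoup 4 (find_all/string rather than findAll/text; the content string is my own):

```python
from bs4 import BeautifulSoup

content = '<body><p>keep</p><style>p {}</style><script>drop();</script></body>'
# Keep only text nodes whose parent is neither a script nor a style element
text = ''.join(BeautifulSoup(content, 'html.parser').find_all(
    string=lambda t: t.parent.name not in ('script', 'style')))
print(text)  # -> keep
```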
