
I'm looking to take an HTML page and extract just the pure text from it. Does anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with only the text of the articles and whatever other text sits between tags. JS, CSS, etc. gone.

thanks!

James

6 Answers


The first answer here doesn't remove the contents of style or script tags if they are embedded in the page (rather than linked). This might get closer:

import re

def stripTags(text):
    # DOTALL lets the non-greedy matches span multi-line scripts and styles
    scripts = re.compile(r'<script.*?/script>', re.DOTALL)
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)

    return text
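For example, applied to a small page (restating the function with the DOTALL flag so the snippet is self-contained; the sample HTML string is my own):

```python
import re

def stripTags(text):
    # DOTALL so the non-greedy matches can span multi-line scripts and styles
    scripts = re.compile(r'<script.*?/script>', re.DOTALL)
    css = re.compile(r'<style.*?/style>', re.DOTALL)
    tags = re.compile(r'<.*?>', re.DOTALL)

    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)
    return text

page = ('<html><head><style>p { color: red; }</style></head>'
        '<body><p>Hello</p><script>var x = 1;</script></body></html>')
print(stripTags(page))  # -> Hello
```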
g.d.d.c

You could try the rather excellent Beautiful Soup:

import BeautifulSoup  # BeautifulSoup 3; in bs4 the import is: from bs4 import BeautifulSoup

with open("my_source.html", "r") as f:
    s = f.read()

soup = BeautifulSoup.BeautifulSoup(s)
txt = soup.body.getText()
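For reference, the same idea with the current BeautifulSoup 4 package (note the changed import path and method name; the inline HTML here is just a stand-in for the file contents):

```python
from bs4 import BeautifulSoup

s = '<html><body><h1>Title</h1><p>Body text</p></body></html>'
soup = BeautifulSoup(s, 'html.parser')
# separator adds whitespace between text nodes; strip trims each one
txt = soup.body.get_text(separator=' ', strip=True)
print(txt)  # -> Title Body text
```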

But be warned: what you get back from any parsing attempt will be subject to 'mistakes': bad HTML, bad parsing, and just generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's just general stuff found "out on the internet", then expect all kinds of weird and wonderful outliers.

pycruft
  • I tried to use Beautiful Soup, but a high percentage of the time it exception'd out due to bad HTML, which is no bueno – James Jun 05 '10 at 14:29

As per here:

import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, the re module needs to be imported in order to use regular expressions.
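A quick usage sketch (the sample strings are mine) also shows the limitation the other answer points out: the regex removes the tags themselves but not the script body between them:

```python
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

print(remove_html_tags('<p>Hello <b>world</b></p>'))  # -> Hello world
print(remove_html_tags('<script>var x;</script>'))    # -> var x;
```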

Oren Hizkiya

The lxml.html module is worth considering. However, it takes a bit of massaging to remove the CSS and JavaScript:

from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop style, script, and comment nodes before extracting text
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)

    for line in source.itertext():
        if line.strip():
            yield line

The yielded lines can simply be concatenated, but that can lose significant word boundaries if there isn't any whitespace around whitespace-generating tags.

You might also want to iterate over just the <body> tag, depending on your requirements.
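For instance, joining the yielded chunks with a space (a sketch with an inline HTML string of my own) keeps adjacent words from running together across tag boundaries:

```python
from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop style, script, and comment nodes before extracting text
    for item in source.xpath("//style|//script|//comment()"):
        item.getparent().remove(item)
    for line in source.itertext():
        if line.strip():
            yield line

page = '<html><body><p>one</p><p>two</p><script>var x;</script></body></html>'
print(' '.join(stripsource(page)))  # -> one two
```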

eswald

I would also recommend BeautifulSoup, but I would use something like the answer to this question, which I'll copy here for those who don't want to look there:

import re
import BeautifulSoup  # BeautifulSoup 3; in bs4: from bs4 import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)

def visible(element):
    # Skip text that lives inside non-content containers
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)

I tried it on this page for instance and it worked quite well.
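A self-contained version against BeautifulSoup 4 might look like the following (my own sample HTML; note that in bs4 findAll(text=True) becomes find_all(string=True), and comments are more reliably detected via the Comment class than with the regex):

```python
from bs4 import BeautifulSoup, Comment

def visible(element):
    # Skip text inside non-content containers
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments (bs4 exposes them as Comment strings)
    if isinstance(element, Comment):
        return False
    return True

html = ('<html><head><title>t</title><style>p {}</style></head>'
        '<body><p>Visible text</p><!-- hidden --><script>var x;</script></body></html>')
soup = BeautifulSoup(html, 'html.parser')
texts = soup.find_all(string=True)
visible_texts = [t.strip() for t in texts if visible(t) and t.strip()]
print(visible_texts)  # -> ['Visible text']
```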

Justin Peel

This was the cleanest and simplest solution I found to strip CSS and JavaScript:

''.join(BeautifulSoup(content).findAll(text=lambda text:
    text.parent.name != "script" and
    text.parent.name != "style"))

https://stackoverflow.com/a/3002599/1203188 by Matthew Flaschen
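Ported to BeautifulSoup 4 (find_all/string rather than findAll/text; the content string is my own):

```python
from bs4 import BeautifulSoup

content = '<body><p>keep</p><style>p {}</style><script>drop();</script></body>'
# Keep only text nodes whose parent is neither a script nor a style element
text = ''.join(BeautifulSoup(content, 'html.parser').find_all(
    string=lambda t: t.parent.name not in ('script', 'style')))
print(text)  # -> keep
```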
