
Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all being taken up by python.

Why? Because I had it go through a 250-meg file to remove markup. 250 megs, that's it! I've been manipulating these files in python with so many other modules and things; BeautifulSoup is the first code to give me any problems with something so small. How does it take nearly 4 gigs of RAM to manipulate 250 megs of HTML?

The one-liner that I found (on Stack Overflow) and have been using is this:

''.join(BeautifulSoup(corpus).findAll(text=True))

Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.

Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?

WaxProlix

2 Answers


lxml.html is FAR more efficient.

http://lxml.de/lxmlhtml.html

[chart: parser performance comparison]

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Looks like this will do what you want.

import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
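As a concrete sketch (the sample HTML here is made up, not from the question), `text_content()` returns all the text nodes of the subtree with the markup stripped:

```python
import lxml.html

# hypothetical sample document standing in for the 250-meg corpus
doc = lxml.html.fromstring(
    "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
)

# text_content() concatenates every text node under the element,
# with all tags removed
text = doc.text_content()
```

Note that `text_content()` does not insert whitespace between block elements, so adjacent tags can run their text together.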

A couple of other similar questions:

python [lxml] - cleaning out html tags

lxml.etree, element.text doesn't return the entire text from an element

Filter out HTML tags and resolve entities in python

UPDATE:

You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content()

from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')
tree = clean_html(tree)

text = tree.getroot().text_content()

(From: Remove all html in python?)
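Since the original complaint was memory, it is also worth noting that lxml can stream. Here is a sketch (the function name and file handling are mine, not from the answer) using `lxml.etree.iterparse` with `html=True`, which lets you discard elements as you go instead of holding the whole tree in RAM:

```python
from lxml import etree

def stream_text(path):
    """Collect text from a large HTML file without building the full tree."""
    parts = []
    # "end" events fire as each element closes; html=True tolerates messy markup
    for _, elem in etree.iterparse(path, events=("end",), html=True):
        if elem.tag not in ("script", "style") and elem.text:
            parts.append(elem.text)
        if elem.tail:
            parts.append(elem.tail)
        elem.clear()  # free the element's memory as soon as it is processed
    # caveat: with nested tags the pieces come out in end-tag order,
    # not strict document order
    return " ".join(parts)
```

This trades exact text ordering for a roughly flat memory profile, which may be the right trade for a one-off markup-stripping pass.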

Acorn
  • Hi, I'm actually looking into that right now. lxml.html doesn't seem to have a straightforward "remove all markup and leave text content" option; I'm downloading lxml as we speak, maybe .parse() is what I'm looking for? Anyway, thanks a bunch. If a simple help(lxml.html.parse) will solve it then, woo. -- even if not, thanks a ton for the input. – WaxProlix Jan 24 '11 at 12:33
  • @WaxProlix, a simple `.text_content()` should be all that you need, with a cleaning first for good measure :) – Acorn Jan 24 '11 at 13:08
  • +1 for adding the `clean_html` step. If you skip it, then your output will look pretty bad. For example: `lxml.html.fromstring("foo <script>bar</script>").text_content()` will yield `foo bar`, when you probably expect `foo`. – speedplane Feb 17 '14 at 07:57

Use the Cleaner from lxml.html:

>>> import lxml.html
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(style=True)  # scripts, objects and comments go by default; style=True drops <style> too
>>> body = lxml.html.fromstring(content).xpath('//body')[0]
>>> print(lxml.html.tostring(cleaner.clean_html(body), encoding='unicode'))
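A minimal, self-contained run of the same idea (the sample markup is made up, and `tostring` is my addition to serialize the cleaned element back to text; note that in recent lxml releases the clean module has moved to the separate `lxml-html-clean` package):

```python
import lxml.html
from lxml.html.clean import Cleaner

# hypothetical input; scripts and comments are stripped by default,
# and style=True also removes <style> blocks
page = "<html><body><script>var x=1;</script><p>kept text</p></body></html>"

cleaner = Cleaner(style=True)
body = lxml.html.fromstring(page).xpath('//body')[0]
cleaned = lxml.html.tostring(cleaner.clean_html(body), encoding="unicode")
```

Without `tostring`, printing the result of `clean_html` on an element just shows the element's repr rather than the cleaned markup.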
virhilo