22

I would like to extract all the text (displayed or not) from a general HTML page.

I would like to remove

  • any HTML tags
  • any JavaScript
  • any CSS styles

Is there a regular expression (one or more) that will achieve that?

Charles Stewart
Ron Harlev

11 Answers

22

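Remove JavaScript and CSS (enable the case-insensitive and dot-matches-newline flags so that multi-line blocks are caught):

<(script|style)\b.*?</\1\s*>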

Remove the remaining tags:

<.*?>
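
As a minimal sketch of how these two patterns might be applied, here they are in Python. The function name strip_html, the sample input, and the choice of flags are assumptions made for illustration; the closing `\s*>` on the first pattern is added so that the end tag is consumed in full.

```python
import re

# Assumed flags: DOTALL lets "." cross newlines, IGNORECASE also matches <SCRIPT>.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_html(html: str) -> str:
    """Hypothetical helper: drop script/style blocks, then every remaining tag."""
    without_code = SCRIPT_STYLE.sub("", html)
    return TAG.sub("", without_code)


sample = "<p>Hello <b>world</b></p><script>\nvar x = 1;\n</script><style>p{}</style>"
print(strip_html(sample))
```

The order matters: if the tags were stripped first, the script and style bodies would be left behind as plain text.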
nickf
14

You can't really parse HTML with regular expressions; the language is too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, such as &lt;text>, render in a browser as ordinary text but might baffle a naive RE.
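
To make these failure modes concrete, here is a small Python sketch. The helper name naive_strip and the sample strings are made up for illustration, and the pattern is the simple "anything between angle brackets" regex rather than any particular library's.

```python
import html
import re

TAG = re.compile(r"<.*?>", re.DOTALL)


def naive_strip(markup: str) -> str:
    """Hypothetical helper: remove anything that looks like a tag."""
    return TAG.sub("", markup)


# A ">" inside a quoted attribute value ends the match too early,
# so part of the attribute leaks into the output.
print(repr(naive_strip('<a title="x > y">link</a>')))

# A CDATA section is cut at its first ">", leaving the rest behind.
print(repr(naive_strip("<![CDATA[ a > b ]]>")))

# Decoding entities BEFORE stripping turns literal text into a fake tag,
# which the regex then deletes.
print(repr(naive_strip(html.unescape("a &lt;text> b"))))
```

Each case can be patched with a more elaborate pattern, but every patch tends to expose the next edge case.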

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.


Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
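
For comparison, here is a rough sketch of the parser-based approach. It uses Python's standard-library html.parser instead of Beautiful Soup so that it runs without extra packages; the class name TextExtractor and the function extract_text are hypothetical, and joining text nodes with a space is a simplification.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


page = '<a title="x > y">link</a><script>var a = "<b>";</script><p>A &amp; B</p>'
print(extract_text(page))
```

Because text nodes are joined with a space, a word split by inline markup (for example "wor" followed by a bold "ld") would come out with a gap; a real extractor would need to treat inline and block elements differently.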

S.Lott
  • Definitely use a specialized HTML parser - don't roll your own! I just wanted to suggest Hpricot if you're using Ruby. – Neall Oct 08 '08 at 02:52
  • Why should &lt;text> baffle a RE? Most would just be set up to ignore it, which is correct: it's text, not HTML. If it's because they parse HTML entities (a good idea, I suppose), you should be doing that on the text AFTER your RE's, not on the HTML anyway... – Matthew Scharley Oct 08 '08 at 10:19
  • @monoxide: My point is not that it's impossible. My point is that you can save a lot of debugging of RE's by using someone else's parser that handles all the edge cases correctly. – S.Lott Oct 08 '08 at 12:36
  • +1, but I think the point about malformed HTML is irrelevant here: since we specifically aren't trying to parse the HTML, it's OK to have a regex which just pulls out anything that looks like a tag, regardless of structure. – annakata Dec 08 '08 at 11:21
  • @annakata: "pulling out anything which looks like a tag" more-or-less IS parsing. Because HTML is a language that is more complex than RE's are designed to describe, parsing is about the only way to find anything in HTML. RE's are always defeated except in trivial cases. – S.Lott Dec 08 '08 at 11:25
  • BeautifulSoup uses regexes to parse HTML, so it is easily fooled. http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value – jfs Feb 01 '09 at 08:17
7

I needed a regex solution (in PHP) that would return the plain text as well as (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:

function plaintext($html)
{
    // Remove comments and everything inside them.
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // Put a space before each closing </li> so list items stay separated once the tags are stripped.
    $plaintext = preg_replace('#</li>#i', ' </li>', $plaintext);

    // Remove script and style elements together with their content.
    $plaintext = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $plaintext);

    // Replace <br> tags with a space (strip_tags would remove them without leaving a separator).
    $plaintext = preg_replace('#<br\b[^>]*>#i', ' ', $plaintext);

    // Strip all remaining tags.
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

When I tested this on some complicated sites (forums seem to contain some of the tougher HTML to parse), this method returned the same result as PHPSimpleDOM's plaintext, only much, much faster. It also handled list items (li tags) properly, where PHPSimpleDOM did not.

As for the speed:

  • SimpleDom: 0.03248 sec.
  • RegEx: 0.00087 sec.

Roughly 37 times faster in this test!

Joe Bergevin