regular expression to extract text from HTML

Question

I would like to extract all the text (displayed or not) from a general HTML page.

I would like to remove:

any HTML tags
any JavaScript
any CSS styles

Is there a regular expression (one or more) that will achieve that?

See http://stackoverflow.com/questions/37486/filter-out-html-tags-and-resolve-entities-in-python, also. — S.Lott, Oct 08 '08 at 09:52
[Beware of Zalgo](http://stackoverflow.com/a/1732454/135078) — Kelly S. French, Jan 12 '12 at 22:49

score 22 · Answer 1 · answered Oct 08 '08 at 01:53

Remove JavaScript and CSS:

<(script|style).*?</\1>

Remove tags:

<.*?>
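
As a rough Python sketch of how the two patterns above might be applied (the names `SCRIPT_STYLE`, `TAG` and `strip_markup` are made up for illustration, and the `\b` and `\s*` additions are small assumptions of mine): the dot has to match newlines and the tag names should match in any case, so the sketch passes `re.DOTALL` and `re.IGNORECASE`. It is a naive approach and shares the limitations discussed in the other answers.

```python
import re

# Script/style elements including their content; \1 refers back to the tag name.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
# Any remaining tag, non-greedy, allowed to span lines.
TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html: str) -> str:
    """Remove script/style blocks first, then every remaining tag."""
    without_blocks = SCRIPT_STYLE.sub("", html)
    return TAG.sub("", without_blocks)


print(strip_markup("<p>Hi <b>there</b></p><script>\nvar x = 1;\n</script>"))
```

The order matters: if the tags were stripped first, the script and style content would be left behind as plain text.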

nickf

/<(.|\n)*?>/g will take you to paradise city. – FUD May 25 '12 at 08:00

S.Lott · Accepted Answer · 2009-05-28T02:00:25.347

14

You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs like &lt;text&gt; render in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
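
To make the parser route concrete, here is a minimal sketch using `html.parser` from the Python standard library rather than Beautiful Soup (that substitution is mine; the class name `TextExtractor` and the helper `extract_text` are hypothetical). It collects text nodes and skips anything inside script or style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(extract_text("<p>Hello &amp; <b>bye</b></p><script>if (a < b) {}</script>"))
```

The parser decides where tags start and end, so a `<` inside a script does not confuse the extraction the way it can with a pattern such as `<.*?>`.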

edited May 28 '09 at 02:00

answered Oct 08 '08 at 02:01

Definitely use a specialized HTML parser - don't roll your own! I just wanted to suggest Hpricot if you're using Ruby. – Neall Oct 08 '08 at 02:52
Why should &lt;text&gt; baffle a RE? Most would just be set up to ignore it, which is correct: it's text, not HTML. If it's because they parse HTML entities (a good idea I suppose) you should be doing that on the text AFTER your RE's, not on the HTML anyway... – Matthew Scharley Oct 08 '08 at 10:19
@monoxide: My point is not that it's impossible. My point is that you can save a lot of debugging of RE's by using someone else's parser that handles all the edge cases correctly. – S.Lott Oct 08 '08 at 12:36
+1, but I think the point about malformed HTML is irrelevant here: since we specifically aren't trying to parse the HTML, it's OK to have a regex which just pulls out anything that looks like a tag, regardless of structure. – annakata Dec 08 '08 at 11:21
@annakata: "pulling out anything which looks like a tag" more-or-less IS parsing. Because HTML is a language that is more complex than RE's are designed to describe, parsing is about the only way to find anything in HTML. RE's are always defeated except in trivial cases. – S.Lott Dec 08 '08 at 11:25
BeautifulSoup uses regexes to parse HTML, so it is easily fooled. http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value – jfs Feb 01 '09 at 08:17

score 7 · Answer 3 · answered Dec 26 '12 at 17:04

I needed a regex solution (in PHP) that would return the plain text just as well as (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:

function plaintext($html)
{
    // Remove comments and everything inside them (strip_tags only removes the tags themselves).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // Put a space before each closing list-item tag so adjacent items do not run together.
    $plaintext = preg_replace('#</li>#i', ' </li>', $plaintext);

    // Remove script and style elements together with their content.
    $plaintext = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $plaintext);

    // Replace br tags with a space (strip_tags would remove them without leaving a separator).
    $plaintext = preg_replace('#<br\b[^>]*>#i', ' ', $plaintext);

    // Remove all remaining HTML.
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}
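
For readers outside PHP, the same sequence of steps might look roughly like this in Python. This is a sketch, not a port that has been benchmarked: Python has no `strip_tags`, so a naive `<[^>]+>` pattern stands in for it, and the final whitespace collapse is an extra step that the PHP function does not perform.

```python
import re


def plaintext(html: str) -> str:
    # 1. Drop comments and their content.
    text = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # 2. Turn closing list-item tags into a space so items stay separated.
    text = re.sub(r"</li\s*>", " ", text, flags=re.IGNORECASE)
    # 3. Drop script and style elements together with their content.
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1\s*>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # 4. Replace line-break tags with a space.
    text = re.sub(r"<br\b[^>]*>", " ", text, flags=re.IGNORECASE)
    # 5. Naive stand-in for strip_tags: remove anything that looks like a tag.
    text = re.sub(r"<[^>]+>", "", text)
    # 6. Extra step (not in the PHP version): collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(plaintext("<ul><li>one</li><li>two</li></ul><!-- hidden --><style>li{}</style>a<br/>b"))
```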

When I tested this on some complicated sites (forums seem to contain some of the tougher html to parse), this method returned the same result as PHPSimpleDOM plaintext, only much, much faster. It also handled the list items (li tags) properly, where PHPSimpleDOM did not.

As for the speed:

SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.

37 times faster!

Best solution by far! Easy to use! Thanks so much! – Joe Nov 04 '15 at 04:02

Chris Noe · Answer 4 · 2008-10-08T12:38:51.773

4

Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:

//body//text()[not(ancestor::script)][not(ancestor::style)]
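
Evaluating that expression directly needs an XPath engine that supports `text()` and the `ancestor` axis; Python's standard `xml.etree.ElementTree` supports only a limited XPath subset without them. The sketch below therefore walks the tree by hand to get the same result for well-formed XHTML. The helper names (`local_name`, `text_nodes`, `body_text`) are hypothetical, and ignoring the XHTML namespace is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def text_nodes(element):
    """Yield the text nodes under element, skipping script/style subtrees."""
    if local_name(element.tag) in SKIPPED:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from text_nodes(child)
        # The tail is text that follows the child but belongs to this element.
        if child.tail:
            yield child.tail


def body_text(xhtml: str):
    root = ET.fromstring(xhtml)  # raises ParseError on malformed markup
    body = next((el for el in root.iter() if local_name(el.tag) == "body"), None)
    if body is None:
        return []
    return [t for t in text_nodes(body) if t.strip()]


page = ('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>T</title></head>'
        '<body><p>Hello <b>world</b>!</p><script>var x = 1;</script></body></html>')
print(body_text(page))
```

As the comments below point out, this only works when the input parses as XML, which most real-world HTML does not.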

edited Oct 08 '08 at 12:38

answered Oct 08 '08 at 01:53

Simple and Elegant == Beautiful. – Pablo Fernandez Oct 08 '08 at 01:56
That would probably work, except that it would also return text (ie. code) from within <script> tags – Kibbee Oct 08 '08 at 02:00
True enough, see edit. There may be other special cases, but that's the general idea. – Chris Noe Oct 08 '08 at 02:19
Will not work on real world HTML pages, ie the HTML is malformed non-XHTML. Most XML parsers don't support "real-world HTML". That's why I've used HtmlAgilityPack (Google it) for exactly this type of task in the past. – Ash Apr 29 '09 at 08:42
Indeed, that is a consistent pain. Another option is to pre-process the page with tidy. – Chris Noe Apr 29 '09 at 18:17

score 2 · Answer 5 · answered Oct 08 '08 at 01:51

Using Perl syntax for defining the regexes, a start might be:

m!<body.*?>(.*)</body>!smi

Then apply the following replacements to the result of that group:

s!<script.*?</script>!!smig
s!<[^>]+/[ \t]*>!!smig
s!</?([a-z]+).*?>!!smig
s/<!--.*?-->//smig

This of course won't format things nicely as a text file, but it strips out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text out of that.

score 2 · Answer 6 · answered Apr 21 '10 at 19:04

The simplest way for simple HTML (example in Python):

import re

text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
# Split into tags and text runs, then keep only the text runs.
" ".join(t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t)

Returns this:

'This is my> example HTML, containing tags'
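
To see why this only suits simple HTML, it helps to look at the individual tokens. The sketch below wraps the same pattern in a hypothetical helper, `text_tokens`, and feeds it an attribute value containing `>`, which the pattern treats as the end of the tag.

```python
import re

# Same pattern as above: either a tag or a run of characters without "<".
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def text_tokens(html: str):
    """Return the stripped text runs, dropping anything that starts a tag."""
    return [t.strip() for t in TOKEN.findall(html) if "<" not in t]


print(text_tokens("<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"))
# The ">" inside the title attribute ends the "tag" early,
# so the rest of the attribute leaks into the text.
print(text_tokens('<a title="a > b">link</a>'))
```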

score 2 · Answer 7 · answered Jan 09 '11 at 10:14

Here's a function that removes even the most complex HTML tags.

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',

            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern above: a space for each invisible
            // element, then a newline in front of each block-level tag.
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}
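
The part of this function that differs from the other regex answers is the newline inserted in front of block-level tags, which keeps paragraphs and list items on separate lines once the tags are gone. A small Python sketch of just that idea, with a shortened tag list and hypothetical names (`BLOCK_TAG`, `ANY_TAG`, `text_with_breaks`); the naive `<[^>]+>` pattern stands in for `strip_tags`, and dropping blank lines at the end is my own addition:

```python
import re

# Opening or closing block-level tags (a shortened list, for illustration only).
BLOCK_TAG = re.compile(
    r"</?(?:br|hr|p|div|pre|h[1-6]|li|ul|ol|table|tr|td|th|blockquote)\b",
    re.IGNORECASE,
)
ANY_TAG = re.compile(r"<[^>]+>")


def text_with_breaks(html: str) -> str:
    # Put a newline in front of every block-level tag, then drop all tags.
    marked = BLOCK_TAG.sub(lambda m: "\n" + m.group(0), html)
    stripped = ANY_TAG.sub("", marked)
    # Tidy up: trim each line and discard the empty ones.
    lines = (line.strip() for line in stripped.splitlines())
    return "\n".join(line for line in lines if line)


print(text_with_breaks("<div>one</div><p>two <b>bold</b></p>"))
```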

score 1 · Answer 8 · answered Oct 08 '08 at 01:51

If you're using PHP, try Simple HTML DOM, available at SourceForge.

Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as & (which is &amp;).

Also, watch out for comments and JavaScript. I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
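
On the special-characters point: entities such as `&amp;` can be resolved with `html.unescape` from the Python standard library, but the order of operations matters. The sketch below uses two hypothetical helpers and a naive tag pattern to show what can go wrong when entities are decoded before the tags are stripped.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # naive tag pattern, for illustration only


def strip_then_decode(markup: str) -> str:
    """Remove tags first, then resolve entities in the remaining text."""
    return html.unescape(TAG.sub("", markup))


def decode_then_strip(markup: str) -> str:
    """Wrong order: decoded '<' and '>' are then mistaken for tags."""
    return TAG.sub("", html.unescape(markup))


source = "<p>1 &lt; 2 and 3 &gt; 2</p>"
print(strip_then_decode(source))  # text survives intact
print(decode_then_strip(source))  # part of the text is swallowed
```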

score 1 · Answer 9 · answered Oct 08 '08 at 02:38

I believe you can just do

document.body.innerText

Which will return the content of all text nodes in the document, visible or not.

[edit (olliej): sigh nevermind, this only works in Safari and IE, and i can't be bothered downloading a firefox nightly to see if it exists in trunk :-/ ]

olliej

Nope, that is undefined in FF3 – Chris Noe Oct 08 '08 at 12:49
textContent is a standard equivalent – Kornel Oct 12 '08 at 19:55

score 1 · Answer 10 · answered Oct 01 '11 at 13:59

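Can't you just use the WebBrowser control available with C#?

// Requires a reference to System.Windows.Forms and an STA thread.
// DocumentText loads asynchronously, so in a real application read the
// text from the DocumentCompleted event handler rather than immediately.
System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);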

score 1 · Answer 11 · answered Feb 03 '12 at 05:54

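// html holds the page source as a string (for example, the contents of your HTML file).
string decode = System.Web.HttpUtility.HtmlDecode(html);
// Regex needs: using System.Text.RegularExpressions;
Regex objRegExp = new Regex("<(.|\n)+?>");
string replace = objRegExp.Replace(decode, "");
// k is any additional substring to remove; it is not defined in the original snippet.
replace = replace.Replace(k, string.Empty);
// Trim returns a new string, so the result has to be assigned.
replace = replace.Trim("\t\r\n ".ToCharArray());

Then add a label and set label.Text = replace; to see the output on the label.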

mahesh

instead of "g" put in code of line: string replace = objRegExp.Replace(decode, ""); – mahesh Feb 03 '12 at 05:58