
I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:

  • It's not known ahead of time whether a document contains HTML at all.
  • More likely than not, any HTML will be very poorly formatted.
  • Individual documents might be very large, perhaps hundreds of megabytes.
  • Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of `<.+/?>` are a no-go; see the sketch just after this list. (And stripping XML is less desirable, anyway.)
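
To illustrate the false-positive problem, a minimal C# sketch with an invented input:

    using System;
    using System.Text.RegularExpressions;

    class NaiveStripDemo
    {
        static void Main()
        {
            // Invented input: non-HTML text that happens to contain angle brackets.
            string text = "Costs rise when price < 10 but fall when demand > 20.";

            // The naive pattern eats everything from the first '<' to the
            // last '>', destroying legitimate content along the way.
            string stripped = Regex.Replace(text, "<.+/?>", "");

            Console.WriteLine(stripped); // Costs rise when price  20.
        }
    }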

I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.

I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.

Using regular expressions seems like the obvious candidate. But then I remember this famous answer, and it makes me worry that it's not such a great idea. That diatribe's points are very focused on parsing, though, and not necessarily on dumb tag-stripping. So are regexes OK for this purpose?

Assuming it isn't a terrible idea, suggestions for regex that would do a good job are very welcome.

Sean U
  • We've all read that masterpiece ... I'm talking about the answer you cited :) ... anyway, we also know it's really difficult to speak in general terms about using regular expressions to handle HTML data. I don't clearly understand what you mean by "I just want (most of) the tags to go away." Do you mean specific tags fitting specific criteria, and all their contents? – Diego D Aug 04 '12 at 15:34
  • Since a regex pattern cannot describe hierarchies, you can only use them to implement your own parsing strategies: extract portions of text with a regular expression and delegate the decision to keep or remove each one from the final result. I have no idea about performance... maybe not good. – Diego D Aug 04 '12 at 15:41
  • @DiegoDeVita What I mean is that I need to strip HTML tags out of the stream and leave the rest of the content as-is. I say "most of" because 100% reliability is not necessary. If the odd browser-specific tag makes it through that's fine, because the data's being passed to software that's designed to accept noisy data. – Sean U Aug 04 '12 at 18:06
  • It depends on how badly formatted the HTML could possibly be. Something like ` – Gabber Aug 16 '12 at 14:42
  • @Gabber Much better to reject (and not strip) the ` – Sean U Aug 17 '12 at 18:53
  • Understood. The solution basically would then be something like *find any `</?(a|href|div|anypossibletagname)[^<]+?>` and replace it with nothing*, hoping not to find something like `the little boy said` inside a tag, correct? – Gabber Aug 20 '12 at 07:46
  • @Gabber - Correct. That's roughly the path I'm experimenting with right now. And to defend against the larger files, I think I can just split the input text into smaller chunks that can be processed separately. With some logic to help ensure breaks aren't made mid-tag, of course (roughly as sketched below). – Sean U Aug 20 '12 at 16:28
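
A rough C# sketch of that chunking idea (the chunk size and the break-just-past-`>` heuristic are inventions for illustration, not part of the discussion above):

    using System;
    using System.Collections.Generic;

    static class Chunker
    {
        // Split text into pieces of roughly maxChunk characters, pushing each
        // break point forward to just past the next '>' so a tag is never
        // cut in half. (Heuristic: assumes a '>' eventually appears.)
        public static IEnumerable<string> Chunks(string text, int maxChunk)
        {
            int start = 0;
            while (start < text.Length)
            {
                int end = Math.Min(start + maxChunk, text.Length);
                if (end < text.Length)
                {
                    int close = text.IndexOf('>', end);
                    end = close >= 0 ? close + 1 : text.Length;
                }
                yield return text.Substring(start, end - start);
                start = end;
            }
        }
    }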

2 Answers


This regex finds all tags while avoiding angle brackets inside quoted attribute values:

<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>

It isn't able to detect escaped quotes inside quotes (but I think that's unnecessary in HTML).

Having the list of all allowed tags and substituting it into the first part of the regex, like `<(tag1|tag2|...)`, could lead to a more precise solution. I'm afraid an exact solution can't be found given your assumption about angle brackets; think, for example, of something like `<a href="test.html"> b<a </a>`...
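
If it helps, a minimal C# sketch of applying that pattern (the wrapper class and `RegexOptions.Compiled` are assumptions, not part of the answer):

    using System.Text.RegularExpressions;

    static class TagStripper
    {
        // The pattern above, compiled once for reuse across large inputs.
        static readonly Regex Tag = new Regex(
            @"<[a-zA-Z0-9/_-]+?(("".*?"")|([^<""']+?)|('.*?'))*?>",
            RegexOptions.Compiled);

        public static string Strip(string html) => Tag.Replace(html, string.Empty);
    }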

EDIT:

Updated regex (it performs a lot better than the previous one). Moreover, if you need to strip out script code, I suggest performing a little cleanup before the first pass, something like replacing `<script.+?</script>` with nothing.
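
That pre-cleaning step might look like this (a sketch; `RegexOptions.Singleline` and `IgnoreCase` are assumptions so the pattern can span newlines and match uppercase tags):

    using System.Text.RegularExpressions;

    static class Precleaner
    {
        // Remove <script>...</script> blocks wholesale before the main pass,
        // so script bodies don't leak into the stripped output.
        public static string Preclean(string html) =>
            Regex.Replace(html, @"<script.+?</script>", string.Empty,
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }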

Gabber
  • I ended up going with something much like this. There's actually a series of regular expressions that are run: one to handle things where everything between the tags needs to go - scripts, as you suggest, but also headers, styles, and a couple of others. A couple to handle specific tags that need to be replaced with whitespace. And one generic one like the above that handles everything else, though it did end up being quite a bit more complicated in order to get its false positive rate down to size. – Sean U Aug 23 '12 at 00:21
  • Good! Publish your regex then! (please :) ) – Gabber Aug 23 '12 at 06:48
  • 1
    Here's the general version: '?\w+(?:\s+[-\w:]+(?:=(?:""[^>""]*""|'[^>']*'|[-\w:;,\./#=&_\?@\(\)\+%!\*]*))?)*\s*/?>' The tag-specific ones are created by replacing the leading `\w` and, if close tags shouldn't be replaced, leaving out the leading `/?`. – Sean U Aug 23 '12 at 13:48
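
Pieced together, the layered pipeline these comments describe might look roughly like this (a sketch: the tag lists in the first two patterns are illustrative guesses, and only the final pattern is the one posted above):

    using System.Text.RegularExpressions;

    static class HtmlTagStripper
    {
        // 1. Tags whose entire content should vanish.
        static readonly Regex PairedContent = new Regex(
            @"<(script|style|head)\b.*?</\1\s*>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // 2. Tags that should become whitespace so words don't run together.
        static readonly Regex BlockBreaks = new Regex(
            @"</?(p|br|div|tr|li)\b[^>]*>",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // 3. The generic pattern from the comment above, for everything else.
        static readonly Regex AnyTag = new Regex(
            @"</?\w+(?:\s+[-\w:]+(?:=(?:""[^>""]*""|'[^>']*'|[-\w:;,\./#=&_\?@\(\)\+%!\*]*))?)*\s*/?>",
            RegexOptions.Compiled);

        public static string Strip(string html)
        {
            html = PairedContent.Replace(html, " ");
            html = BlockBreaks.Replace(html, " ");
            return AnyTag.Replace(html, "");
        }
    }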

I'm just thinking outside the box here, but you may consider leveraging something like Microsoft Word, or maybe OpenOffice.

I've used Word automation to translate HTML to DOC, RTF, or TXT. The HTML to TXT conversion native to Word would give you exactly what you want, stripping all of the HTML tags and converting it to text format. Of course this wouldn't be efficient at all if you're processing tons of tiny HTML files since there's some overhead in all of this. But if you're dealing with massive files this may not be a bad choice as I'm sure Word has plenty of optimizations around these conversions. You could test this theory by manually opening one of your largest HTML files in Word and resaving it as a TXT file and see how long Word takes to save.
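
For what it's worth, the Word automation route might be sketched like this (untested here; it assumes Word is installed and a reference to `Microsoft.Office.Interop.Word` has been added):

    using Word = Microsoft.Office.Interop.Word;

    class HtmlToTextViaWord
    {
        static void Convert(string htmlPath, string txtPath)
        {
            var app = new Word.Application();
            try
            {
                // Open the (possibly messy) HTML and let Word parse it.
                Word.Document doc = app.Documents.Open(htmlPath);

                // Saving as plain text drops all of the markup.
                doc.SaveAs2(txtPath, Word.WdSaveFormat.wdFormatText);
                doc.Close();
            }
            finally
            {
                app.Quit();
            }
        }
    }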

And although I haven't tried it, I bet it's possible to programmatically interact with OpenOffice to accomplish something similar.

Steve Wortham