I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:
- It's not known ahead of time whether a document contains HTML at all.
- More likely than not, any HTML will be very poorly formatted.
- Individual documents might be very large, perhaps hundreds of megabytes.
- Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of `<.+/?>` are a no-go, as the example below illustrates. (And stripping XML along with the HTML is less desirable, anyway.)
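To make that concern concrete, here's a small demo (the input string is made up) of how the greedy pattern destroys legitimate non-HTML content:

    using System;
    using System.Text.RegularExpressions;

    class NaiveStripDemo
    {
        static void Main()
        {
            string text = "for (i = 0; i < n; i++) { x > 0 } <b>bold</b> done";
            // The greedy .+ swallows everything between the first '<'
            // and the last '>', taking real content with it.
            string stripped = Regex.Replace(text, "<.+/?>", "");
            Console.WriteLine(stripped);   // "for (i = 0; i  done"
        }
    }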
I'm currently using HTML Agility Pack, and it's just not cutting the mustard. Performance is poorer than I'd like, it doesn't always handle truly awful formatting as gracefully as it could, and lately I've been running into problems with stack overflows on some of the more upsettingly large files.
I suspect that all of these problems stem from the fact that it's trying to actually parse the data, which makes it a poor fit for my needs. I don't want a syntax tree; I just want (most of) the tags to go away.
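To illustrate what I mean by "actually parse": the usual HAP route to stripped text builds a full DOM just to get at the text nodes, along these lines (a minimal sketch of the standard recipe, not my exact code):

    using HtmlAgilityPack;

    static class HapStrip
    {
        // Parse the whole document into a node tree, then collect the
        // concatenated text. The parse step is where the cost (and,
        // presumably, the deep recursion) lives.
        public static string Strip(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc.DocumentNode.InnerText;
        }
    }

All the work of building and walking that tree is pure overhead when the tree itself is of no interest.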
Using regular expressions seems like the obvious candidate. But then I remember this famous answer, and it makes me worry that it's not such a great idea. That diatribe's points are very focused on parsing, though, not on dumb tag-stripping. So are regexes OK for this purpose?
Assuming it isn't a terrible idea, suggestions for a regex that would do a good job are very welcome.
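For concreteness, here's the sort of thing I have in mind (a rough sketch, very much a guess at a starting point rather than a vetted pattern):

    using System.Text.RegularExpressions;

    static class TagStripper
    {
        // '<', an optional '/', a letter to start the tag name, then
        // anything that isn't '>'. The negated class avoids the greedy
        // blow-up of <.+/?>, and requiring a leading letter leaves
        // stray brackets like "x < y" alone. Comments, CDATA, and
        // attribute values containing '>' would still fool it.
        static readonly Regex TagLike =
            new Regex(@"</?[a-zA-Z][^>]*>", RegexOptions.Compiled);

        public static string Strip(string input)
        {
            return TagLike.Replace(input, "");
        }
    }

Whether something along these lines can be made robust enough, and fast enough on multi-hundred-megabyte inputs, is exactly what I'm unsure about.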