What's the best way to remove HTML from a string?

Question

I recently started using the following RegEx in a ReReplace() function to strip HTML tags from a string using ColdFusion. Please note: I am not using this as protection from XSS or SQL injection; this is only to remove existing and safe HTML from a string before it's displayed in an HTML title attribute.

REReplaceNoCase(str,"<[^>]*>","","ALL")

In a semi-related question I asked how to modify my RegEx to include spaces and line breaks. I was told that using RegEx for this purpose is not appropriate and this post was referenced as an explanation.

I strongly suspect though that the regular expressions you have posted don't in fact work correctly. I'd advise you not to use regular expressions to parse HTML as HTML is not a regular language. Use an HTML parser instead. (Mark Byers)

If this is true, what is the appropriate tool for removing HTML from a string before it's displayed? (Bearing in mind the HTML is already safe; it's sanitized before entry to the DB.)

I am aware of HTMLEditFormat() and HTMLCodeFormat(), but those two functions do not provide what I need; the former replaces special characters with their HTML-escaped equivalents, while the latter does exactly the same but also wraps the string in a <pre> tag.

What I would like to do is strip HTML and line breaks from a string before I display it in an HTML title attribute: <a title="My string without HTML goes here">...</a>

There are times when the HTML is not necessary. Say you wanted to display an excerpt from a post without the HTML stored along with it, for instance.

score 5 · Accepted Answer · answered Dec 29 '10 at 01:46

I disagree with the reasoning you quote. While HTML should not be parsed with regexen, stripping tags is perfect for them.

But you will want to be more careful than just <[^>]*>, since that would turn

<span title=">">...</span>

into the ill-formed

">...</span>

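So you need something like <([^'">]|"[^"]*"|'[^']*')*> instead. (The ' has to be excluded from the first character class as well; otherwise the single-quote alternative is never reached and a > inside a single-quoted value still ends the match early.) You can strip out line breaks with character replacement instead of a regex, but if you prefer a regex you can use something like \n (or even combine it with the above using alternation, but that's even less efficient).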
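
To make the difference concrete, here is a minimal sketch in Python rather than ColdFusion (an assumption made only so the example is easy to run; the pattern syntax used here is common to most regex engines). It compares the naive pattern with a quote-aware pattern along the lines of the one above, with ' also excluded from the first character class so that single-quoted values are matched by their own alternative. The names NAIVE_TAG, QUOTE_AWARE_TAG and strip_tags are hypothetical.

```python
import re

# Naive pattern from the question: the first ">" ends the match,
# even when that ">" sits inside a quoted attribute value.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware pattern: a tag body is a run of ordinary characters,
# double-quoted strings, or single-quoted strings.
QUOTE_AWARE_TAG = re.compile(r"""<([^'">]|"[^"]*"|'[^']*')*>""")


def strip_tags(text):
    """Remove tags with the quote-aware pattern, then drop line breaks."""
    without_tags = QUOTE_AWARE_TAG.sub("", text)
    # Plain character replacement for line breaks; no regex needed.
    return without_tags.replace("\r", "").replace("\n", "")


sample = '<span title=">">Hello,\nworld</span>'
print(repr(NAIVE_TAG.sub("", sample)))  # a stray '">' is left at the front
print(strip_tags(sample))               # Hello,world
```

In ColdFusion the same pattern would go into REReplace(str, pattern, "", "all"); only the string quoting differs.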

Charles

Charles, many thanks for the explanation. I will leave this open for a little longer to see if I can encourage any further input. Can you clarify what you meant by using "character replacement"? Is there a particular function that does this? Any further elaboration on the concept would be appreciated. – Mohamad Dec 29 '10 at 02:29
I was thinking of `Replace(str, Chr(10), '', 'all')` (CFML string literals have no `\n` escape, so the newline is written as `Chr(10)`). – Charles Dec 29 '10 at 04:59
Charles, the problem with the string you made is that I can't use it. The double quotes are messing up the function and causing an error, since the whole regEx string has to sit between double quotes... any idea how I can get around this? – Mohamad Jan 03 '11 at 13:09
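Double the quotes (CFML escapes a double quote inside a double-quoted string by doubling it, not with a backslash): type `"<([^'"">]|""[^""]*""|'[^']*')*>"`. – Charles Feb 17 '13 at 17:57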
Save the regular expression into a variable with cfsavecontent. For example – BallisticPugh Dec 30 '13 at 16:59
Save the regular expression to a variable using cfsavecontent – BallisticPugh Dec 30 '13 at 17:00

score 1 · Answer 2 · answered Dec 29 '10 at 04:16

Use the Chilkat HTML parser. We used it in my academic project to fetch all the content and hyperlinks from HTML pages to build a basic search engine.

A_Var

Pif · Answer 3 · 2011-01-03T13:47:59.097

If the HTML snippet is to be included in a title, you can probably cover all bases with regexes and enough testing.

Still, as a general hint, if you have to handle a larger snippet, I'd go the XML/DOM way with Java, either by parsing with dom4j and grabbing the text or, more likely, by building the result in a StringBuilder from a SAX parser.

[EDIT] When I first answered, I was about to write that the HTML must be reasonably well-formed, but I assumed you had at least a bit of control over the source. If you don't, I'll just link quickly to JTidy and TagSoup. I haven't tested either, but they are definitely the first things I would try for consuming real-world HTML with CF.
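
As a rough illustration of the event-driven idea (collect text as the parser reports it, ignore the tags), here is a minimal sketch using Python's standard html.parser module. This is a stand-in chosen for runnability, not the SAX, dom4j, JTidy or TagSoup setup mentioned above, and TextCollector and html_to_text are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    # Collapse line breaks and runs of whitespace into single spaces.
    return " ".join(parser.text().split())


print(html_to_text("<p>Hello <b>big</b>\nworld</p>"))  # Hello big world
```

The result is plain text with entities decoded, so it would still need attribute escaping before being placed in a title attribute. Note that this sketch also keeps the contents of script and style elements as text.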
