Strip only valid html

Question

I'm trying to strip HTML tags from a piece of text. The trouble is that whatever I use (regex, strip_tags, etc.) runs into the same problem: it also strips text that is not HTML but looks like it.

Some <foo@bar.com> Content --> Some Content
Some <Content which looks like this --> Some

Is there a way I can get around this?

It'll be a problem: the validity of tags depends on the HTML flavour you are using. Something valid in transitional will not be valid in strict, something valid in HTML5 will be invalid in XHTML, and so on. – Mołot, Jul 19 '13 at 10:26
I'm getting it from user input. I want to strip all tags without running into the problems above. – Jonathan, Jul 19 '13 at 10:28
You also have to deal with people using HTML tags as text. If you aren't allowing any HTML, then just escape it instead of trying to remove it. – Quentin, Jul 19 '13 at 10:28
Sometimes this can come in as an email, so it can come with tons of HTML tags which are not needed, such as styling. Escaping will just make it look messy. – Jonathan, Jul 19 '13 at 10:29
But the point is: sometimes what is a tag in one HTML edition only looks like a tag in another. How do you want to deal with that? Ask the user about the edition? And what about XHTML extended through a DTD? – Mołot, Jul 19 '13 at 10:30
If it's an HTML-formatted email, then you can just run it through an HTML parser and dump it to text. You could even pipe it through lynx. – Quentin, Jul 19 '13 at 10:30
I want to do it using just PHP, so lynx is out, and the server does not have DOMDocument installed. Are there other PHP parsers out there? – Jonathan, Jul 19 '13 at 10:33
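
For reference, Quentin's suggestion of escaping instead of stripping could look roughly like the sketch below. The helper name escape_for_display is made up for illustration; htmlspecialchars() is the PHP built-in doing the work. Nothing is removed, so text that merely looks like a tag survives, but real markup shows up as literal text (the messiness Jonathan mentions for HTML email).

```php
<?php
// Hypothetical helper name; htmlspecialchars() is the built-in that does the escaping.
function escape_for_display(string $input): string
{
    // Convert <, >, &, and both quote characters to HTML entities.
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

echo escape_for_display('Some <foo@bar.com> Content'), "\n";
// prints: Some &lt;foo@bar.com&gt; Content
```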

score 3 · Accepted Answer · edited May 23 '17 at 12:03

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.

A simple 80% solution would be to look for all known tags and strip them.

RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')

The code would be more readable if you keep the tags in an array and build the expression from it. This approach will not handle HTML comments well, so if you need more than hack quality, don't use a hack: use an actual HTML parser (e.g. DOMDocument in PHP).
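
A minimal PHP sketch of the array-based variant, under a few assumptions: the function name strip_known_tags is made up, the tag list is only the subset visible in the expression above (the full list is elided there and is not reconstructed here), and `[^>]*` is used in place of `.*?` so a match cannot run past the next `>`.

```php
<?php
// Hypothetical helper: removes only tags whose names are listed in $tags.
function strip_known_tags(string $text, array $tags): string
{
    // Quote each name so it is matched literally inside the pattern.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags));

    // "<" or "</", a known tag name, a word boundary, then anything up to the next ">".
    $pattern = '~</?(?:' . $names . ')\b[^>]*>~i';

    // preg_replace() returns null on failure; fall back to the input in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Illustrative subset only; extend it with the tags you expect to see.
$tags = ['a', 'b', 'blockquote', 'cite', 'dd', 'dl', 'dt', 'u'];

echo strip_known_tags('Some <foo@bar.com> <b>Content</b>', $tags), "\n";
// prints: Some <foo@bar.com> Content
```

One limitation worth knowing: the word boundary also matches between a tag name and "@", so an address such as <a@example.com> would still be stripped because "a" is in the list.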

answered Jul 19 '13 at 10:40

Patrick Fisher


I tried that but then it failed with anything beginning with a tag such as - Matching on html tag . – Jonathan Jul 19 '13 at 10:44
The \b (word boundary) takes care of that case. – Patrick Fisher Jul 19 '13 at 10:45

score 2 · Answer 2 · answered Jul 19 '13 at 10:40

Have you tried the HTML Purifier library? You can configure it to strip all tags out. I've found the library very reliable.

answered Jul 19 '13 at 10:40

Touh312


I don't have control over what libraries I install. So this won't work for me. – Jonathan Jul 19 '13 at 10:46
@Bonzo: It's just PHP code. You unpack it into a directory and put `require_once '/path/to/HTMLPurifier.auto.php';` in your code to load it. – Ilmari Karonen Jul 19 '13 at 11:27
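
Putting the two comments above together, a rough sketch of the HTML Purifier route might look like this. The wrapper name purify_to_text and the null return for a missing file are assumptions made for illustration; setting HTML.Allowed to an empty string is the commonly cited way to permit no elements. How the library treats input such as <foo@bar.com> has not been verified here, so test it against your own data.

```php
<?php
// Hypothetical wrapper; returns null when the library file is not found.
function purify_to_text(string $html, string $autoloadPath): ?string
{
    if (!is_file($autoloadPath)) {
        return null;
    }
    require_once $autoloadPath;

    $config = HTMLPurifier_Config::createDefault();
    // Empty allowlist: no HTML elements are permitted in the output.
    $config->set('HTML.Allowed', '');

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($html);
}

// Usage, with the placeholder path from the comment above:
// $text = purify_to_text($userInput, '/path/to/HTMLPurifier.auto.php');
```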
