I'm trying to strip HTML tags from a piece of text. The trouble is that whatever I use (regex, strip_tags, etc.) runs into the same problem: it also strips text that is not HTML but looks like it.

Some <foo@bar.com> Content  -->  Some Content
Some <Content which looks like this  -->  Some

Is there a way I can get around this?

Jonathan
  • It'll be a problem: the validity of tags depends on the HTML flavour you are using. Something valid in transitional will not be valid in strict, something valid in HTML5 will be invalid in XHTML, and so on. – Mołot Jul 19 '13 at 10:26
  • I'm getting it from user input. I want to strip all tags without getting the problems as above. – Jonathan Jul 19 '13 at 10:28
  • You also have to deal with people using HTML tags as text. If you aren't allowing any HTML, then just escape it instead of trying to remove it. – Quentin Jul 19 '13 at 10:28
  • Sometimes this can come in as an email - so it can come with tons of HTML tags which are not needed such as styling - escaping will just make it look messy. – Jonathan Jul 19 '13 at 10:29
  • But the point is that something that is a tag in one HTML edition only looks like a tag in another. How do you want to deal with that? Ask the user which edition they mean? And what about XHTML extended with a DTD? – Mołot Jul 19 '13 at 10:30
  • If it's an HTML-formatted email, then you can just run it through an HTML parser and dump it to text. You could even pipe it through lynx. – Quentin Jul 19 '13 at 10:30
  • I want to do it using only PHP, so lynx is out, and the server does not have DOMDocument installed. Are there other PHP parsers out there? – Jonathan Jul 19 '13 at 10:33
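
For reference, the escaping approach suggested in the comments above could look roughly like this. It is only a sketch: the variable names are made up for illustration, and it assumes the input is UTF-8 and that no HTML at all should be rendered.

```php
<?php
// Illustrative input taken from the question; $input and $escaped are hypothetical names.
$input = 'Some <foo@bar.com> Content';

// htmlspecialchars() converts <, >, & and quotes into entities,
// so nothing is deleted and nothing is interpreted as markup.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // Some &lt;foo@bar.com&gt; Content
```

This keeps the address intact, but as noted in the comments it leaves real markup visible as text, so it does not help with HTML-heavy email.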

2 Answers

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.

A simple 80% solution would be to look for all known tags and strip them.

RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')

The code would be more readable if you keep the tags in an array and build the expression by looping through them. This approach does not handle HTML comments well and remains a hack. If you need correctness, use an actual HTML parser (e.g. DOMDocument in PHP).
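
As a rough sketch of the array-based version in PHP (assuming PHP 7 or later): the function name `strip_known_tags` and the short tag list are made up for illustration, and the list is deliberately incomplete. It also uses `[^>]*` rather than `.*?` so that a match never runs past the closing `>` or depends on how `.` treats newlines.

```php
<?php
// Hypothetical helper: removes only tags whose names appear in $tags.
function strip_known_tags(string $text, array $tags): string
{
    // Escape each tag name so it is safe to embed in the pattern.
    $names = array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags);

    // "<", optional "/", one known tag name, a word boundary,
    // then everything up to the next ">". Case-insensitive.
    $pattern = '~</?(?:' . implode('|', $names) . ')\b[^>]*>~i';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Illustrative subset only, not a complete list of HTML tags.
$tags = ['a', 'b', 'blockquote', 'cite', 'dd', 'dl', 'dt', 'u'];

echo strip_known_tags('Some <b>bold</b> <foo@bar.com> Content', $tags), "\n";
```

With this list, `<foo@bar.com>` survives because `foo` is not a known tag name. The approach still has the limits described above: an address such as `<a@example.com>` starts with a known tag name followed by a word boundary, so it would be stripped.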

Patrick Fisher

Have you tried the HTML Purifier library? You can configure it to strip out all tags, and I've found it very reliable.

Touh312