
I am trying to make a so-called text cleaner so that I can get rid of a few HTML elements without using the strip_tags() function.

My regex looks like this: <em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>

My code looks like this:

$string = "some very messy string here ";
$pattern = '<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>';
$replace = ' ';

$clean =  preg_replace($pattern, $replace, $string);

echo $clean;

For reasons that are beyond my understanding, the echo prints nothing.

Thank you for your time

UPDATE #1

If you are asking whether I want to get rid of the tables along with all the content inside them, the answer is yes.

Mike
  • what is the objective of this code - why do you want to avoid using strip_tags? – AD7six Oct 13 '12 at 14:52
  • Strip tags would not delete the content of tables which I would like to do. – Mike Oct 13 '12 at 14:55
  • You're better off not using a regex to pseudo-parse HTML. strip_tags will strip tags, and if you want to remove tables, write a routine to remove tables. You're going to get weird results with e.g.: "... ... ...".
    – AD7six Oct 13 '12 at 15:00
  • He would have to run the replacement multiple times to get rid of nested tables. – Martin Ender Oct 13 '12 at 15:02
  • @m.buettner That wouldn't work. After running it the first time, the input string would be "before table string...after table string"; there would be no <table> to match, so a subsequent pass would not remove it. Relevant: http://stackoverflow.com/a/1732454/761202
    – AD7six Oct 13 '12 at 15:05
  • @AD7six Ah, right... I tend to forget that ungreedy strings are only ungreedy about the end of the match, not about its beginning. – Martin Ender Oct 13 '12 at 15:06

2 Answers

4

Your regular expression needs delimiters. For example:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~';

Read up on delimiters here.
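
A minimal sketch of the difference, using a made-up input string (the `$string` value is only an assumption for illustration; the pattern is the one from the question). Without delimiters, PCRE treats the leading `<` and the first `>` as a bracket-style delimiter pair, rejects what follows as unknown modifiers, and preg_replace() returns null, which is why the echo shows nothing:

```php
<?php
// Hypothetical sample input, not taken from the question.
$string = '<p class="x">Hello&nbsp;<em>world</em></p>';

$body = '<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>';

// No delimiters: "<" ... ">" is taken as the delimiter pair, the rest is
// read as modifiers, a warning is raised and null is returned.
// The @ only silences the warning for this demonstration.
$broken = @preg_replace($body, ' ', $string);

// With "~" as the delimiter the pattern compiles and the tags are replaced.
$clean = preg_replace('~' . $body . '~', ' ', $string);

var_dump($broken); // NULL
var_dump($clean);
```

Checking the return value for null (or calling preg_last_error()) is a cheap way to notice this kind of failure instead of echoing an empty string.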

Also note that some HTML specifications (all but XHTML as far as I know) allow uppercase tags, too. So consider adding the modifier for case-insensitivity to your regular expression. Furthermore, removing tables might not work if there are linebreaks between the opening and closing tags (because . does not match line breaks by default). Add the DOTALL modifier s to solve this:

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';
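
To illustrate what the two modifiers change, here is a small sketch that keeps only the table part of the pattern. The sample string is an assumption chosen to have both an uppercase tag and line breaks inside the table:

```php
<?php
// Hypothetical input: uppercase tag names and newlines inside the table.
$string = "before<TABLE border=\"1\">\n<tr><td>cell</td></tr>\n</TABLE>after";

$table = '<table[^>]*>(.*?)</table[^>]*>';

// No modifiers: "<table" does not match "<TABLE", so nothing is replaced.
$plain = preg_replace('~' . $table . '~', ' ', $string);

// Only "i": the opening tag matches, but "." stops at the first newline,
// so the closing tag is never reached and nothing is replaced either.
$caseOnly = preg_replace('~' . $table . '~i', ' ', $string);

// "i" and "s" together: the whole table, content included, is replaced.
$fixed = preg_replace('~' . $table . '~is', ' ', $string);

var_dump($plain === $string);    // bool(true)
var_dump($caseOnly === $string); // bool(true)
var_dump($fixed);                // string(12) "before after"
```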

One final note: as the others pointed out, regex solutions to HTML problems should be taken with a grain of salt. Nested tables will cause issues, as will comments. If you know the data you are dealing with very well, the problem might be much less complex than general HTML. But be sure your code is at least valid and you know about all oddities like nested structures and HTML characters in comments and so on.
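
The nested-table issue can be reproduced with a short sketch. The input below is a made-up example, and only the table part of the pattern is used. The lazy `(.*?)` stops at the first closing tag, so the outer table's closing markup is left behind, and a second pass finds no opening tag to match:

```php
<?php
// Hypothetical input with one table nested inside another.
$string = 'a<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>b';

$pattern = '~<table[^>]*>(.*?)</table[^>]*>~is';

// First pass: matches from the outer <table> to the INNER </table>.
$once = preg_replace($pattern, ' ', $string);

// Second pass: there is no opening <table> left, so nothing changes.
$twice = preg_replace($pattern, ' ', $once);

var_dump($once);            // string(21) "a </td></tr></table>b"
var_dump($twice === $once); // bool(true)
```

If the articles can contain nested tables, a routine built on an HTML parser is likely the safer way to drop them, as suggested in the comments on the question.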

Martin Ender
  • That did it but I think something is broken in the definition of the regex because it does not remove tables. – Mike Oct 13 '12 at 14:57
  • `.` does not match line breaks by default. Add another modifier after the `i`: `s`. It's called the DOTALL modifier, and with it the dot will also match line breaks. I'll add it to the answer. – Martin Ender Oct 13 '12 at 14:58
3

First of all, have a look at this answer. This should set things straight from the beginning. If, after you've read the answer, you still want to proceed, I give you the following:

I want to <em<p>>emphasize</<p>em> that it's not possible!

Try to clean that!
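
For reference, here is a sketch of what the delimited pattern from the other answer does with that sentence. It is only a demonstration under the question's settings, not a fix. With the question's replacement of a single space, the leftover fragments no longer look like tags the pattern knows, so repeated passes do not help; with an empty replacement, a second pass happens to clean this particular input:

```php
<?php
$input = "I want to <em<p>>emphasize</<p>em> that it's not possible!";

$pattern = '~<em>|</em>|<p[^>]*>|</p[^>]*>|<span[^>]*>|</span[^>]*>|<div[^>]*>|</div[^>]*>|&nbsp;|<table[^>]*>(.*?)</table[^>]*>~is';

// Replacement ' ' (as in the question): only the two <p> are removed,
// leaving "<em >" and "</ em>", which the pattern does not recognise.
$once  = preg_replace($pattern, ' ', $input);
$twice = preg_replace($pattern, ' ', $once);

// Replacement '': the first pass rebuilds <em>...</em>, the second removes it.
$onceEmpty  = preg_replace($pattern, '', $input);
$twiceEmpty = preg_replace($pattern, '', $onceEmpty);

var_dump($once);
var_dump($twice === $once); // bool(true)
var_dump($twiceEmpty);
```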

aefxx
  • Technically he is not trying to parse it. Also, is this even valid HTML? If so, what would the semantics of this be? Lastly, you could probably solve it by asserting that there are also no opening `<` before the closing `>` and then running the replacement multiple times. – Martin Ender Oct 13 '12 at 15:01
  • Could not agree more with that! But here the data looks quite uniform and I have to choose between this regex or clean some 5000 articles by hand, which would not be clever or effective. – Mike Oct 13 '12 at 15:03
  • @m.buettner Did you even read the link I've posted? I don't care whether it is valid HTML; it's not the client's (nor a hacker's) responsibility to provide valid HTML. Go on, come up with a regex that catches my sentences and I'll get back to you with an even more complex one, hrhrhr. – aefxx Oct 13 '12 at 15:03
  • @mugur Please do yourself a favor and use some sophisticated tools like http://tidy.sourceforge.net/ – aefxx Oct 13 '12 at 15:04
  • @aefxx I have read and posted that link myself several dozen times. And writing a regex that can also catch strings that are not valid for the set problem is rarely possible, is it? I totally agree with you that HTML is too complex for regular expressions, but sometimes they still get the job done. – Martin Ender Oct 13 '12 at 15:05
  • @m.buettner Sometimes ... that is when you have total control over the input and you KNOW that your regex will catch all eventualities. I doubt this is the case. – aefxx Oct 13 '12 at 15:07
  • Fair enough. Apparently it solved his problem, and he learned something about delimiters and modifiers. I'm just saying that - while regexes generally can't parse HTML - one should give the problem at hand a thought anyway before ruling them out categorically. – Martin Ender Oct 13 '12 at 15:11
  • @m.buettner I'm ruling it out that drastically because it is the wrong tool for the job. Even if it fixes his problem superficially, it still tears a hole in his application. – aefxx Oct 13 '12 at 15:19