1

I'm trying to run a script. I have put some content into a variable $x, which is full of HTML code. Now I want to remove all HTML comments from it and write the result to a file.

I have this regex: <!--([\s\S]*?)-->. It works fine in editors and on www.phpliveregex.com, but not in my PHP script. Maybe you can help me out.

//$x = content
$summary2 = preg_replace("<!--([\s\S]*?)-->", "", $x);
fwrite($fh, $summary2);

Edit: This is an example of the content I want to get rid of.

</ul>
<p>
 Evaluation<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG />
<o:TargetScreenSize>1024x768</o:TargetScreenSize>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Normale Tabelle";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]--></p>
<ul>
 <li>
Kra33
  • At the very least you appear to be missing any delimiters. – Jonnix Dec 23 '15 at 09:55
  • Surround the contents of the regex in `//`. – ndnenkov Dec 23 '15 at 09:55
  • @JonStirling - Even worse: `<` and `>` actually act as delimiters [[ref](http://php.net/manual/en/regexp.reference.delimiters.php)] – Álvaro González Dec 23 '15 at 09:56
  • @ÁlvaroGonzález :o good point! – Jonnix Dec 23 '15 at 09:58
  • When you are editing HTML code in PHP, you should not use regex but DOM instead. Here you can see a code example of how to do it with DOM: http://stackoverflow.com/questions/6305643/remove-comments-from-html-source-code – Oliver Nybroe Dec 23 '15 at 10:01
  • @uruloke Thanks man, I will take a look at this and consider using it instead of regex. But either way I would like to know why this isn't working. – Kra33 Dec 23 '15 at 10:21

4 Answers

3

What are Regular Expressions?

A sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text.

What are delimiters?

When using the PCRE functions, it is required that the pattern is enclosed by delimiters. A delimiter can be any non-alphanumeric, non-backslash, non-whitespace character.

Which pair of characters can be used as delimiters?

Often used delimiters are forward slashes (/), hash signs (#) and tildes (~).

It is also possible to use bracket style delimiters where the opening and closing brackets are the starting and ending delimiter, respectively. (), {}, [] and <> are all valid bracket style delimiter pairs.
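As a rough illustration of the delimiter rules quoted above, the same pattern can be wrapped in several delimiter styles and still behave identically. The sample string and the variable names below are made up for this sketch and are not part of the question.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = 'a<!-- note -->b';

// The same pattern wrapped in four different delimiter styles.
$patterns = [
    '/<!--([\s\S]*?)-->/',   // forward slashes
    '#<!--([\s\S]*?)-->#',   // hash signs
    '~<!--([\s\S]*?)-->~',   // tildes
    '{<!--([\s\S]*?)-->}',   // bracket-style pair
];

// Apply each variant to the same input and collect the results.
$results = [];
foreach ($patterns as $pattern) {
    $results[] = preg_replace($pattern, '', $html);
}

print_r($results); // every entry is "ab"
```

The delimiter is only a wrapper around the pattern, so it makes sense to pick one that does not occur in the pattern itself and therefore needs no escaping.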

What about my case <!--([\s\S]*?)-->?

Your regex happens to begin with < and end with >, and PHP takes those two characters as a bracket-style delimiter pair. The pattern that actually gets compiled is therefore !--([\s\S]*?)--, which is probably not what you want.

What should I do?

Wrap it within a pair of delimiters. E.g. /<!--([\s\S]*?)-->/

Does it work?

Check it live
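For a quick local check, here is a minimal sketch comparing the pattern from the question with the wrapped version. The short sample string is an assumption standing in for the real content of $x, and single quotes are used instead of the question's double quotes (the pattern PHP sees is the same).

```php
<?php
// Hypothetical sample input; the real $x would hold the full HTML.
$x = 'a<!-- note -->b';

// As written in the question: PHP reads < and > as the delimiters,
// so the compiled pattern is only !--([\s\S]*?)--
$broken = preg_replace('<!--([\s\S]*?)-->', '', $x);

// Wrapped in explicit delimiters: the whole comment is matched.
$fixed = preg_replace('/<!--([\s\S]*?)-->/', '', $x);

var_dump($broken); // string(4) "a<>b"
var_dump($fixed);  // string(2) "ab"
```

The first call still removes the comment body, but it leaves the surrounding < and > behind, so the output is not what was intended.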

Is it a good practice?

No, it is not! (Though, to be honest, I do it sometimes.) Regular expressions are not made for modifying HTML/XML elements. For this specific purpose you should use the DOMDocument class, which will make your life much easier and your code cleaner:

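// $str is assumed to hold the HTML to clean (the question calls it $x).
$dom = new DOMDocument();
// The flags keep libxml from adding <html>/<body> wrappers and a doctype.
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// Select every comment node in the document and detach it from its parent.
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}
echo $dom->saveHTML();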

Check it live

revo
  • But I'd add an explanation (not acting on the original string) as well (+1 nevertheless). – Jan Dec 23 '15 at 11:00
0

Since < and > are being used as delimiters, you need to wrap the pattern in an extra pair and escape the inner ones so that they are matched literally and removed from your string:

$summary2 = preg_replace("<\<!--([\s\S]*?)--\>>", "", $x);
jiboulex
0

First of all, you forgot to add delimiters.

Usually, a warning is issued when delimiters are missing, because PHP treats that as a regex syntax error. In your particular case, though, no warning is generated because < and > form a valid delimiter pair. You could also have used { }. Since your < and > are taken as delimiters, your regexp no longer matches what you expect.

Regexes without delimiters usually work on testing sites because those sites add the delimiters for you. That most likely explains why your regex works as is on the site where you tested it.
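To see the "usual" case, here is a small sketch with a pattern that has no valid delimiter at all. The pattern 'abc' and the subject are invented for the example, and the error handler is only there to capture the warning text instead of printing it.

```php
<?php
// Collect warnings instead of printing them (illustrative handler).
$warnings = [];
set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
    $warnings[] = $errstr;
    return true; // mark the warning as handled
});

// 'abc' starts with a letter, which cannot be a delimiter,
// so PHP raises a warning and preg_replace() returns null.
$result = preg_replace('abc', '', 'xabcx');

restore_error_handler();

var_dump($result);   // NULL
print_r($warnings);  // contains the delimiter warning
```

With the pattern from the question nothing like this happens, because < is an acceptable opening delimiter and > closes it.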

Secondly, I suggest replacing [\s\S]*? with .*? and using the s modifier, which makes . match newlines as well. That makes it easier to see what you are trying to match.
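A minimal sketch of that suggestion, assuming ~ as the delimiter and a made-up two-line comment as input:

```php
<?php
// Hypothetical input with a comment that spans two lines.
$html = "x<!-- line one\nline two -->y";

// Character-class version from the question, with delimiters added.
$withClass = preg_replace('~<!--[\s\S]*?-->~', '', $html);

// Dot version with the s modifier: . also matches newlines.
$withDotAll = preg_replace('~<!--.*?-->~s', '', $html);

// Dot version without the modifier: . stops at the newline,
// so the multi-line comment is not matched and nothing is removed.
$withoutFlag = preg_replace('~<!--.*?-->~', '', $html);

var_dump($withClass === 'xy');    // bool(true)
var_dump($withDotAll === 'xy');   // bool(true)
var_dump($withoutFlag === $html); // bool(true)
```

The two working variants are equivalent here; the s modifier just states the intent more directly.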

QuentinC
0

In PHP, preg_replace() does not modify the original string; it returns the new string, so you need to capture its return value. The following works flawlessly (see a demo here as well, in the lower half). As mentioned in the comments, you also need to add delimiters (~ in my case):

<?php
$string = '</ul>
<p>
    Evaluation<!--[if gte mso 9]><xml>
<o:OfficeDocumentSettings>
<o:AllowPNG />
<o:TargetScreenSize>1024x768</o:TargetScreenSize>
</o:OfficeDocumentSettings>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Normale Tabelle";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]--></p>
<ul>
    <li>';

$regex = '~<!--([\s\S]*?)-->~';
$replacement = '';
$newString = preg_replace($regex, $replacement, $string);
echo $newString;

?>
Jan