4

The code below works perfectly on XAMPP on my PC, but does not work on my newly bought VPS, where it crashes my script.

preg_match_all( "/$regex/siU" , $string , $matches , PREG_SET_ORDER );

This is expected to simply fetch links and titles from HTML.

Earlier today I hit a similar regex problem. The code ran fine on my local server but produced a "Connection Was Reset" error on the VPS. The cause was some commented-out HTML (with PHP code inside it) that the code below was supposed to remove to optimize the output. The connection reset is now resolved, but the HTML still has the comments in the browser source.

$string = preg_replace( '/<!--(.|\s)*?-->/' , ''    , $string );

So the problem is clear: these regex functions are not working properly. But I do not know the solution.

Can anyone help me solve this?

Solved:

Thanks to https://stackoverflow.com/a/12761686/369005 @vimishor

Hamid Sarfraz
  • phpinfo() shows Configure Command '--with-pcre-regex=/opt/pcre' so PCRE is installed. – Hamid Sarfraz Oct 06 '12 at 16:08
  • Has it anything to do with server logs? – Hamid Sarfraz Oct 06 '12 at 16:08
  • The configure command has little to do with it; you need to find out why the processes are dying .. seems like your pcre has some linkage issues. – Ja͢ck Oct 06 '12 at 16:14
  • Apache error log file is empty. Seen using vim error_log in /var/log/httpd/ – Hamid Sarfraz Oct 06 '12 at 16:32
  • Just read the question again and I may have misunderstood; the "connection reset" happens because the regular expression doesn't get applied properly? You're performing regexp on HTML that's actually run on your server?! – Ja͢ck Oct 06 '12 at 16:35
  • @Jack, yes, the connection reset was caused by some PHP code inside HTML comments. I left the code in intentionally because no one was going to see it, but the regex function stopped removing comments from the HTML files, causing the PHP code to run with some wrong parameters. Anyway, the problem still is that the regex functions are not working. – Hamid Sarfraz Oct 06 '12 at 16:47

5 Answers

2

It is a known fact that PCRE sometimes has problems with text longer than 200 lines. Developers from Drupal and GeSHi were hit by this problem in the past.

References:

  1. Drupal PCRE Issue @ March 23, 2012
  2. GeSHi PCRE Issue @ February 02, 2012

It may help to split the text into small chunks (100 lines, for example) and run the regex on each chunk.
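
The chunking idea could be sketched roughly as follows. The helper name `match_in_chunks`, the sample link pattern, and the default chunk size are all illustrative assumptions, not code from the question. Note that a match spanning a chunk boundary (a multi-line comment, for instance) would be missed, so this only suits patterns that fit within one chunk.

```php
<?php
// Hypothetical helper: apply a pattern to fixed-size groups of lines
// instead of to the whole text at once.
function match_in_chunks(string $pattern, string $text, int $linesPerChunk = 100): array
{
    // Split on Windows, old Mac, and Unix line endings.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $all = [];

    foreach (array_chunk($lines, $linesPerChunk) as $chunk) {
        $piece = implode("\n", $chunk);
        $result = preg_match_all($pattern, $piece, $matches, PREG_SET_ORDER);

        if ($result === false) {
            // preg_last_error() returns one of the PREG_*_ERROR constants.
            throw new RuntimeException('PCRE error code ' . preg_last_error());
        }

        foreach ($matches as $match) {
            $all[] = $match;
        }
    }

    return $all;
}

// Usage with made-up input: 250 lines, one link per line.
$html  = str_repeat("<a href=\"/page\" title=\"Title\">text</a>\n", 250);
$found = match_in_chunks('/<a\s[^>]*href="([^"]*)"[^>]*>/i', $html);
echo count($found), " links found\n";
```

Checking the return value against `false` matters here, because a PCRE failure is otherwise silent.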

Alexandru Guzinschi
  • Ok, let me try. I will reply on the results soon. – Hamid Sarfraz Oct 06 '12 at 16:48
  • Added this code before applying regex. Still nothing. $string = str_replace( array( "\r" , "\n" ) , " " , $string ); – Hamid Sarfraz Oct 06 '12 at 17:08
  • You replaced Mac style line endings with Unix style line endings. My suggestion was to select only the first 100 lines of text and run regex on those lines only ; repeat the procedure until you finish entire text. – Alexandru Guzinschi Oct 06 '12 at 17:22
  • Well the above example replaces both mac and unix style new lines with a space. And, this resolved the issue. You were correct. The problem was with the PCRE issue mentioned above. Thanks for your answer. – Hamid Sarfraz Oct 06 '12 at 17:56
  • Ah, you are right. I missed the `array()` inside `str_replace`. Sorry for that. I'm glad you solved the problem. Best regards. – Alexandru Guzinschi Oct 06 '12 at 18:08
1

Let me stop you there for a second. Parsing HTML with regular expressions is a bad idea, unless it's a very isolated issue on a malformed document. You will want to use a proper parser; for instance, here's an example that strips HTML comments:

$html = <<<EOM
<html>
<body>
<div id="test">
<!--
comment here
-->
</div>
</body>
</html>
EOM;

$d = new DOMDocument;
$d->loadHTML($html);

$x = new DOMXPath($d);

foreach ($x->query('//comment()') as $node) {
        $node->parentNode->removeChild($node);
}

echo $d->saveHTML();
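
Along the same lines, here is a minimal sketch of fetching links and titles with the DOM extension instead of a regex. It assumes "titles" means the `title` attribute of each `<a>` element, which the question does not confirm, and the sample markup is made up.

```php
<?php
// Made-up sample input.
$html = '<html><body><a href="/a" title="First">A</a><a href="/b">B</a></body></html>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => $a->getAttribute('title'), // empty string when the attribute is absent
    ];
}

print_r($links);
```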
Ja͢ck
  • Good example. But even if I implement this code, the issue is still there. The problem is with all PCRE functions. – Hamid Sarfraz Oct 06 '12 at 16:59
  • @HamidSarfraz You have yet to give an example of when it gives a different result, with an input, expression and outputs from both servers. – Ja͢ck Oct 06 '12 at 17:01
  • Ok, this will take a long post, so I will answer to the question below with all details that i think will be useful. Please wait a little. – Hamid Sarfraz Oct 06 '12 at 17:11
  • Thanks Jack. The problem has been resolved. Thanks for your cooperation. Please keep helping others, especially new ones like me. – Hamid Sarfraz Oct 06 '12 at 18:02
1

So the root problem is that the code that's supposed to remove HTML comments isn't working? That's probably because the regex that's supposed to match the comments uses (.|\s)* to work around the fact that . doesn't match newlines. That's almost guaranteed to cause problems, as this answer explains.

The correct way to match anything-including-newlines is to use the s modifier. For example:

'/<!--.*?-->/s'

That turns on single-line mode (also known as DOTALL mode), which allows the . to match newlines. (The author of that other question had to use [\S\s] instead, because JavaScript has no equivalent for single-line/DOTALL mode.)
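
A short sketch of that pattern in use, with a check for a failed call. The sample input is made up. `preg_replace` returns `null` when PCRE hits an error, and `preg_last_error()` reports which one; whether such a failure is what happened on the asker's VPS is only an assumption.

```php
<?php
// Made-up input with a comment that spans several lines.
$string = "<p>keep</p>\n<!-- multi\nline\ncomment -->\n<p>also keep</p>";

// The s modifier lets . match newlines, so no (.|\s) alternation is needed.
$clean = preg_replace('/<!--.*?-->/s', '', $string);

if ($clean === null) {
    // A null result means the call failed; the subject was not processed.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    echo $clean, "\n";
}
```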

Alan Moore
0

It seems the problem is that you are misunderstanding what HTML comments do. According to your comment below your question, the problem is that the HTML comments were not removed, causing PHP to run with the wrong parameters.

However, HTML comments have no influence on which PHP code is or is not run, only on what the browser displays (and runs, in the case of JavaScript). Your PHP code is run before the output gets to the browser.

If you want to comment PHP code out, you will need to put it in a /* */ block or start each line with //.

jeroen
  • Maybe I failed to explain. Let's try again. The second regex call above was supposed to remove the comments, including the PHP code inside them. But on the VPS it does not do that, leaving the comments and PHP code intact. Now imagine the code calls a function that is not declared anywhere (it was removed from the included functions). I left the PHP code in for reference, so that I could see what the script looked like in the past. There was also no security risk, because it was never going to execute. – Hamid Sarfraz Oct 06 '12 at 16:57
  • @Hamid Sarfraz If you are parsing php scripts with your regex, you are right and my answer does not apply. I assumed you were parsing html pages. – jeroen Oct 06 '12 at 17:00
-1

Try this:

$string = preg_replace( '/.*<!--(.|\s)*?-->.*/' , ''    , $string );

Some regex implementations will execute your regular expression like this: /^<!--(.|\s)*?-->$/. So your expression may behave differently on different servers.