5

When I execute the following code, I get a segmentation fault every time. Is this a known bug? How can I make this code work?

<?php
$doc = file_get_contents("http://prairieprogressive.com/");
$replace = array(
    "/<script([\s\S])*?<\/ ?script>/",
    "/<style([\s\S])*?<\/ ?style>/",
    "/<!--([\s\S])*?-->/",
    "/\r\n/"
);
$doc = preg_replace($replace,"",$doc);
echo $doc;
?>

The error (obviously) looks like:

[root@localhost 2.0]# php test.php
Segmentation fault (core dumped)
KeatsKelleher
  • Have you ever thought of using a [proper HTML parser](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed)? – Gumbo Nov 11 '10 at 18:13
  • Just as a note I think you are missing the `>` after the script and style tags. – GWW Nov 11 '10 at 18:14
  • Show us the actual error. If you're getting a segfault it's likely an issue with your PHP installation. Or a bug. Either way, follow @Gumbo's advice and use an HTML parser. – Cfreak Nov 11 '10 at 18:17
  • @Gumbo This is part of the pre-processing to clean up typically troublesome tags before the page is parsed by DOMDocument – KeatsKelleher Nov 11 '10 at 18:17
  • @Cfreak : I think it is a bug, I'm reproducing it on my laptop – greg0ire Nov 11 '10 at 18:18
  • @akellehe: It seems that doing this is more troublesome. :) – Gumbo Nov 11 '10 at 18:19
  • Haha... troublesome indeed... it seems the problem lies in the style regex: $doc = preg_replace("/ – KeatsKelleher Nov 11 '10 at 18:20
  • Which version of PHP are you seeing this bug in? In 5.3.x it's not there. – Viper_Sb Nov 11 '10 at 18:27
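
For reference, the parser-only route suggested in the comments could look roughly like the sketch below. It is a minimal sketch, not the asker's code: the function name `strip_noise` and the sample markup are made up for illustration, and it assumes PHP 7+ with the DOM extension enabled.

```php
<?php
// Hypothetical helper: lets DOMDocument parse the page, then removes
// <script>, <style> and comment nodes instead of using regexes.
function strip_noise(string $html): string {
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//script | //style | //comment()');
    // Copy the node list first so removals cannot disturb the iteration.
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
    }
    return $dom->saveHTML();
}

// Made-up sample markup, used instead of fetching the live page.
$sample = '<html><head><style>p{color:red}</style></head>'
        . '<body><!-- note --><p>Hello</p><script>alert(1);</script></body></html>';
echo strip_noise($sample);
```

Whether this copes with the specific tags that trip up DOMDocument on the page in question is not something the thread confirms.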

5 Answers

2

You have unnecessary capture groups that strain PCRE's backtracking. Try this:

$replace = array(
    // .*? skips the rest of the opening tag, then the element body
    "/<script.*?>.*?<\/\s?script>/s",
    "/<style.*?>.*?<\/\s?style>/s",
    "/<!--.*?-->/s",
    "/\r\n/s"
);

Another thing: [\s\S] (whitespace or non-whitespace) matches any character at all. So just use the . pattern together with the s modifier, which lets . match newlines too.
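
Here is a quick offline check of patterns along these lines, run on a made-up multi-line sample rather than the live page. The helper name `strip_blocks` and the sample string are invented for illustration; treat it as a sketch, assuming PHP 7.1+ for the type hints.

```php
<?php
// Hypothetical helper: removes script blocks, style blocks, HTML comments
// and CRLF pairs using non-capturing, lazy patterns with the s modifier.
function strip_blocks(string $html): ?string {
    $patterns = array(
        '/<script.*?<\/\s?script>/s',
        '/<style.*?<\/\s?style>/s',
        '/<!--.*?-->/s',
        "/\r\n/",
    );
    // preg_replace returns NULL on a PCRE error, hence the nullable type.
    return preg_replace($patterns, '', $html);
}

$sample = "<p>a</p>\r\n<script type=\"text/javascript\">\nvar x = 1;\n</script>"
        . "<style>\np { color: red; }\n</style><!-- multi\nline --><p>b</p>";

echo strip_blocks($sample); // prints <p>a</p><p>b</p>
```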

bcosca
1

OK! It seems the issue is with where the quantifier sits relative to the () group...

When I use

$doc = preg_replace("/<style([\s\S]*)<\/ ?style>/",'',$doc);

instead of

$doc = preg_replace("/<style([\s\S])*<\/ ?style>/",'',$doc);

it works!!
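
A likely explanation, as far as I can tell: in `([\s\S])*` the capturing group itself is repeated once per character, and older PCRE builds handled that with one recursive call per repetition, so a long style block could exhaust the stack. In `([\s\S]*)` the group is entered only once. Note that the working version above is also greedy, so it runs to the last closing style tag. The sketch below keeps the group but makes the inner quantifier lazy, and checks the result instead of echoing it blindly. The helper name `safe_replace` and the sample input are made up, and a check like this can only catch errors PCRE reports, not an actual segfault.

```php
<?php
// Hypothetical helper: run preg_replace and surface PCRE failures
// (for example a backtrack limit error) instead of silently using NULL.
function safe_replace(string $pattern, string $subject): string {
    $result = preg_replace($pattern, '', $subject);
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

// Made-up input: a style block spanning many lines.
$html = "<style>\n" . str_repeat("p { margin: 0; }\n", 50) . "</style><p>kept</p>";

// The group is entered once; the lazy quantifier stops at the first </style>.
echo safe_replace('/<style([\s\S]*?)<\/ ?style>/', $html); // prints <p>kept</p>
```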

KeatsKelleher
1

This seems to be a bug.

As mentioned by you in the comment, it is the style regex that is causing this. As a workaround you can use the s modifier so that . matches even the newline:

$doc = preg_replace("/<style.*?<\/ ?style>/s",'',$doc);
codaddict
0

Try this (added the u modifier for Unicode and changed ([\s\S])*? to .*?; the s modifier is needed as well so that . also matches newlines):

<?php
$doc = file_get_contents("http://prairieprogressive.com/");
$replace = array(
    "#<script.*?</ ?script>#su",
    '#<style.*?</ ?style>#su',
    "#<!--.*?-->#su",
    "#\r\n#u"
);
$doc = preg_replace($replace,"",$doc);
echo $doc;
?>
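
One caveat with the u modifier: the subject has to be valid UTF-8, otherwise preg_replace returns NULL and preg_last_error() reports PREG_BAD_UTF8_ERROR. A small sketch with a deliberately invalid byte (both sample strings are made up):

```php
<?php
$valid   = "<p>caf\xC3\xA9</p><!-- note -->"; // "é" encoded as UTF-8
$invalid = "<p>caf\xE9</p><!-- note -->";     // "é" as a lone Latin-1 byte

$ok = preg_replace('#<!--.*?-->#su', '', $valid);
var_dump($ok === "<p>caf\xC3\xA9</p>");              // bool(true)

$bad = preg_replace('#<!--.*?-->#su', '', $invalid);
var_dump($bad === null);                             // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

So if the fetched page is not UTF-8, echoing the result would print nothing; checking for NULL first makes that failure visible.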
0

What is the point of [\s\S]? It matches any whitespace character or any non-whitespace character, which is to say any character at all. If you replace it with .*, it works just fine.

EDIT: If you want to match new lines too, use the s modifier. In my opinion, it is easier to understand than a contradictory [\s\S].
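
To make the difference concrete, here is a tiny comparison on a made-up two-line comment:

```php
<?php
$text = "<!-- first line\nsecond line -->";

// Without s, the dot stops at the newline, so the comment is not matched.
var_dump(preg_match('/<!--.*?-->/', $text));      // int(0)

// With s, the dot matches newlines as well.
var_dump(preg_match('/<!--.*?-->/s', $text));     // int(1)

// [\s\S] matches newlines regardless of modifiers.
var_dump(preg_match('/<!--[\s\S]*?-->/', $text)); // int(1)
```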

netcoder