
I am looking for a better way to code plugins for my web proxy. Each plugin parses the HTML of the page the user requested, strips out useless content (ads, annoying JS, etc.), and serves the page to the user.

The stripping is done with preg_replace and regular expressions. Yes, I am aware that DOMDocument is recommended over regex, but preg_replace is faster. Performance is paramount here, as I need to serve the user as quickly as possible to free up system resources.

Here is an example of a typical preg_replace statement:

$input = preg_replace('#<div id="above-related".*?</div>#s', '', $input);

A typical plugin might contain 4-15 preg_replace statements.

How can I optimize the part that strips out the useless content?

user2650277

1 Answer


You can speed up matching by reducing the number of regular expressions you run, the complexity of each expression, and the size of the input each one has to scan.
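As a rough illustration of the first point, several rules that share a shape can sometimes be folded into one pattern so the page is scanned once instead of once per rule. This is only a sketch: the id "sidebar-ads" is a made-up second rule, not something from the question.

```php
<?php
// Only "above-related" comes from the question; "sidebar-ads" is a
// hypothetical second rule invented for this illustration.
$input = '<p>keep</p><div id="above-related">x</div><div id="sidebar-ads">y</div><p>also keep</p>';

// Before: one full pass over the page per rule.
$separate = preg_replace('#<div id="above-related".*?</div>#s', '', $input);
$separate = preg_replace('#<div id="sidebar-ads".*?</div>#s', '', $separate);

// After: a single pass, using alternation on the id.
$merged = preg_replace('#<div id="(?:above-related|sidebar-ads)".*?</div>#s', '', $input);

var_dump($separate === $merged); // true when both approaches strip the same blocks
```

Whether the merged pattern is actually faster depends on the rules and the page, so it is worth timing both versions before committing to one.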

For instance, take the pattern from your example: '#<div id="above-related".*?</div>#s'

You can reduce the size of the input by using strpos and substr:

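$input = "<html>..</html>";
$offset = 0;
$needle = '<div id="above-related"';
// strpos takes (haystack, needle, offset) and returns false when nothing is
// found; compare with !== false because a match at position 0 is also falsy
while (($start = strpos($input, $needle, $offset)) !== false) {
    $end = strpos($input, "</div>", $start);
    if ($end === false) {
        break; // no closing tag: leave the rest of the page untouched
    }
    $end += strlen("</div>"); // include the closing tag in the slice
    $substr = substr($input, $start, $end - $start); // third argument is a length
    $result = preg_replace('#<div id="above-related".*?</div>#s', '', $substr);
    // stitch the input back together:
    $input = substr($input, 0, $start) . $result . substr($input, $end);
    $offset = $start + strlen($result); // continue after the replaced slice
}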

In your example the replacement doesn't use anything from the match, so the regex can be dropped entirely in favor of a straight cut:

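$input = "<html>..</html>";
$offset = 0;
$match_start = '<div id="above-related"';
$match_end = '</div>';
// strpos takes (haystack, needle, offset); !== false also handles a match at 0
while (($start = strpos($input, $match_start, $offset)) !== false) {
    $end = strpos($input, $match_end, $start);
    if ($end === false) {
        break; // no closing tag: stop rather than cut to the end of the page
    }
    // drop everything from the opening tag through the closing tag
    $input = substr($input, 0, $start) . substr($input, $end + strlen($match_end));
    $offset = $start; // the text after the cut now begins at $start
}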

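The trick here is that strpos and substr are much faster than preg_replace (easily 100x), because they do plain string searches and copies without involving the regex engine. The exact gain depends on the page and the pattern.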
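Since the size of the gap varies, it is worth measuring on your own pages. The following is a rough benchmark sketch under stated assumptions: the synthetic page, the iteration count, and the function names cut_with_regex and cut_with_strpos are all invented for illustration, and hrtime requires PHP 7.3 or later.

```php
<?php
// Synthetic page: filler around one target div. The sizes are arbitrary.
$filler = str_repeat('<p>lorem ipsum</p>', 5000);
$page = $filler . '<div id="above-related">junk</div>' . $filler;

// Regex version, as in the question.
function cut_with_regex(string $html): string
{
    return preg_replace('#<div id="above-related".*?</div>#s', '', $html);
}

// Plain string version: find the opening marker, cut through the next </div>.
function cut_with_strpos(string $html): string
{
    $open = '<div id="above-related"';
    $close = '</div>';
    $offset = 0;
    while (($start = strpos($html, $open, $offset)) !== false) {
        $end = strpos($html, $close, $start);
        if ($end === false) {
            break; // unterminated block: leave the remainder alone
        }
        $html = substr($html, 0, $start) . substr($html, $end + strlen($close));
        $offset = $start;
    }
    return $html;
}

foreach (['cut_with_regex', 'cut_with_strpos'] as $fn) {
    $t0 = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < 200; $i++) {
        $out = $fn($page);
    }
    printf("%-16s %.2f ms\n", $fn, (hrtime(true) - $t0) / 1e6);
}
```

Timings from a loop like this are only indicative; real pages, opcode caching, and PCRE's JIT can all shift the ratio.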

If you can find a non-regex way to match, or even a non-regex replacement strategy, for each rule, then you should see a significant speedup.
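One way this could look in a plugin is a small rule table driven by a single helper. This is a minimal sketch, not a definitive implementation: strip_between, the $rules layout, and the ad-start/ad-end comment markers are hypothetical names introduced here.

```php
<?php
/**
 * Remove every span that starts with $open and ends at the next $close.
 * Hypothetical helper; not part of PHP or of the code above.
 */
function strip_between(string $html, string $open, string $close): string
{
    $offset = 0;
    while (($start = strpos($html, $open, $offset)) !== false) {
        $end = strpos($html, $close, $start + strlen($open));
        if ($end === false) {
            break; // unterminated block: leave the remainder alone
        }
        $html = substr($html, 0, $start) . substr($html, $end + strlen($close));
        $offset = $start; // resume where the cut was made
    }
    return $html;
}

// Hypothetical plugin rule table: [opening marker, closing marker].
$rules = [
    ['<div id="above-related"', '</div>'],
    ['<!-- ad-start -->', '<!-- ad-end -->'],
];

$page = '<p>a</p><div id="above-related">x</div><!-- ad-start -->ad<!-- ad-end --><p>b</p>';
foreach ($rules as [$open, $close]) {
    $page = strip_between($page, $open, $close);
}
echo $page, "\n"; // prints <p>a</p><p>b</p>
```

Like the lazy `.*?</div>` pattern it replaces, the helper stops at the first closing marker, so a block containing nested divs would be cut short in both approaches.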

Halcyon