1

Say we use this preg_replace on millions of post strings:

function makeClickableLinks($s) {
    return preg_replace('@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1" target="_blank">$1</a>', $s);
}

Assume that only 10% of all the posts contain links. Would it be faster to check strpos($string, 'http') !== false before calling preg_replace()? If so, why? Doesn't preg_replace() perform some pretests internally?

Drakes
mgutt

2 Answers

5

Surprisingly, yes!

Here are benchmarks for you to analyze on 10,000,000 strings with both functions:

Test 1 - String that matches the pattern:

"Here is a great new site to visit at http://example.com so go there now!"

preg_replace alone took 10.9626309872 seconds
strpos before preg_replace took 12.6124269962 seconds ← slower

Test 2 - String that doesn't match the pattern:

"Here is a great new site to visit at ftp://example.com so go there now!"

preg_replace alone took 6.51636195183 seconds
strpos before preg_replace took 2.91205692291 seconds ← faster

Test 3 - 10% of the strings match the pattern:

"Here is a great new site to visit at ftp://example.com so go there now!" (90%)
"Here is a great new site to visit at http://example.com so go there now!" (10%)

preg_replace alone took 7.43295097351 seconds
strpos before preg_replace took 4.31978201866 seconds ← faster

It's just a simple benchmark on two strings, but there is a clear difference in speed.


Here is the test harness for the "10%" case:

<?php
$string1 = "Here is a great new site to visit at http://example.com so go there now!";
$string2 = "Here is a great new site to visit at ftp://example.com so go there now!";

function makeClickableLinks1($s) {
    return preg_replace('@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1" target="_blank">$1</a>', $s);
}

function makeClickableLinks2($s) {
    // Return the input unchanged (not null) when there is nothing to replace.
    return strpos($s, 'http') !== false ? preg_replace('@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@', '<a href="$1" target="_blank">$1</a>', $s) : $s;
}

/* Begin test harness */

$loops = 10000000;

function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

/* Test using only preg_replace */

$time_start = microtime_float();
for($i = 0; $i < $loops; $i++) {
    // Only 10% of strings will have "http"
    makeClickableLinks1($i % 10 ? $string2 : $string1);
}
$time_end = microtime_float();
$time = $time_end - $time_start;
echo "preg_replace alone took $time seconds<br/>";

/* Test using strpos before preg_replace */

$time_start = microtime_float();
for($i = 0; $i < $loops; $i++) {
    // Only 10% of strings will have "http"
    makeClickableLinks2($i % 10 ? $string2 : $string1);
}
$time_end = microtime_float();
$time = $time_end - $time_start;
echo "strpos before preg_replace took $time seconds<br/>";
?>
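
A more compact way to time the same comparison, as a rough sketch only. It assumes PHP 7.3 or later for hrtime(); the names timeIt(), plainReplace() and guardedReplace() and the reduced loop count are illustrative and not part of the harness above. The guarded variant returns its input unchanged when no "http" is found, so both functions produce identical output.

```php
<?php
// Hypothetical helper (not from the original harness): times $fn over $loops
// calls using the monotonic clock. Assumes PHP >= 7.3 for hrtime().
function timeIt(callable $fn, array $inputs, int $loops): float {
    $n = count($inputs);
    $start = hrtime(true);                 // nanoseconds
    for ($i = 0; $i < $loops; $i++) {
        $fn($inputs[$i % $n]);
    }
    return (hrtime(true) - $start) / 1e9;  // seconds
}

const LINK_PATTERN = '@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@';
const LINK_REPLACEMENT = '<a href="$1" target="_blank">$1</a>';

function plainReplace(string $s): string {
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace(LINK_PATTERN, LINK_REPLACEMENT, $s) ?? $s;
}

function guardedReplace(string $s): string {
    // Skip the regex entirely when the cheap substring test fails.
    return strpos($s, 'http') !== false
        ? (preg_replace(LINK_PATTERN, LINK_REPLACEMENT, $s) ?? $s)
        : $s;
}

// One matching string followed by nine non-matching ones: the 10% case.
$with = "Here is a great new site to visit at http://example.com so go there now!";
$without = "Here is a great new site to visit at ftp://example.com so go there now!";
$inputs = array_merge([$with], array_fill(0, 9, $without));

$loops = 100000; // far fewer than the original 10,000,000, to keep the run short
printf("preg_replace alone: %.4f s\n", timeIt('plainReplace', $inputs, $loops));
printf("strpos before preg_replace: %.4f s\n", timeIt('guardedReplace', $inputs, $loops));
```

Absolute timings will differ by PHP version and machine, so only the ratio between the two lines is worth comparing.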
Drakes
  • Thank you. A minor optimization could be to avoid the function call at all through `if (strpos($i % 10 ? $string2 : $string1, 'http') !== false) makeClickableLinks1($i % 10 ? $string2 : $string1);`. – mgutt Apr 19 '15 at 12:32
  • Hi, nice question by the way. You can't know ahead of time if a string has the "http". So the only way to proceed is to either _always_ use `strpos` in your function, or _never_ use it. These two cases are exactly what are being tested, and for clarity there is the 3rd case, but some `makeClickableLinks` function must always be called for a valid test. Hope that helps. – Drakes Apr 19 '15 at 12:39
  • For me it's a very important answer, as I have multiple `preg_replace()` calls, some with callbacks, to automatically generate links, lists, bbcodes, etc. in forum posts, and I know that most of them will never match the regex. The `makeClickableLinks` regex should be the fastest of all (maybe a very simple regex is so fast that `strpos()` is never needed?!), so the answer is that I will use `strpos()` in all cases. Do you think a test with `preg_replace_callback()` and the [best regex trick](http://www.rexegg.com/regex-best-trick.html) is worthwhile? – mgutt Apr 19 '15 at 13:02
  • Hmm... interesting link. I'm not sure how to apply that 'trick' since you want all http:// occurrences to be matched and none excluded. But, I ran my test again with the callback version, and the strpos version is about 2.2 times faster. I don't have a lot of time to explore this more, but here is a paste you can experiment with: http://pastebin.com/veJT85WD. I hope this answer gets reopened. You just need 3 more votes for that to happen. Fingers crossed! – Drakes Apr 19 '15 at 13:30
  • It's 10x slower to rely on the callback trick. Here it is: `function makeClickableLinks3($s) { return preg_replace_callback('@^(?:(?!http).)*$|(?:(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?))@', function ($m) { return !$m[1] ? $m[0] : '' . $m[2] . ''; }, $s); }`. So I'll definitely use `strpos()` ;) – mgutt Apr 19 '15 at 19:49
  • Hmm.. that's surprising indeed. Too bad we can't try PCRE's JIT compiler with PHP. I think I'll run a benchmark in a different language just to see what happens. – Lucas Trzesniewski Apr 21 '15 at 08:48
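
On the JIT remark: PHP 7.0 and later expose PCRE's JIT through the `pcre.jit` ini setting, which postdates this 2015 thread. A small check follows, assuming a PHP build where the setting may or may not exist; whether JIT changes the timings above has not been tested here.

```php
<?php
// ini_get() returns false when the setting does not exist (e.g. PHP 5.x),
// otherwise the current value as a string ("1" means enabled).
$jit = ini_get('pcre.jit');

if ($jit === false) {
    echo "pcre.jit is not available in this PHP build\n";
} else {
    // An empty string is how a disabled boolean setting can be reported.
    echo "pcre.jit = ", ($jit === '' ? '0' : $jit), "\n";
}
```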
-1

Yes, a simple search like strpos() is much faster than compiling and executing a regex, on top of the memory copying that the replacement itself requires. For hundreds or thousands of strings there's no point, but for millions (especially if only 10% of them contain http) it becomes worthwhile to do a simple search first.

Ultimately the only way to be 100% sure is by benchmarking it, but I would be fairly certain you are going to get some improvement using strpos() first.
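
As a minimal sketch of the "simple search first" idea, the guard can be wrapped in a small helper. The name guardedPregReplace() and its parameter order are hypothetical and not taken from the question. The guard is only safe when every possible match of the pattern contains the needle, which holds for the link pattern from the question because it starts with the literal text "http".

```php
<?php
// Hypothetical helper: run preg_replace() only if $needle occurs in $subject.
// The caller must ensure every possible match of $pattern contains $needle;
// otherwise the guard would wrongly skip strings that should be rewritten.
function guardedPregReplace(string $needle, string $pattern, string $replacement, string $subject): string {
    if (strpos($subject, $needle) === false) {
        return $subject; // nothing to do, return the input untouched
    }
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, $replacement, $subject) ?? $subject;
}

// Pattern and replacement copied from the question.
$pattern = '@(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-]*(\?\S+)?[^\.\s])?)?)@';
$replacement = '<a href="$1" target="_blank">$1</a>';

echo guardedPregReplace('http', $pattern, $replacement, "Visit http://example.com now"), "\n";
echo guardedPregReplace('http', $pattern, $replacement, "Visit ftp://example.com now"), "\n";
```

The same helper could front other replacements (BBCode tags, list markers) as long as each one has a fixed substring that must appear in any match.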

Peter Bowers
  • *I would be fairly certain* This doesn't sound very convincing; it sounds like you're not sure what you're talking about here! If you are sure, then you can prove it! Where is the *proof* for your answer? – Rizier123 Apr 19 '15 at 07:56
  • My thought is if someone asks an optimization question, it is their job to do the benchmarking. Do you really think all answers to optimization qs should be backed up with proof? I think that is an unrealistic expectation... – Peter Bowers Apr 19 '15 at 08:22
  • 1. It's not all about "benchmarking". 2. *it is their job to do the benchmarking* What are they asking for, then? An assumption?! 3. *Do you really think all answers to optimization qs should be backed up with proof?* YES! Every answer should include proof if possible, because then readers can understand it more easily and see directly how it works! – Rizier123 Apr 19 '15 at 08:25
  • Could you give me some examples of optimization answers which include "proof" (a strong word) without benchmarking? Optimization is so data dependent I'm just not seeing it... – Peter Bowers Apr 19 '15 at 09:00
  • Exactly, benchmarking depends on how and with which data you use it. So just give an example with an explanation of the differences, plus the benchmarks. Two examples: http://stackoverflow.com/a/3570604/3933332 and http://stackoverflow.com/a/186386/3933332 – Rizier123 Apr 19 '15 at 09:04
  • FYI, PCRE will start by searching whether `https` is in the string, and I think the pattern doesn't even need to be [studied](http://stackoverflow.com/questions/28589611/pcre-php-concrete-example-of-the-usage-and-utility-of-the-s-extra-analysis-of) for this to happen because it starts with fixed text. That's why I *think* using `strpos` is redundant and will slow down the processing. But only a *benchmark* can tell that for sure (after all, I don't know what PHP does on its own), so you shouldn't assert things like that in your answer unless you check them first. – Lucas Trzesniewski Apr 19 '15 at 10:02
  • Hi guys, I did a simple benchmark and posted the results and the code if you care to take a look. Surprisingly, it turns out @PeterBowers is correct. – Drakes Apr 19 '15 at 11:04