20

I had criticized an answer that suggested preg_match over === when finding substring offsets in order to avoid type mismatch.

However, later on the answer's author has discovered that preg_match is actually significantly faster than multi-byte operating mb_strpos. Normal strpos is faster than both functions but of course, cannot deal with multibyte strings.

I understand that mb_strpos needs to do something more than strpos. However, if regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?

I have strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than its native functions?

mb_strpos($str, "颜色", 0 ,"GBK"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)

Functions were run 106 times. The absolute time(s) accounts for the sum of time of 106 runs of a function, rather than average for one.

The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).

Note: According to one of the commentators (and common sense), preg_match also does not use multi-byte when comparing, being subject to same risk of errors as strpos.

Community
  • 1
  • 1
Tomáš Zato
  • 50,171
  • 52
  • 268
  • 778
  • 3
    Your regex does not use the Unicode `/u` flag, so likely just does a binary comparison. – mario Jun 21 '14 at 18:24
  • I don't quite get this. If dumb `preg_match` (without the `u` modifier) works then plain old `strpos` **must** also work (and obviously will be faster). Please clarify. – Jon Jun 21 '14 at 18:24
  • @Jon They may use different search algorithms. I think `strpos` uses the ‘dumbest’ while PCRE may use a smarter one. – Gumbo Jun 21 '14 at 18:28
  • 5
    @Jon - strops() ___can___ find multibyte sequences, but it may also return false positives for multibyte sequences because it doesn't recognise character boundaries for multibyte characters, it only works on byte sequences – Mark Baker Jun 21 '14 at 18:29
  • Note that performance of preg_match will depend on the underlying version of pcre – Mark Baker Jun 21 '14 at 18:30
  • @MarkBaker: I 'm aware of that, but the OP evidently is not. And as you undoubtedly know, in the most common case where we 're talking UTF-8 the possibility for false positives does not exist. – Jon Jun 21 '14 at 18:30
  • @TomášZato , can you repeat the test and report times using a much longer haystack with the same needle in the middle? That way it would be more apparent whether the problem is mostly in the setup code (I'm seeing a *lot* of initialization in "ext/mbstring/libmbfl/mbfl/mbfilter.c" that *looks* expensive, I'll valgrind it later to see whether it really is - or if it's the search loop. – LSerni Jun 21 '14 at 18:31
  • @TomášZato: `preg_match` has *exactly* the same pitfalls as `strpos`. It makes absolutely no sense to consider `preg_match` a solution to the problem and bar `strpos` from also being one. Also, look at this: http://ideone.com/ULItqd – Jon Jun 21 '14 at 18:32
  • Check the tests of all three methods, `strpos`, `mb_strpos` and `preg_match` in this code. `strpos`and `preg_match` are fairly equal, but still doesn’t explain why `mb_strpos` is needed at all then? http://sandbox.onlinephpfunctions.com/code/f2394f2a04b99919895c90d85d6a758e1dc5110c – Giacomo1968 Jun 21 '14 at 18:33
  • Also, simply using strpos() to test if a particular sequence exists in a multibyte string is one thing, but using it to get the _character_ offset is another matter entirely. strops() will return the byte offset, mb_strops() will return the character offset – Mark Baker Jun 21 '14 at 18:35
  • 2
    One thing that `mb_strpos` does that you cannot get elsewhere is tell you the *character* offset of the substring regardless of what the input encoding is. – Jon Jun 21 '14 at 18:35
  • @MarkBaker: Sure (we were typing concurrently it seems). But `preg_match` is considered a solution here, and it doesn't tell you the character offset either. – Jon Jun 21 '14 at 18:36
  • @Jon Using the PREG_OFFSET_CAPTURE flag would give you the offset, but I'm not sure how much of an overhead that would add – Mark Baker Jun 21 '14 at 18:37
  • @TomášZato It's also worth noting that multiple runs with PCRE will give you a result that is not entirely representative of the reality. PHP caches compiled regexes, meaning that on each iteration after the first there is substantially less work for PCRE to do. `mb_strpos()` on the other hand always has a lot more work to do because it always pays attention to character sets (meaning it has to find the correct tables etc) and it actually checks character boundaries. PCRE, without unicode input and `/u`, is just doing byte comparisons, optimised to essentially the same logic as `strpos()`. – DaveRandom Jun 21 '14 at 18:54
  • @DaveRandom So to sum up: Because `mb_` functions need to actually parse the final multibyte string info—including actual character position & not just raw Unicode data—that adds additional overhead to the process. So for a basic/simple string comparison like with `strpos` or `preg_match` (without the `u` modifier) it’s just checking raw Unicode so that accounts for the speed difference. – Giacomo1968 Jun 21 '14 at 19:10
  • @JakeGould well it's not even checking unicode. If the haystack is utf8 and the needle is utf16, it won't find it, because it's just doing a byte-by-byte comparison. It's also not the case the PCRE is "simple", but an expr that is just `/literal/` with no complicated regexp things (repetition, char classes, etc etc) it will be optimised to be essentially the same as calling `strpos()` with a little bit of overhead (actually, the regex compile step is pretty expensive, but since it's cached that's not really relevant in a 10^6 loop...) – DaveRandom Jun 21 '14 at 19:19

4 Answers4

22

To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.

strpos

If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.

Knowing this, we can say that strpos does only search for a byte sequence in a byte sequence using a naive search algorithm.

mb_strpos

This function is the multi-byte counterpart to strpos. This makes searching a little more complex as you can’t just look at the bytes without knowing to which character they belong to.

mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memstr, it’s like 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).

We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.

First we have two setup loops and then there is the loop that proceeds the pointer according to the given offset where you can see that they aware of actual characters and how they skip whole encoded UTF-8 characters: since UTF-8 is a variable-width character encoding where the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.

Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.

So now we know that mb_strpos

  • converts the strings to UTF-8, if necessary
  • is aware of actual characters
  • uses the Boyer–Moore search algorithm

preg_match

As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.

Gumbo
  • 643,351
  • 109
  • 780
  • 844
  • Very nice write-up. One thing I wondered about (and didn't put in my simple answer) is if mb does also unicode normalization when doing an unicode search. But I imagined that would probably too much for that lib so I skipped it. - regarding the search algorithm, does it mean that whith a much larger haystack, the relative times would shift, towards mb_strpos, right? – hakre Jun 21 '14 at 22:35
  • @hakre: I seem to remember that the BM algorithm works better with longer needles, made up of relatively few characters in respect to the haystack. Even better if the haystack contains a lot of characters that don't occur in the needle. Then BM can run onwards in leaps and bounds, way faster than naive byte compare. – LSerni Jun 22 '14 at 12:57
  • Terrific explanation... I'm glad I'm not the one having to program these functions! :) +1 – zx81 Jun 26 '14 at 04:33
12

I am leaving out preg_match to make the analysis more punctuated.

Taken your observation that mb_strpos is relatively slower compared to strpos, it leads you to the assumption that — because of the consumed time — mb_strpos does more than strpos.

I think this observation is correct.

You then asked what is that "more" that is causing the time difference.

I try to give a simple answer: That "more" is because strpos operates on binary strings (one character = 8 bit = 1 octet = 1 byte). mb_strpos operates on encoded character sequences (as nearly all of the mb_* functions do) which can be X bits, perhaps even in variable length per each character.

As this is always about a specific character encoding, both the haystack as well as the needle string (probably) need to be first validated for that encoding, and then the whole operation to find the string position needs to be done in that specific character encoding.

That is translation work and — depending on encoding — also requires a specific search algorithm.

Next to that the mb extension also needs to have some structures in memory to organize the different character encodings, be it translation tables and/or specific algorithms. See the extra parameter you inject — the name of the encoding for example.

That is by far more work than just doing simple byte-by-byte comparisons.

For example the GBK character encoding is pretty interesting when you need to encode or decode a certain character. The mb string function in this case needs to take all these specifics into account to find out if and at which position the character is. As PHP only has binary strings in the userland from which you would call that function, the whole work needs to be done on each single function call.

To illustrate this even more, if you look through the list of supported encodings (mb_list_encodings), you can also find some like BASE64, UUENCODE, HTML-ENTITIES and Quoted-Printable. As you might imagine, all these are handled differently.

For example a single numeric HTML entity can be up to 1024 bytes large, if not even larger. An extreme example I know and love is this one. However, for that encoding, it has to be handled by the mb_strpos algorithm.

Amal Murali
  • 75,622
  • 18
  • 128
  • 150
hakre
  • 193,403
  • 52
  • 435
  • 836
  • 1
    OK, it *does more*, we knew that. But why is PCRE apparently so much faster for the apparently same task? – deceze Jun 21 '14 at 19:42
  • 1
    @deceze: PCRE in PHP knows only two encodings: the one of strpos (where it is slower) and UTF-8. For UTF-8 a lot of optimized code exists. That is far beyond what the mb extension offers as encodings, the linked manual page counts 60 different. – hakre Jun 21 '14 at 19:45
  • 1
    @deceze: Also both are not doing the same even when they both use UTF-8 encoding: mb_strpos returns the numeric character position, preg_match returns the byte-offset (if you enable offset returning). That also shows that internally mb_strpos has to do things differently, e.g. managing character positions next to byte offsets. I don't find it that hard to imagine that these things might take some times - and also the quality of the library might not be as precise as pcre which is pretty widely used. – hakre Jun 21 '14 at 19:57
  • @AmalMurali: Thanks for the touch-up and comment :) – hakre Jun 21 '14 at 22:30
8

Reason of slowness

Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos (or, I guess, most of the mb_* family) gets called:

Unless haystack is in the default format, encode it to the default format:

if (haystack->no_encoding != mbfl_no_encoding_utf8) {
        mbfl_string_init(&_haystack_u8);
        haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
        if (haystack_u8 == NULL) {
                result = -4;
                goto out;
        }
} else {
        haystack_u8 = haystack;
}

Unless needle is in the default format, encode it to the default format:

if (needle->no_encoding != mbfl_no_encoding_utf8) {
        mbfl_string_init(&_needle_u8);
        needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
        if (needle_u8 == NULL) {
                result = -4;
                goto out;
        }
} else {
        needle_u8 = needle;
}

According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:

218,552,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]

which appears to be consistent with the OP's timings of mb_strpos versus strpos.

Encoding not considered, mb_strpos'ing a string is exactly the same of strpos'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then, you would get a delay by a factor of four, not by a factor of twenty. The additional 5-6X slowdown arises from encoding times.

Accelerating mb_strpos...

So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl* do conversion and compare, which is mbfl_no_encoding_utf8 (UTF-8):

  • Keep your data in UTF-8.
  • Convert user input to UTF-8 as soon as practical.
  • Convert, if necessary, back to client encoding if needed.

Then your pseudo-code:

$haystack = "...";
$needle   = "...";

$res = mb_strpos($haystack, $needle, 0, $Encoding);

becomes:

$haystack = "...";
$needle   = "...";

mb_internal_encoding('UTF-8') or die("Cannot set encoding");
$haystack   = mb_convert_encoding($haystack, 'UTF-8' [, $SourceEncoding]);
$needle     = mb_convert_encoding($needle, 'UTF-8', [, $SourceEncoding]);

$res = mb_strpos($haystack, $needle, 0);

...when it's worth it

Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_* function.

Community
  • 1
  • 1
LSerni
  • 55,617
  • 10
  • 65
  • 107
-1

The problems with mb_ performance may be caused by a messed php-mbstring package installation (on a linux). Installing it explicitly for the exact version of php installation helped me.

sudo apt-get install php7.1-mbstring

...

Before: Time: 16.17 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
After:  Time:  1.81 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
Klesun
  • 12,280
  • 5
  • 59
  • 52