Reason of slowness
Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos
(or, I guess, most of the mb_*
family) gets called:
Unless haystack is in the default format, encode it to the default format:
if (haystack->no_encoding != mbfl_no_encoding_utf8) {
mbfl_string_init(&_haystack_u8);
haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
if (haystack_u8 == NULL) {
result = -4;
goto out;
}
} else {
haystack_u8 = haystack;
}
Unless needle is in the default format, encode it to the default format:
if (needle->no_encoding != mbfl_no_encoding_utf8) {
mbfl_string_init(&_needle_u8);
needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
if (needle_u8 == NULL) {
result = -4;
goto out;
}
} else {
needle_u8 = needle;
}
According to a quick check with valgrind
, the encoding conversion accounts for a huge part of mb_strpos
's runtime, about 84% of the total, or five-sixths:
218,552,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]
which appears to be consistent with the OP's timings of mb_strpos
versus strpos
.
Encoding not considered, mb_strpos
'ing a string is exactly the same of strpos
'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then, you would get a delay by a factor of four, not by a factor of twenty. The additional 5-6X slowdown arises from encoding times.
Accelerating mb_strpos
...
So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl*
do conversion and compare, which is mbfl_no_encoding_utf8
(UTF-8):
- Keep your data in UTF-8.
- Convert user input to UTF-8 as soon as practical.
- Convert, if necessary, back to client encoding if needed.
Then your pseudo-code:
$haystack = "...";
$needle = "...";
$res = mb_strpos($haystack, $needle, 0, $Encoding);
becomes:
$haystack = "...";
$needle = "...";
mb_internal_encoding('UTF-8') or die("Cannot set encoding");
$haystack = mb_convert_encoding($haystack, 'UTF-8' [, $SourceEncoding]);
$needle = mb_convert_encoding($needle, 'UTF-8', [, $SourceEncoding]);
$res = mb_strpos($haystack, $needle, 0);
...when it's worth it
Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_*
function.