mb_strpos vs strpos, what's the difference?

Question

Yes, I know: we should use the mb_* functions when working with multibyte characters. But what about strpos? Take a look at this code (saved as UTF-8):

var_dump(strpos("My symbol utf-8 is the €.", "\xE2\x82\xAC")); // int(23)

Is there any difference when using mb_strpos? Doesn't strpos do the same job? After all, doesn't strpos search for a string (of multiple bytes)? Is there a reason to use mb_strpos instead of strpos?

This might help :: http://stackoverflow.com/questions/5712226/when-should-i-use-mb-strpos-over-strpos — Sudhir Bastakoti, Dec 17 '12 at 11:41

Esailija · Accepted Answer · 2012-12-17T12:09:54.927

For UTF-8, matching the byte sequence is exactly the same as matching the character sequence.

So both will find the needle at exactly the same point, but mb_strpos counts complete UTF-8 characters before the needle, whereas strpos counts individual bytes. So if your string contained another multi-byte UTF-8 sequence before the needle, the results would be different:

strpos("My symbolö utf-8 is the €.", "€") !== mb_strpos("My symbolö utf-8 is the €.", "€", 0, "UTF-8")

But:

strpos("My symbol utf-8 is the €.", "€") === mb_strpos("My symbol utf-8 is the €.", "€", 0, "UTF-8")
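A practical consequence is that each offset should only be passed to functions from the same family. The snippet below is a minimal sketch of that, assuming the mbstring extension is loaded; the variable names are illustrative, and the non-ASCII characters are written as `\u{...}` escapes (PHP 7+) so the result does not depend on how the file is saved.

```php
<?php
// "My symbolö utf-8 is the €." with ö and € written as Unicode escapes.
$haystack = "My symbol\u{00F6} utf-8 is the \u{20AC}.";
$needle   = "\u{20AC}";

$bytePos = strpos($haystack, $needle);                // offset in bytes
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8'); // offset in characters

var_dump($bytePos); // int(25): ö occupies two bytes
var_dump($charPos); // int(24): ö counts as one character

// Byte offsets go with substr(), character offsets with mb_substr().
$byteTail = substr($haystack, $bytePos);
$charTail = mb_substr($haystack, $charPos, null, 'UTF-8');
var_dump($byteTail === $charTail); // bool(true): both yield "€."
```

Mixing the two families, for example passing the mb_strpos result to substr(), would cut the string one byte too early here.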

score 8 · Answer 2 · answered Dec 17 '12 at 11:40

Depending on the character set in use and the string being searched for, this may or may not make a difference.

strpos() looks for the byte sequence that is passed as the needle.

mb_strpos() does the same thing but it also respects character boundaries.

So strpos() will match if the byte sequence occurs anywhere in the string. mb_strpos() will only match if the byte sequence also represents a valid set of complete characters.
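One case where the character set matters is an encoding in which a byte of a multi-byte character can equal an ASCII byte. The sketch below uses Shift_JIS as an assumed example (it is not mentioned in the question): the katakana "ソ" is encoded as the bytes 0x83 0x5C, and 0x5C on its own is a backslash. It assumes mbstring is loaded and uses byte escapes so the file's own encoding is irrelevant.

```php
<?php
$haystack = "\x83\x5C"; // the single character "ソ" in Shift_JIS
$needle   = "\x5C";     // a backslash

$byteResult = strpos($haystack, $needle);               // byte-wise search
$charResult = mb_strpos($haystack, $needle, 0, 'SJIS'); // character-wise search

var_dump($byteResult); // int(1): matches the second byte of "ソ"
var_dump($charResult); // bool(false): the string contains no backslash character
```

UTF-8 does not have this problem for needles made of complete characters, because lead bytes and continuation bytes never overlap with ASCII or with each other.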

answered Dec 17 '12 at 11:40

DaveRandom


That's not true: `mb_strpos()` will also find an invalid sequence of characters. The only differences are the return values and arguments (counted in characters instead of bytes); invalid characters are handled just the same. – Danon Sep 20 '21 at 12:18

Jsowa · Answer 3 · 2020-10-02T17:40:26.283

I don't find the above examples fully transparent, and some users may be confused.

The mb_* functions (the mbstring extension) should be used for multi-byte encodings; what a multi-byte encoding is has been explained in other questions, e.g. here.

Nowadays we mostly use Unicode encodings such as UTF-8 (as in this example) or UTF-16, which are multi-byte encodings. However, text is often limited to ASCII characters (e.g. English), and for such strings strpos and mb_strpos return identical results.

The difference becomes visible when the string contains multi-byte characters, e.g. Chinese characters.

echo mb_internal_encoding(); //UTF-8

echo strpos('我在买绿茶', '在'); //3

echo mb_strpos('我在买绿茶', '在'); //1

So this obviously applies to Chinese characters, but also to emoji, which some people are not aware of.

To give a wider view of how this works, here is the length of the same string measured with strlen() and mb_strlen():

echo strlen('我在买绿茶'); //15

echo mb_strlen('我在买绿茶'); //5
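The same effect can be sketched with an emoji. The string below is an illustrative assumption, not taken from the question; the emoji is written as a `\u{...}` escape (PHP 7+), and the encoding is passed explicitly instead of relying on mb_internal_encoding().

```php
<?php
$text = "I \u{1F44D} PHP"; // U+1F44D (thumbs up) takes 4 bytes in UTF-8

$byteLen = strlen($text);                       // length in bytes
$charLen = mb_strlen($text, 'UTF-8');           // length in characters
$bytePos = strpos($text, 'PHP');                // byte offset of "PHP"
$charPos = mb_strpos($text, 'PHP', 0, 'UTF-8'); // character offset of "PHP"

var_dump($byteLen); // int(10)
var_dump($charLen); // int(7)
var_dump($bytePos); // int(7)
var_dump($charPos); // int(4)
```

Even though the needle is plain ASCII, the offsets differ because a multi-byte character precedes it.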
