Wrong output when using array indexing on UTF-8 string

Question

I have encountered a problem when using a UTF-8 string. I want to read a single character from the string, for example:

$string = "üÜöÖäÄ";
echo $string[0];

I am expecting to see ü, but I get � -- why?

Jon · Answer 1 · 2011-06-11T19:35:54.747

Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.

In your code, the expression $string[0] returns the first byte of the UTF-8 encoded representation of your string, because PHP strings are effectively arrays of bytes (PHP does not internally recognize encodings).

Since the first character in your string is encoded as more than one byte (in UTF-8, ü is the two bytes 0xC3 0xBC), you are only getting part of the character. Furthermore, under the UTF-8 encoding rules the byte you are retrieving is not valid as a character on its own, which is why you see the replacement character (�) instead of ü.

mb_substr knows the encoding rules, so it will not naively give you back just one byte; it will get as many as needed to encode the first character.
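To make the byte-level difference visible, here is a minimal sketch using bin2hex. It assumes the mbstring extension is loaded, and it writes the string with explicit \x escapes so the result does not depend on how the source file is saved (that choice is an assumption of this sketch, not part of the original code):

```php
<?php
// "üÜ" spelled out as UTF-8 bytes: ü = 0xC3 0xBC, Ü = 0xC3 0x9C.
$string = "\xC3\xBC\xC3\x9C";

// Array indexing returns a single byte: the lead byte of "ü".
$firstByte = $string[0];
echo bin2hex($firstByte), "\n";            // c3

// mb_substr returns every byte that makes up the first character.
$firstChar = mb_substr($string, 0, 1, 'UTF-8');
echo bin2hex($firstChar), "\n";            // c3bc

// The lone lead byte is not valid UTF-8 by itself; the full character is.
var_dump(mb_check_encoding($firstByte, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($firstChar, 'UTF-8')); // bool(true)
```

The lone 0xC3 byte is what the browser or terminal cannot render, so it shows � in its place.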

You can see that $string[0] gives you back just one byte with:

$string = "üÜöÖäÄ";
echo strlen($string[0]); // prints 1

While mb_substr gives you back two bytes:

$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8')); // prints 2

And these two bytes are in fact just one character (use mb_strlen, which counts characters rather than bytes, to see this):

$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8');

Finally, as Marwelln points out below, the situation becomes more tolerable if you use mb_internal_encoding to get rid of the 'utf-8' redundancy:

$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
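With the internal encoding set, mb_substr($string, $i, 1) can stand in for $string[$i] when you need the character at a given position. The loop below is only a sketch of that idea, assuming the mbstring extension is available; the variable names are illustrative, and the string is again written with \x escapes so the example does not depend on the file's encoding:

```php
<?php
mb_internal_encoding('UTF-8');

// "üÜö" as explicit UTF-8 bytes (two bytes per character).
$string = "\xC3\xBC\xC3\x9C\xC3\xB6";

$chars = [];
$length = mb_strlen($string);          // 3 characters, although strlen() reports 6 bytes
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($string, $i, 1); // character-aware equivalent of $string[$i]
}

echo implode(' ', $chars), "\n";       // ü Ü ö
```

Each mb_substr call has to scan the string from the start to find position $i, so for long strings it may be cheaper to split once, for example with preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY), and index into the resulting array.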


I recommend using http://www.php.net/manual/en/function.mb-internal-encoding.php so you won't need to specify 'utf-8' in every `mb_` function. — Marwelln, Jun 11 '11 at 18:41
If you need to iterate over a UTF-8 encoded string, also have a look here: http://stackoverflow.com/questions/3666306/how-to-iterate-utf-8-string-in-php — Stano, Jul 11 '13 at 11:54
This is something I've never read in any UTF-8 migration guide. Is it because nobody (except us… well, not me!) uses array indexing on (single-byte) strings? It is a pretty big problem for me, and I don't understand why all migration guides choose to ignore it, just to make it look like the migration is going to be painless. — sylbru, Jan 11 '18 at 22:14
