Why does a UTF-8 encoded string show broken symbols when I loop over it?

Question

I have a UTF-8 string with umlauts, and it displays correctly:

var_dump($content);

It returns "höst lanseras". But when I try this:

for ($i = 0; $i < strlen($content) - 1; $i++) {
    var_dump($content[$i]);
}

I get this:

string(1) "h"
string(1) "o"
string(1) "�"
string(1) "�"
string(1) "s"
string(1) "t"
string(1) " "
string(1) "l"
string(1) "a"
string(1) "n"
string(1) "s"
string(1) "e"
string(1) "r"
string(1) "a"
string(1) "s"

How can I get the umlaut character as a single element when iterating over the string?

UTF-8 uses more than one byte for some characters. Test whether `mb_strlen($content) == strlen($content)`. — ob_start, Dec 22 '15 at 22:49
PHP treats strings as a list of bytes, not characters. See [What every programmer absolutely, positively needs to know about encodings and character sets to work with text](http://kunststube.net/encoding/) — mario, Dec 22 '15 at 22:49
Notice 13 chars in your `var_dump($content);` output and 15 chars in your looped `var_dump($content[$i]);` output — RiggsFolly, Dec 22 '15 at 22:53

score 1 · Accepted Answer · answered Dec 23 '15 at 11:33

Within UTF-8, "ö" is encoded using more than one byte.
PHP strings are dumb byte arrays; PHP is not aware of "characters" or such at all.
Accessing string offsets using $str[x] accesses one specific byte; strlen reports the length in bytes, not "characters".

Put all this together and the result is that you're accessing individual bytes rather than characters, and in the case of "ö" that results in outputting half of a character/nonsensical bytes.

Use the mb_ functions (mb_strlen, mb_substr) to measure and access strings by character rather than by byte.
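
As a minimal sketch of that advice, the loop from the question could be rewritten roughly as follows. It assumes the mbstring extension is loaded and PHP 7 or later (for the `\u{...}` escape); the sample string and the `$chars` array are only illustrative, and the ö here is the precomposed code point U+00F6.

```php
<?php
// Illustrative input: "höst lanseras" with a precomposed ö (U+00F6),
// which takes two bytes in UTF-8.
$content = "h\u{00F6}st lanseras";

// mb_strlen() counts characters (code points), not bytes.
$length = mb_strlen($content, 'UTF-8');

// Collect one character per array element using mb_substr().
$chars = [];
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($content, $i, 1, 'UTF-8');
}

var_dump($chars[1]); // the complete "ö" as a single element
```

On PHP 7.4 and later, `mb_str_split($content, 1, 'UTF-8')` should build the same array in one call. One caveat: the dump in the question shows a plain "o" followed by two broken bytes, which suggests the original input may contain "o" plus a combining diaeresis (U+0308) rather than a precomposed "ö". If so, mb_substr returns those two code points separately, and the intl extension (`Normalizer::normalize` or the `grapheme_*` functions) would be needed to treat them as one visible character.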

score 0 · Answer 2 · answered Dec 23 '15 at 11:32

strlen() is single-byte:

strlen() returns the number of bytes rather than the number of characters in a string.

UTF-8 is not a single-byte encoding, so you need to use the multi-byte alternative: mb_strlen()

The same rule applies to almost all string manipulation.
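
To illustrate the difference, here is a small sketch comparing the byte-oriented functions with their mb_ counterparts. It assumes PHP 7 or later and the mbstring extension; the variable names are made up for the example.

```php
<?php
// Illustrative input: "höst lanseras" with a precomposed ö (U+00F6).
$content = "h\u{00F6}st lanseras";

$byteCount = strlen($content);             // counts bytes
$charCount = mb_strlen($content, 'UTF-8'); // counts characters (code points)

// substr() cuts by byte and can split the ö in the middle;
// mb_substr() cuts by character.
$byteCut = substr($content, 0, 2);
$charCut = mb_substr($content, 0, 2, 'UTF-8');

var_dump($byteCount, $charCount);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte-wise cut still valid UTF-8?
var_dump($charCut);
```

For this input, strlen reports 14 and mb_strlen reports 13. The byte-wise cut ends on the first byte of "ö" and is no longer valid UTF-8, while the mb_substr result is "hö".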

Álvaro González
