I have a UTF-8 string containing umlauts, and it displays fine:

var_dump($content);

It outputs "höst lanseras". But when I try this:

for ($i = 0; $i < strlen($content) - 1; $i++) {
    var_dump($content[$i]);
}

I get this:

string(1) "h"
string(1) "o"
string(1) "�"
string(1) "�"
string(1) "s"
string(1) "t"
string(1) " "
string(1) "l"
string(1) "a"
string(1) "n"
string(1) "s"
string(1) "e"
string(1) "r"
string(1) "a"
string(1) "s"

How can I get the umlaut character as a single element when iterating over the string?

Kleyton
  • UTF-8 uses more than one byte for some characters. Test `mb_strlen($content) == strlen($content)`. – ob_start Dec 22 '15 at 22:49
  • PHP treats strings as a list of bytes, not characters. See [What every programmer absolutely, positively needs to know about encodings and character sets to work with text](http://kunststube.net/encoding/) – mario Dec 22 '15 at 22:49
  • Notice 13 chars in your `var_dump($content);` output and 15 chars in your looped `var_dump($content[$i]);` output – RiggsFolly Dec 22 '15 at 22:53

2 Answers

In UTF-8, "ö" is encoded using more than one byte.
PHP strings are plain byte arrays; PHP itself has no notion of "characters".
Accessing a string offset with $str[x] reads one specific byte, and strlen reports the length in bytes, not "characters".

Put all this together and the result is that you're accessing individual bytes rather than characters, and in the case of "ö" that results in outputting half of a character/nonsensical bytes.

Use the mb_ functions to iterate and access strings properly by character, not by byte count: mb_strlen, mb_substr.
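
As a rough illustration of the advice above, here is a minimal sketch that walks the string by character with mb_strlen and mb_substr. It assumes the mbstring extension is enabled, PHP 7.0+ (for the `\u{...}` escape), and that the string is valid UTF-8 with a precomposed "ö". The helper name `utf8_chars` is made up for this example.

```php
<?php
// Assumption: "höst lanseras" with a precomposed ö (U+00F6), encoded as UTF-8.
$content = "h\u{00F6}st lanseras";

// Hypothetical helper (not from the original post): split a UTF-8 string
// into an array with one element per character.
function utf8_chars(string $s): array
{
    $chars = [];
    $length = mb_strlen($s, 'UTF-8');             // length in characters, not bytes
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($s, $i, 1, 'UTF-8'); // the single character at offset $i
    }
    return $chars;
}

$chars = utf8_chars($content);

var_dump(strlen($content));             // int(14): byte count
var_dump(mb_strlen($content, 'UTF-8')); // int(13): character count
var_dump($chars[1]);                    // string(2) "ö": one character, two bytes
```

One caveat: the dump in the question shows a plain "o" followed by two unprintable bytes, which suggests the "ö" there may be stored as "o" plus a combining diaeresis. If so, mb_substr would still return those as two separate elements, and normalizing the string first (for example with Normalizer::normalize from the intl extension) may be needed.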

deceze

strlen() is single-byte:

strlen() returns the number of bytes rather than the number of characters in a string.

UTF-8 is not a single-byte encoding, so you need to use the multi-byte alternative: mb_strlen()

The same rule applies to almost all string manipulation.
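
To show what "almost all string manipulation" can mean in practice, here is a small sketch comparing a byte-based function with its mb_ counterpart. It assumes the mbstring extension is enabled and PHP 7.0+; the variable names are only for illustration.

```php
<?php
// Assumption: "höst" with a precomposed ö (U+00F6), encoded as UTF-8.
$word = "h\u{00F6}st";

// substr() counts bytes: this keeps "h" plus only the first byte of "ö".
$byteCut = substr($word, 0, 2);

// mb_substr() counts characters: this keeps "hö".
$charCut = mb_substr($word, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8'));            // bool(false): cut mid-character
var_dump($charCut === "h\u{00F6}");                        // bool(true)
var_dump(mb_strtoupper($word, 'UTF-8') === "H\u{00D6}ST"); // bool(true): "HÖST"
```

The byte-based cut leaves a truncated sequence that is no longer valid UTF-8, which is the same kind of breakage seen when indexing the string byte by byte.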

Álvaro González