strlen() and UTF-8 encoding

Question

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?

I'm only interested in strlen(), not other functions.

This is the string:

$1ï¿½2

I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.

I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.

PS: This question and its answer (4) come from a mock test for ZCE I bought on eBay.

UTF-8 characters are multibyte characters, and count as as-many-characters-as-they-are-long-in-bytes when using `strlen`. Use http://php.net/manual/en/function.mb-strlen.php for expected results. — Rem.co, Jun 14 '12 at 13:27
@RemcoOverdijk utf-8 encoded characters can be 1-6 bytes long. — Esailija, Jun 14 '12 at 13:28
@Esailija And right you are! I was too hasty, sorry. --correcting-- — Rem.co, Jun 14 '12 at 13:29
my question is only about strlen(). If I put this string into strlen() my answer is 6. When I run iconv_get_encoding() I get "UTF-8" — Jon Lyles, Jun 14 '12 at 13:40
@Esailija Not true, UTF-8 character (encoded code point) can be at most 4 bytes long. — Pavel Radzivilovsky, Jun 15 '12 at 14:52

Anton · Answer 1 · 2012-06-14T14:26:27.980

22

How about using mb_strlen()?

http://lt.php.net/manual/en/function.mb-strlen.php

But if you need to use strlen, it's possible to configure your server by setting the mbstring.func_overload directive to 2, so that calls to strlen in your scripts are automatically replaced with mb_strlen. Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.

edited Jun 14 '12 at 14:26

answered Jun 14 '12 at 13:27

Anton


yes I saw mb_strlen() in other answers, but I'm specifically looking at strlen() – Jon Lyles Jun 14 '12 at 13:38
fixed my answer to answer your comment question. – Anton Jun 14 '12 at 14:20
ew, I wasn't aware of `mbstring.func_overload` - enabling that would break a bunch of my code as I always assume strlen is the length in bytes. – thomasrutter Oct 19 '18 at 00:34

bames53 · Accepted Answer · 2014-09-23T18:55:24.697

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, upside-down question mark, one half fraction, digit two)

If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).

However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that is also legal UTF-8. Reinterpreting those six bytes as UTF-8 would then result in four characters: $1�2 (dollar sign, digit one, Unicode replacement character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO 8859-1 encoding of the three characters "ï¿½".

The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.

It appears that the original string was processed through multiple layers of misinterpretation: first by a UTF-8 decoder applied to non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
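As a minimal sketch of the byte counts discussed above, the two candidate strings can be written out byte by byte with escape sequences, so the result does not depend on how the source file itself is saved. The variable names are illustrative only.

```php
<?php
// The four-character string "$1�2" as UTF-8.
// U+FFFD (the replacement character) encodes as the three bytes EF BF BD.
$fourChars = "\$1" . "\xEF\xBF\xBD" . "2";

// The six-character string "$1ï¿½2" as UTF-8.
// ï (U+00EF), ¿ (U+00BF) and ½ (U+00BD) take two bytes each.
$sixChars = "\$1" . "\xC3\xAF" . "\xC2\xBF" . "\xC2\xBD" . "2";

// strlen() counts bytes, not characters.
echo strlen($fourChars), "\n"; // 6
echo strlen($sixChars), "\n";  // 9

// The raw bytes of the four-character string: read one byte per character
// (as ISO 8859-1 does), EF BF BD shows up as the three characters ï ¿ ½.
echo bin2hex($fourChars), "\n"; // 2431efbfbd32
```

Either way, strlen() reports 6 or 9 here, never 4; the 4 only appears when counting characters rather than bytes.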

score 10 · Answer 3 · answered Jun 14 '12 at 13:28


You need to use the multibyte string function mb_strlen(), like this:

mb_strlen($string, 'UTF-8');
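For comparison, here is a small sketch that puts strlen() and mb_strlen() side by side on the same bytes. It assumes the mbstring extension is loaded, and the value of $string is an illustrative choice (the UTF-8 bytes of "$1�2"), not something given in the question.

```php
<?php
// Assumption: the mbstring extension is available.
// Illustrative input: "$1�2" written as explicit UTF-8 bytes.
$string = "\$1" . "\xEF\xBF\xBD" . "2";

$bytes = strlen($string);             // counts bytes
$chars = mb_strlen($string, 'UTF-8'); // counts code points

echo "strlen: $bytes, mb_strlen: $chars\n"; // strlen: 6, mb_strlen: 4
```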

answered Jun 14 '12 at 13:28

Haim Evgi


score 5 · Answer 4 · answered Jun 14 '12 at 14:13

It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.

The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:

The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)

The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.

Then a third program comes along, reads the file as latin1, and shows $1ï¿½2.
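The three steps above can be roughly reproduced in PHP. This is only a sketch of the suspected chain of events, assuming the mbstring extension is loaded; the choice of ¢ as the original character follows the example above, and the variable names are made up for illustration.

```php
<?php
// Assumption: the mbstring extension is available.

// Step 1: the original text "$1¢2" stored as latin1 (¢ is the single byte A2).
$latin1 = "\$1" . "\xA2" . "2";

// Step 2: a program reads those bytes as UTF-8. The lone byte A2 is not valid
// UTF-8, so it is replaced by U+FFFD, and the text is written out as UTF-8.
mb_substitute_character(0xFFFD);
$rewritten = mb_convert_encoding($latin1, 'UTF-8', 'UTF-8');

// Step 3: a third program reads the new bytes as latin1, one character per
// byte. Converting that reading to UTF-8 gives what ends up on screen.
$displayed = mb_convert_encoding($rewritten, 'UTF-8', 'ISO-8859-1');

echo bin2hex($rewritten), "\n";            // 2431efbfbd32
echo strlen($rewritten), "\n";             // 6 bytes in the file
echo mb_strlen($displayed, 'UTF-8'), "\n"; // 6 characters shown
```

So the file holds six bytes, a UTF-8 reader sees four characters, and a latin1 reader sees six.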

score 2 · Answer 5 · answered Jun 14 '12 at 14:07

No.

I'll use a proof by contradiction.

strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.

UTF8 encoding needs at least 1 byte per character.

We have established that:

there are 4 bytes
a character is represented by no less than 1 byte

...yet we have 6 characters, which is a contradiction. So, no.

However, what's not totally clear is which character set the displaying software (e.g., the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.
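The byte-counting argument in this answer can be checked with a short sketch: for valid UTF-8, the byte count from strlen() is never smaller than the number of characters. The helper countCodePoints() and the sample strings are hypothetical, introduced here only for illustration; the helper relies on PCRE's UTF-8 mode (the /u modifier).

```php
<?php
// Hypothetical helper: count code points by matching one character at a
// time in UTF-8 mode. Assumes the input is valid UTF-8.
function countCodePoints(string $utf8): int
{
    return preg_match_all('/./us', $utf8);
}

$samples = [
    'ascii'       => "abcd",
    'replacement' => "\$1" . "\xEF\xBF\xBD" . "2", // "$1�2"
    'accented'    => "\xC3\xAF\xC2\xBF\xC2\xBD",   // "ï¿½"
];

foreach ($samples as $label => $s) {
    printf("%s: %d bytes, %d characters\n", $label, strlen($s), countCodePoints($s));
}
// ascii: 4 bytes, 4 characters
// replacement: 6 bytes, 4 characters
// accented: 6 bytes, 3 characters
```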

score 1 · Answer 6 · answered Jun 14 '12 at 13:27


Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).

Try mb_strlen() instead.

answered Jun 14 '12 at 13:27

Madara's Ghost


fun-fact: in theory, utf-8 can use up to 8 bytes per character, although this length hasn't been used so far - the longest characters in use are a bunch of four-byte characters (like the Clef-sign and some Chinese characters, for example). – oezi Jun 14 '12 at 13:33
what about strlen(), is it possible for the answer to be less than 6? – Jon Lyles Jun 14 '12 at 13:42
@JonLyles: `strlen()` counts the bytes in the string. If the string has 6 bytes, it'll result in 6. – Madara's Ghost Jun 14 '12 at 13:43
