Why use multibyte string functions in PHP?

Question

At the moment, I don't understand why it is so important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?

Not trying to be catty here, but do you understand the material in this article? http://www.joelonsoftware.com/articles/Unicode.html — wberry, Jul 17 '11 at 06:28

score 17 · Accepted Answer · answered Jul 17 '11 at 06:32


All of the PHP string functions do not handle multibyte strings regardless of your operating system's locale. That is why you need to use the multibyte string functions.

From the Multibyte String Introduction:

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
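To make the quoted passage concrete, here is a minimal sketch of the difference between byte-oriented and character-oriented functions. It assumes the source file is saved as UTF-8 and the mbstring extension is loaded; the sample name `Müller` is just an illustration.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$name = 'Müller'; // 6 characters; "ü" occupies 2 bytes in UTF-8

$byteLength = strlen($name);             // counts bytes
$charLength = mb_strlen($name, 'UTF-8'); // counts characters (code points)

// substr() cuts at a byte offset and splits "ü" in half here.
$byteCut = substr($name, 0, 2);
// mb_substr() cuts at a character offset and keeps "ü" intact.
$charCut = mb_substr($name, 0, 2, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
printf("substr() result is valid UTF-8: %s\n",
    var_export(mb_check_encoding($byteCut, 'UTF-8'), true));
printf("mb_substr() result: %s\n", $charCut);
```

The `substr()` result ends in a lone lead byte, which is the "corrupted garbage string" the manual warns about.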


Francois Deschenes


Hah, that was my fault, sorry. I know what Unicode is, and I know that some characters need two or more bytes (like German umlauts), but I never thought that basic functions in PHP couldn't handle that. As I understand it, `strlen` should give me the _string_ length (like `mb_strlen`), and when I want the real size of the string, that should be possible with something like `sizeof`. But not all PHP functions have an mbstring equivalent (like `str_replace`)? :( – rabudde Jul 17 '11 at 07:13

@rabudde - You can get a list of functions that have multibyte equivalents [here](http://www.php.net/manual/en/mbstring.overload.php). As for `str_replace`, it does work with UTF-8 strings. See [this](http://stackoverflow.com/questions/2652193/can-str-replace-be-safely-used-on-a-utf-8-encoded-string-if-its-only-given-valid). – Francois Deschenes Jul 17 '11 at 07:23

It's a bit confusing that `explode` and `str_replace` work with UTF-8. But thanks! – rabudde Jul 17 '11 at 07:44

So would you say it's a good idea to replace **all** PHP string functions in my own applications with their mbstring equivalents? Could there be negative side effects (apart from performance)? – rabudde Jul 17 '11 at 09:37

If you don't want to replace all `str*` functions with `mb_*` ones, you can add `mbstring.func_overload = 7` inside the `[mbstring]` section of php.ini. The value 7 is actually a bit mask which means "all", but you can change it to override only some specific functions. – Charles-Édouard Coste Jun 14 '17 at 06:52
Are you sure about `All of the PHP string functions do not handle multibyte strings`? I read somewhere (but can't remember where) that PHP string functions only support multi-byte characters up to 3 bytes, and that `mb_*` adds support for 4-byte characters. – Adam May 19 '20 at 06:04

score 8 · Answer 2 · answered Jul 17 '11 at 06:33


Here is my answer in plain English. A single Japanese, Chinese or Korean character takes more than one byte. For example, a typical character such as x takes 1 byte in English, while a character in Japanese, Chinese or Korean takes more than 1 byte. PHP's standard string functions are written to treat 1 byte as 1 character, so if you try to compare or measure Japanese, Chinese or Korean characters with them, they will not work as expected. For example, "Hello World!" in Japanese, Chinese or Korean will take more than 12 bytes.

Read http://www.php.net/manual/en/intro.mbstring.php


Kumar


Well, "Hello World!" in Japanese/Chinese/Korean doesn't (necessarily) have more than 12 bytes, but the equivalent phrase in those languages will have more bytes than characters. – JJJ Jul 17 '11 at 07:02
That's what I meant. I never had a chance to work with these languages or strings. – Kumar Jul 17 '11 at 07:09

score 7 · Answer 3 · edited Apr 27 '22 at 18:43

You do not need to use UTF-8 aware code to process UTF-8. For the most part.

I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.

It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
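The remark about decomposed characters can be sketched as follows. This is only an illustration, assuming PHP 7+ (for the `\u{...}` escape) and the mbstring extension: the same visible "é" can be one code point or two, so even `mb_strlen()` does not always count what a user perceives as one character.

```php
<?php
// Assumption: PHP 7+ for \u{...} escapes, ext-mbstring loaded.
$precomposed = "\u{00E9}";  // "é" as a single code point
$decomposed  = "e\u{0301}"; // "e" followed by COMBINING ACUTE ACCENT

$results = [
    'precomposed' => [strlen($precomposed), mb_strlen($precomposed, 'UTF-8')],
    'decomposed'  => [strlen($decomposed), mb_strlen($decomposed, 'UTF-8')],
];

foreach ($results as $form => [$bytes, $codePoints]) {
    printf("%s: %d bytes, %d code points\n", $form, $bytes, $codePoints);
}
```

Both strings render as a single "é", yet the decomposed form reports two code points.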

But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.

I'll explain why.

It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
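A small sketch of that property, assuming valid UTF-8 input, a UTF-8 source file and the mbstring extension for the comparison. Byte-oriented search and replace find the right spot; the only difference is that offsets are reported in bytes rather than characters.

```php
<?php
// Assumption: this file is saved as UTF-8 and the strings are valid UTF-8.
$text = 'Grüße aus Köln';

// Byte-wise replacement: the bytes of "ö" can only match where a real "ö" starts.
$replaced = str_replace('ö', 'oe', $text);

// Byte-wise splitting on an ASCII delimiter never cuts inside a character.
$words = explode(' ', $text);

// Both functions find the same place; they just count in different units.
$byteOffset = strpos($text, 'Köln');                // offset in bytes
$charOffset = mb_strpos($text, 'Köln', 0, 'UTF-8'); // offset in characters

printf("%s\n", $replaced);
printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
```

The two offsets differ only because "ü" and "ß" each take two bytes.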

Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.

what about one character in two other chars? is it possible to have `\uabcd\u1234` where `\cd12` is valid too? — SOFe, Sep 25 '18 at 13:04

score 2 · Answer 4 · answered Jan 08 '17 at 00:22

PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.

So if you read a file using file_get_contents() you get a binary-safe representation of the file. Whether it is the (binary) representation of an image or a human-readable text file, PHP doesn't care.

Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.

However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.

So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.

If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
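As a quick illustration, assuming the mbstring extension is loaded: the same two bytes yield different character counts depending on which encoding you tell `mb_strlen()` to assume, while `strlen()` always reports the byte count.

```php
<?php
// The two bytes below are "ä" in UTF-8, but two separate characters
// ("Ã" and "¤") when read as ISO-8859-1.
$bytes = "\xC3\xA4";

$byteCount    = strlen($bytes);                  // always the number of bytes
$asUtf8       = mb_strlen($bytes, 'UTF-8');      // characters if the data is UTF-8
$asIso88591   = mb_strlen($bytes, 'ISO-8859-1'); // characters if the data is Latin-1

printf("bytes: %d, as UTF-8: %d, as ISO-8859-1: %d\n",
    $byteCount, $asUtf8, $asIso88591);
```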

The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.

So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
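For real UTF-8 data, the way to pass that encoding information to the `preg_*` functions is the `u` pattern modifier. A minimal sketch, assuming a UTF-8 source file and valid UTF-8 subjects (with `u`, an invalid subject makes the function fail instead of matching):

```php
<?php
// Assumption: this file is saved as UTF-8.
$umlaut = 'ä'; // one character, two bytes

// Without /u the dot matches a single byte, so a two-byte "ä" is not "one character".
$withoutU = preg_match('/^.$/', $umlaut);
// With /u the pattern and subject are treated as UTF-8, and the dot matches "ä".
$withU = preg_match('/^.$/u', $umlaut);

// "Three identical characters in a row" only works when characters are decoded.
$tripleWithoutU = preg_match('/(.)\1\1/', 'xäääy');
$tripleWithU    = preg_match('/(.)\1\1/u', 'xäääy');

printf("single char: %d without /u, %d with /u\n", $withoutU, $withU);
printf("triple char: %d without /u, %d with /u\n", $tripleWithoutU, $tripleWithU);
```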

Summary:

No sane 'default' is possible as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as is useful for stating the length of a blog article.

That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.

Uttam Panara · Answer 5 · 2016-07-22T19:25:53.860

multibyte => multi + byte.

1) It is used to work with strings in languages other than English.

2) The default PHP string functions only work properly with English (or closely related) text.

3) Suppose you want to use strlen(), strpos(), strtoupper() or str_replace() on special characters,
           for example on "Hello" in other languages:
           Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते), Gujarati (હેલો).
           Different languages can have their own character sets.

That is why mbstring was introduced: to work with text in various languages (Chinese, Japanese, etc.).
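A rough sketch using the greetings above, assuming a UTF-8 source file and the mbstring extension. Note that `mb_strlen()` counts code points, so scripts that use combining marks (Hindi and Gujarati here) may report more "characters" than a reader would count.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$greetings = [
    'Chinese'  => '你好',
    'Arabic'   => 'مرحبا',
    'Japanese' => 'こんにちは',
    'Hindi'    => 'नमस्ते',
    'Gujarati' => 'હેલો',
];

$lengths = [];
foreach ($greetings as $language => $word) {
    $lengths[$language] = [
        'bytes'      => strlen($word),             // byte count
        'characters' => mb_strlen($word, 'UTF-8'), // code point count
    ];
    printf("%-8s %2d bytes, %d code points\n",
        $language, $lengths[$language]['bytes'], $lengths[$language]['characters']);
}
```

In every row the byte count is larger than the code point count, which is exactly what `strlen()` gets "wrong" for this kind of text.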

score 0 · Answer 6 · answered Mar 04 '19 at 04:40

Raul González is a perfect example of why:

It is about shortening user names that are too long for a MySQL column. Say we have a 10-character limit and the name Raul González.

The unit test below shows how you can get an error like this

General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)

and how you can avoid it

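public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    // substr() counts bytes: byte 10 is the first half of the two-byte "á",
    // so the result ends in a lone \xC3 and MySQL rejects it.
    try {
        $user->name = substr($name, 0, 10);
        $user->save();
    } catch (Exception $ex) {
        // Expected: "1366 Incorrect string value".
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so "á" stays intact and the save succeeds.
    $user->name = mb_substr($name, 0, 10);
    $user->save();

    $this->assertTrue(true);
}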

Laravel and PHPUnit were used for illustration.
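The same effect can be reproduced without a database. The sketch below assumes a UTF-8 source file and the mbstring extension; it also shows `mb_strcut()`, which is an addition to the example above and is useful when the limit is in bytes rather than characters.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext-mbstring is available.
$name = 'Raul González';

// Byte-based cut: keeps "Raul Gonz" plus only the first byte of "á".
$byteCut = substr($name, 0, 10);
// Character-based cut: keeps 10 whole characters.
$charCut = mb_substr($name, 0, 10, 'UTF-8');
// Byte-limited cut that backs off to a character boundary.
$safeByteCut = mb_strcut($name, 0, 10, 'UTF-8');

printf("substr():    last byte 0x%s, valid UTF-8: %s\n",
    bin2hex(substr($byteCut, -1)),
    var_export(mb_check_encoding($byteCut, 'UTF-8'), true));
printf("mb_substr(): %s\n", $charCut);
printf("mb_strcut(): %s (%d bytes)\n", $safeByteCut, strlen($safeByteCut));
```

The dangling `c3` byte at the end of the `substr()` result is the `'\xC3'` that MySQL complains about in the error message.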
