22

This code

print mb_substr('éxxx', 0, 1);

prints an empty space :(

It is supposed to print the first character, é. This seems to work, however:

print mb_substr('éxxx', 0, 2);

But it's not right, because (0, 2) means 2 characters...

Alex

2 Answers

49

Try passing the encoding parameter to mb_substr, like this:

print mb_substr('éxxx', 0, 1, 'utf-8');

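The encoding is never detected automatically. If you omit the argument, mb_substr uses the internal encoding (see mb_internal_encoding()), which may not be UTF-8.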
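To see why the one-character call comes back broken, it helps to look at the bytes. This is a minimal sketch, assuming the mbstring extension is loaded; the `"\xC3\xA9"` escape is just é spelled out as its two UTF-8 bytes, so the example does not depend on how the file itself is saved, and the variable names are only for illustration:

```php
<?php
// "é" written as its two UTF-8 bytes, followed by "xxx".
$s = "\xC3\xA9xxx";

// Byte length vs. character length.
$bytes = strlen($s);              // counts bytes
$chars = mb_strlen($s, 'UTF-8');  // counts characters

// With a single-byte encoding, "1 character" means 1 byte,
// so only the first half of "é" comes back.
$broken = mb_substr($s, 0, 1, 'ISO-8859-1');

// With the correct encoding, the whole character comes back.
$first = mb_substr($s, 0, 1, 'UTF-8');

var_dump($bytes, $chars, bin2hex($broken), $first === "\xC3\xA9");
```

The half character in `$broken` is not valid UTF-8 on its own, which is why the output looks like an empty or garbled character.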

povilasp
  • The encoding is *never* detected automatically, it just always *defaults* to something. – deceze Dec 19 '12 at 13:20
  • Could it be a better idea if you use [`mb_detect_encoding`](http://php.net/manual/en/function.mb-detect-encoding.php) to *actually* try to detect the encoding? – Alvin Wong Dec 19 '12 at 13:20
  • @AlvinWong No. *Know* what encoding you're working with, there's no other way. – deceze Dec 19 '12 at 13:21
  • @Alvin Wong, that would be more correct, yes, but I could also say that using anything but utf-8 can be considered adventurous and marginal :) – povilasp Dec 19 '12 at 13:21
  • @deceze, wasn't sure, but thanks for the clarification, I updated the answer. – povilasp Dec 19 '12 at 13:21
  • Thanks, that works. Can mb_substr work like `substr($string, 1)` without passing mb_strlen() as the length argument? – Alex Dec 19 '12 at 13:24
  • @Alex, that I think is another question, but my guess would be that yes - because the parameter is optional as it is in substr. – povilasp Dec 19 '12 at 13:27
  • Yes, but the 'UTF-8' argument has to come after the length argument. Anyway, never mind, I'll just use mb_strlen. – Alex Dec 19 '12 at 13:28
  • OK, then how about [`mb_internal_encoding`](http://hk1.php.net/manual/en/function.mb-internal-encoding.php) instead of passing `"utf-8"` to all `mb_*` functions? Just like Álvaro G. Vicario has pointed out – Alvin Wong Dec 19 '12 at 13:31
  • @AlvinWong is right: it's better to set mb_internal_encoding if this isn't a one-off call and you plan to use a lot of mb_* functions throughout your code. – povilasp Dec 19 '12 at 13:38
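
On the question in the comments about getting "everything from offset 1" the way `substr($string, 1)` does: as far as I know, passing `null` as the length means "to the end of the string" (PHP 5.4.8 and later), so mb_strlen() is not needed. A rough sketch, assuming mbstring is loaded; the variable names are made up for the example:

```php
<?php
$s = "\xC3\xA9xxx"; // "éxxx" as UTF-8 bytes

// Option 1: null as the length, encoding passed explicitly.
$rest = mb_substr($s, 1, null, 'UTF-8');

// Option 2: set the internal encoding once, then omit both arguments.
mb_internal_encoding('UTF-8');
$rest2 = mb_substr($s, 1);

var_dump($rest, $rest2);
```

Both calls skip the whole é rather than just its first byte.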
13

In practice I've found that, on some systems, the multi-byte functions default to ISO-8859-1 as the internal encoding. That effectively ruins their ability to handle multi-byte text.

Setting a good default will probably fix this and some other issues:

mb_internal_encoding('UTF-8');
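
A slightly fuller sketch of the same idea, assuming mbstring is loaded and the call is made once early in the script (a bootstrap file is only a suggestion). Calling mb_internal_encoding() without arguments returns the current setting, which is handy for checking what a given system defaults to:

```php
<?php
// Set the default once, before any other mb_* calls.
mb_internal_encoding('UTF-8');

// With no argument, the function returns the current internal encoding.
$current = mb_internal_encoding();

// The encoding argument can now be left out.
$first = mb_substr("\xC3\xA9xxx", 0, 1); // "\xC3\xA9" is "é" in UTF-8

var_dump($current, $first === "\xC3\xA9");
```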

Álvaro González