Is there a simple way to get a character from a multibyte string in PHP?

Question

This is my problem: my language (Portuguese) uses the ISO-8859-1 character encoding. When I want to access a character from a string like 'coração' (heart), I use:

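mb_internal_encoding('ISO-8859-1');
$str = "coração";

// The explicit 'UTF-8' argument overrides the internal encoding set above.
$len = mb_strlen($str, 'UTF-8');

// Print one character per line.
for ($i = 0; $i < $len; ++$i) {
    echo mb_substr($str, $i, 1, 'UTF-8') . "<br/>";
}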

This produces:

c
o
r
a
ç
ã
o

This works fine, but my concern is that mb_substr is not as fast as normal string access. I want a simple way to do this, like normal string character access: echo $str[$pos]. Is it possible?

score 4 · Answer 1 · edited May 23 '17 at 12:16

"mb_substr function is not as fast as [...] normal string character access: echo $str[$pos]. Is it possible?"

No.

The multibyte functions have to inspect every character to determine how many bytes it occupies (1 to 4 in UTF-8). That is exactly why byte indexing ($a[n]) cannot work for characters: you don't know which byte(s) make up the nth character until you have read all the characters before it.
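As a small illustration of the point above, the sketch below compares byte indexing with mb_substr on a UTF-8 string. The string is written with explicit byte escapes (an assumption made here so the result does not depend on how the source file is saved); the variable names are illustrative only.

```php
<?php
// "coração" as UTF-8 bytes: "ç" is C3 A7 and "ã" is C3 A3.
$str = "cora\xC3\xA7\xC3\xA3o";

$byteLength = strlen($str);             // counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // counts characters

// Byte indexing returns only the first byte of the two-byte "ç".
$byteAt4 = $str[4];

// mb_substr walks the string from the start and returns the whole character.
$charAt4 = mb_substr($str, 4, 1, 'UTF-8');

echo $byteLength, " bytes, ", $charLength, " characters\n";
echo bin2hex($byteAt4), " vs ", bin2hex($charAt4), "\n";
```

The byte at index 4 on its own is not a valid UTF-8 sequence, which is why echoing $str[4] on a UTF-8 string shows a broken character.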

To speed things up a bit, you can look at the answers here: How to iterate UTF-8 string in PHP?
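One common way to avoid calling mb_substr once per position (not necessarily what the linked answers do) is to split the string into an array of characters a single time and then index the array. This is a minimal sketch assuming the input is valid UTF-8; on PHP 7.4 and later, mb_str_split($str, 1, 'UTF-8') could be used for the split instead.

```php
<?php
// "coração" as UTF-8 bytes, so the example is independent of the file encoding.
$str = "cora\xC3\xA7\xC3\xA3o";

// One pass over the string: splitting on the empty pattern with the /u
// modifier yields one array element per UTF-8 character.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

if ($chars === false) {
    // preg_split returns false when the subject is not valid UTF-8.
    throw new RuntimeException('Input is not valid UTF-8');
}

// From here on, character access is a plain array lookup.
$count = count($chars);
$fifth = $chars[4];

echo $count, " characters, fifth is ", $fifth, "\n";
```

The split costs one scan and extra memory for the array, so it pays off mainly when many positions of the same string are read.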

However, since you use ISO 8859-1 (Latin-1), you don't need the mb_ functions at all: in that encoding every character is encoded in exactly one byte.
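A minimal sketch of that single-byte case, assuming the string really is stored as ISO-8859-1 bytes (written here with escapes so the file encoding does not matter). The conversion to UTF-8 is only there so the characters display correctly on a UTF-8 terminal or page.

```php
<?php
// "coração" as ISO-8859-1 bytes: "ç" is E7 and "ã" is E3, one byte each.
$str = "cora\xE7\xE3o";

// In a single-byte encoding, strlen() is also the character count.
$len = strlen($str);

$chars = [];
for ($i = 0; $i < $len; ++$i) {
    // $str[$i] is a complete character; convert it to UTF-8 for output only.
    $chars[] = mb_convert_encoding($str[$i], 'UTF-8', 'ISO-8859-1');
}

echo implode("\n", $chars), "\n";
```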

answered May 02 '12 at 11:24

CodeCaster

Upvoted. Essentially the two answers on the link provided, [this one](http://stackoverflow.com/a/14366023/793036) and [my answer](http://stackoverflow.com/a/17156392/793036) if you have mbstring.func_overload set to 7, are what you want. They basically do the indexing if it's available and use the slow mb_substr only if necessary. In OP's example, it will only require mb_substr once. – Andrew Jun 17 '13 at 20:49
Thank you for the useful links and explanations provided. 5 years later, still the most relevant answer. – Valdrinium Jun 29 '17 at 16:46

score 1 · Answer 2 · answered May 02 '12 at 11:34

Try:

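// The /u modifier makes "." match one whole UTF-8 character; /s lets it match newlines too.
preg_match_all('/./us', $str, $ar_chars);
// The characters are in $ar_chars[0], so $ar_chars[0][4] is the fifth character.
print_r($ar_chars[0]);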

answered May 02 '12 at 11:34

tty01

score 0 · Answer 3 · answered Apr 28 '12 at 05:10

... Sort of. If you use a fixed-width encoding (ISO 8859-*, UCS-2, UTF-32, or UTF-16 restricted to the BMP), you can compute a character's byte offset with a fixed multiplier. For the multi-byte encodings you still have to read several bytes per character, though.
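A rough sketch of the fixed-multiplier idea, assuming the text starts out as UTF-8 and is converted once to UTF-32BE, where every character occupies exactly 4 bytes. The helper name utf32CharAt is hypothetical and exists only for this illustration.

```php
<?php
// "coração" as UTF-8 bytes.
$utf8 = "cora\xC3\xA7\xC3\xA3o";

// One-time conversion to a fixed-width encoding (4 bytes per character).
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

// Hypothetical helper: fetch the character at $pos using a fixed multiplier,
// then convert that single character back to UTF-8 for output.
function utf32CharAt(string $utf32, int $pos): string
{
    return mb_convert_encoding(substr($utf32, $pos * 4, 4), 'UTF-8', 'UTF-32BE');
}

$charCount = intdiv(strlen($utf32), 4);
$charAt4 = utf32CharAt($utf32, 4);

echo $charCount, " characters, index 4 is ", $charAt4, "\n";
```

The trade-off is a full conversion up front and four times the memory for mostly-ASCII text, so whether this is actually faster than mb_substr depends on how many lookups follow.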

answered Apr 28 '12 at 05:10

Ignacio Vazquez-Abrams

Well... But my question is about an efficient way to do these accesses. I tested a normal string concatenation loop ($new_str .= $old_str[2], just for testing) against mb_substr ($new_str .= mb_substr($old_str, 2, 1, 'UTF-8')) and got this with a loop of 50,000 iterations: 0.016 s for normal access versus 4.9802091121674 s for mb_substr! That's a big performance problem! – Lucas Batistussi Apr 28 '12 at 05:20
With a fixed-width encoding you can use a fixed multiplier. – Ignacio Vazquez-Abrams Apr 28 '12 at 05:21
How could I do this, then? Show me an example! – Lucas Batistussi Apr 28 '12 at 05:22
`substr($ucs2string, $pos * 2, 2)` – Ignacio Vazquez-Abrams Apr 28 '12 at 05:23
Well... but what about the performance issue (as I showed in the test I ran above)? – Lucas Batistussi Apr 28 '12 at 05:26
