6

This is my problem: my language (Portuguese) uses the ISO-8859-1 character encoding. When I want to access each character of a string like 'coração' (heart), I use:

mb_internal_encoding('ISO-8859-1');
$str = "coração";

$len = mb_strlen($str,'UTF-8');

for($i=0;$i<$len;++$i)
    echo mb_substr($str, $i, 1, 'UTF-8')."<br/>";

This produces:

c
o
r
a
ç
ã
o

This works fine, but my concern is that mb_substr is not as fast as normal string indexing. I would like a simple way to do this, like normal string character access: echo $str[$pos]. Is it possible?

Johannes Pille
Lucas Batistussi

3 Answers

4

mb_substr is not as fast as [...] normal string character access: echo $str[$pos]. Is it possible?

No.

The multibyte functions have to inspect every character to determine how many bytes it occupies (1 to 4 in UTF-8). That is exactly why character indexing ($a[n]) won't work: $a[n] returns the byte at offset n, and you don't know which byte(s) hold the nth character until you've read all the characters before it.
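To illustrate why the lookup is linear, here is a minimal sketch of what any UTF-8 "character at position n" routine has to do. The helper name utf8_char_at is hypothetical (it is not a PHP built-in), and the sketch assumes valid UTF-8 input and PHP 7.1+ for the "\u{...}" escapes and the nullable return type.

```php
<?php
// Hypothetical helper: returns the $n-th character (0-based) of a UTF-8
// string, or null if $n is out of range. Assumes $s is valid UTF-8.
function utf8_char_at(string $s, int $n): ?string
{
    $len = strlen($s); // length in bytes, not characters
    $pos = 0;          // current byte offset
    for ($i = 0; $pos < $len; $i++) {
        $byte = ord($s[$pos]);
        // The lead byte tells us how many bytes this character occupies.
        if ($byte < 0x80) {
            $width = 1;
        } elseif ($byte < 0xE0) {
            $width = 2;
        } elseif ($byte < 0xF0) {
            $width = 3;
        } else {
            $width = 4;
        }
        if ($i === $n) {
            return substr($s, $pos, $width);
        }
        $pos += $width; // skip to the next character's lead byte
    }
    return null;
}

$str = "cora\u{E7}\u{E3}o";        // "coração" encoded as UTF-8
echo utf8_char_at($str, 4), "\n";  // ç
echo bin2hex($str[4]), "\n";       // c3 (only the first byte of ç)
```

Every character before position n has to be visited to find the right byte offset, which is the cost that plain $str[$pos] avoids.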

To speed things up a bit, you can look at the answers here: How to iterate UTF-8 string in PHP?

However, since you use ISO 8859-1 or Latin-1, you don't have to use the mb_ functions at all, since in that encoding all characters are encoded in one byte.
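As a rough sketch of that last point, assuming the string really is stored as ISO-8859-1 bytes (written here with explicit \x escapes so the example does not depend on the source file's encoding) and that the mbstring extension is available for the display conversion:

```php
<?php
// "coração" as ISO-8859-1 bytes: ç is 0xE7 and ã is 0xE3, one byte each.
$str = "cora\xE7\xE3o";
$len = strlen($str); // 7: in Latin-1, bytes and characters coincide

for ($i = 0; $i < $len; $i++) {
    // $str[$i] is already a whole character. The conversion below is only
    // needed to print it correctly on a UTF-8 page or terminal.
    echo mb_convert_encoding($str[$i], 'UTF-8', 'ISO-8859-1'), "\n";
}
```

If the page itself is served as ISO-8859-1, the conversion can be dropped and echo $str[$i] is enough.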

Community
CodeCaster
  • Upvoted. Essentially the two answers on the link provided, [this one](http://stackoverflow.com/a/14366023/793036) and [my answer](http://stackoverflow.com/a/17156392/793036) if you have mbstring.func_overload set to 7, are what you want. They basically do the indexing if it's available and use the slow mb_substr only if necessary. In OP's example, it will only require mb_substr once. – Andrew Jun 17 '13 at 20:49
  • Thank you for the useful links and explanations provided. 5 years later, still the most relevant answer. – Valdrinium Jun 29 '17 at 16:46
1

Try:

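// The /u modifier makes PCRE treat $str as UTF-8, so "." matches a whole character.
preg_match_all( "/./u", $str, $ar_chars );
// The matched characters are in $ar_chars[0].
print_r( $ar_chars );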
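A related sketch, assuming the string is UTF-8 and PHP 7+ for the "\u{...}" escapes: splitting once with preg_split gives a flat array, after which every access is an ordinary array index. The variable name $chars is just illustrative.

```php
<?php
$str = "cora\u{E7}\u{E3}o"; // "coração" encoded as UTF-8

// An empty pattern with /u splits between every UTF-8 character;
// PREG_SPLIT_NO_EMPTY drops the empty strings at both ends.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n"; // 7
echo $chars[4], "\n";     // ç
```

The split itself still walks the whole string once, so this pays off mainly when the same string is indexed many times.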
tty01
0

... Sort of. If you use a fixed-width encoding (ISO 8859-*, UCS-2, UTF-32, or UTF-16 restricted to the BMP), then you can compute a character's byte offset with a fixed multiplier. For the multi-byte encodings you will still need to read several bytes per character, though.
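A minimal sketch of the fixed-multiplier idea using UTF-32, assuming the mbstring extension is available and PHP 7+; the helper name utf32_char_at is hypothetical and it does no range checking:

```php
<?php
// Hypothetical helper: the $n-th character always starts at byte $n * 4,
// so no scanning is needed. The result is converted back to UTF-8.
function utf32_char_at(string $utf32, int $n): string
{
    return mb_convert_encoding(substr($utf32, $n * 4, 4), 'UTF-8', 'UTF-32BE');
}

$utf8  = "cora\u{E7}\u{E3}o"; // "coração" encoded as UTF-8
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

echo intdiv(strlen($utf32), 4), "\n"; // 7 characters, 4 bytes each
echo utf32_char_at($utf32, 4), "\n";  // ç
```

The trade-off is memory: every character takes four bytes, and the string must be converted before and after use.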

Ignacio Vazquez-Abrams