
I'm having a problem splitting a word that contains UTF-8 Polish characters.

I've checked the PHP docs for str_split, but there's no parameter to set the charset.

I have the word "mała", and I have to split it into letters, wrap each letter in a span, and return the resulting HTML string.

Result of str_split('mała'):

array:5 [
  0 => "m"
  1 => "a"
  2 => b"Å"
  3 => b"‚"
  4 => "a"
]

json_last_error_msg() returns "Malformed UTF-8 characters, possibly incorrectly encoded", so as I suspected the problem is related to the Polish letters, but I can't find a way to set the charset for str_split.

Here's prepared array to be JSON encoded:

array:2 [
  "pieces" => array:6 [
    0 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">m</span><span class="dropable">a</span>"
    1 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">s</span><span class="dropable">a</span>"
    2 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">p</span><span class="dropable">a</span>"
    3 => b"<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">Å</span><span class="dropable">‚</span><span class="dropable">a</span>"
    4 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">c</span><span class="dropable">a</span>"
    5 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">t</span><span class="dropable">a</span>"
  ]
  "engine" => "Wstepne"
]

Index 3 has a strange "b" prefix before the string, plus these malformed characters.

The code that generates these strings is:

$htmlString = '';
foreach (str_split($piece) as $key => $letter) {
    $htmlString .= '<span class="dropable">'.$letter.'</span>';
}
return $htmlString;

I tried using utf8_encode on $letter. It removed the "b" in front of the string, but it still creates two spans:

3 => "<span class="dropable">m</span><span class="dropable">a</span><span class="dropable">Å</span><span class="dropable">‚</span><span class="dropable">a</span>"

Any more ideas?

Thanks for the help.

1 Answer


str_split works at the byte level, not at the character level (despite its name). So you're in fact splitting mała along its bytes, not along its characters. That's why you get an array of five items instead of four: the items at index 2 and 3 together form the UTF-8 encoding of ł.
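To make the byte/character difference concrete, here is a small sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8, so 'ł' is stored as two bytes.
$str = 'mała';

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteCount = strlen($str);
$charCount = mb_strlen($str, 'UTF-8');

// Hex dump of the two bytes that str_split() ends up separating.
$hex = bin2hex('ł');

// Each half of ł on its own is invalid UTF-8, so json_encode() gives up.
$encoded = json_encode(str_split($str));
$error = json_last_error_msg();

echo $byteCount, ' bytes, ', $charCount, ' characters, ł = 0x', $hex, PHP_EOL;
var_dump($encoded); // false
echo $error, PHP_EOL;
```

The lone bytes 0xC5 and 0x82 are what show up as "Å" and "‚" in the dump from the question.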

You need to use either the mbstring or the iconv extension to split your string manually.

$str = 'mała';
$len = mb_strlen($str, 'UTF-8');
$result = [];
for ($i = 0; $i < $len; $i++) {
    $result[] = mb_substr($str, $i, 1, 'UTF-8');
}
var_dump($result);
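Applied to the span-wrapping code from the question, the same idea might look like the sketch below. The function name wrapLetters is made up for this example. mb_str_split() only exists since PHP 7.4, so a preg_split() with the /u modifier is used as a fallback on older versions. The htmlspecialchars() call is an extra precaution that the original code doesn't have.

```php
<?php
// Hypothetical helper (name invented for this example): wraps every
// character of $piece in its own span.
function wrapLetters(string $piece): string
{
    // mb_str_split() is available since PHP 7.4; on older versions fall
    // back to a regex split, where the /u modifier makes PCRE treat the
    // subject as UTF-8 instead of raw bytes.
    $letters = function_exists('mb_str_split')
        ? mb_str_split($piece, 1, 'UTF-8')
        : preg_split('//u', $piece, -1, PREG_SPLIT_NO_EMPTY);

    $htmlString = '';
    foreach ($letters as $letter) {
        // Escaping is an addition here, in case a piece ever contains < or &.
        $htmlString .= '<span class="dropable">'
            . htmlspecialchars($letter, ENT_QUOTES, 'UTF-8')
            . '</span>';
    }
    return $htmlString;
}

$html = wrapLetters('mała');
echo $html, PHP_EOL;

// The result is valid UTF-8 again, so json_encode() no longer fails.
echo json_encode(['pieces' => [$html]], JSON_UNESCAPED_UNICODE), PHP_EOL;
```

With this, ł stays in one span, so the "b" prefix and the extra span should both go away.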
Stefan Gehrig
  • I wouldn't say "byte level", more like the assumption that 1 char = 1 byte, because PHP's core string functions aren't Unicode-aware. – Marc B Jun 30 '16 at 15:17
  • That's true - if you know that the terms *char* and *byte* are equivalent and that unicode characters are something different. Otherwise it might be confusing. That's why I chose *byte*. – Stefan Gehrig Jun 30 '16 at 15:19
  • You could also use the `grapheme_*` functions from the intl extension, which will then also behave correctly on combined Unicode characters… ;) – heiglandreas Jun 21 '17 at 07:31
  • I used this to count the unique letters of various words. `$letters = []; foreach ($words as $word) { $len = mb_strlen($word, 'UTF-8'); for ($i = 0; $i < $len; $i++) { $letters[] = mb_substr($word, $i, 1, 'UTF-8'); } } $letters = array_unique($letters); asort($letters); $letters_string = implode(' ', $letters); echo $letters_string;` – Avatar Feb 18 '23 at 07:20
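As a follow-up to the `grapheme_*` comment above: a rough sketch of splitting by grapheme cluster instead of by code point. splitGraphemes is a made-up name. It uses grapheme_strlen() and grapheme_substr() when the intl extension is loaded, and otherwise falls back to PCRE's \X, which matches one extended grapheme cluster. The "\u{...}" escape needs PHP 7 or later.

```php
<?php
// Hypothetical helper (name invented for this example): splits a string
// into user-perceived characters (grapheme clusters).
function splitGraphemes(string $str): array
{
    if (function_exists('grapheme_strlen')) {
        // intl is loaded: step through the string one cluster at a time.
        $result = [];
        $len = grapheme_strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $result[] = grapheme_substr($str, $i, 1);
        }
        return $result;
    }

    // Fallback without intl: \X matches one extended grapheme cluster.
    preg_match_all('/\X/u', $str, $matches);
    return $matches[0];
}

// "e" followed by U+0301 (combining acute accent) is displayed as one "é".
$word = "cafe\u{0301}";

$byCodePoint = mb_strlen($word, 'UTF-8'); // counts the accent separately
$byGrapheme = splitGraphemes($word);      // keeps "e" + accent together

echo $byCodePoint, ' code points, ', count($byGrapheme), ' graphemes', PHP_EOL;
```

For a word like mała, where ł is a single precomposed code point, the mb_* and grapheme approaches give the same result. The difference only shows up with combining marks.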