
I want to get the most used word from a string. The only problem is that the Swedish characters (Å, Ä, and Ö) only show up as �.

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';

That code will output the following:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [�] => 1
    [�] => 1
    [and] => 2
    [�] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [�] => 1
    [�] => 1
    [�] => 1
)

How can I make it "see" the Swedish characters and other special characters?

Airikr
  • You shouldn't be surprised by any PHP function with a name starting with `str` not being multi-byte safe. The user comments in the manual suggest alternatives. – CBroe Sep 24 '16 at 11:51
  • @CBroe `...PHP function with a name starting with str...` where is this function? – SaidbakR Sep 24 '16 at 12:02
  • Try the `mb_str_word_count` function instead of `str_word_count`: http://stackoverflow.com/a/17725577/6797531 – CatalinB Sep 24 '16 at 12:07
  • @CatalinB Thank you but the output will then be like this: `Array([This is just a test post with the Swedish characters �, �, and Ö. Also as lower cased characters: �, �, and �.] => 1)` – Airikr Sep 24 '16 at 13:03

2 Answers


Here is a regex-based solution: split the string into "words" on punctuation and whitespace in Unicode mode (the u modifier), then count occurrences with a regular array_count_values call.

array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY));

Produces:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This was tested in a Unicode console; you might need to enforce an encoding if you are using a browser. Either add a <meta> tag, set the encoding in your browser, or send a header from PHP.
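
Building on the split-and-count approach above, a minimal sketch of picking the most used word might look like this. It assumes the script and `$string` are UTF-8 encoded; the variable names `$words`, `$counts`, `$maxCount`, and `$mostUsed` are illustrative and not part of the original answer. The `header()` call is one possible way to announce the encoding to a browser.

```php
<?php
// Sketch only: assumes this file (and therefore $string) is saved as UTF-8.
$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';

// Tell the browser the output is UTF-8 (skipped on the command line).
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Split on punctuation and whitespace; the /u modifier makes PCRE treat
// the subject as UTF-8, so multi-byte letters stay intact.
$words = preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY);
$counts = array_count_values($words);

// Highest count, and every word that reaches it (ties are possible).
$maxCount = max($counts);
$mostUsed = array_keys($counts, $maxCount);

echo '<pre>';
print_r($mostUsed);
echo '</pre>';
```

For the sample string, two words tie at two occurrences each, so `$mostUsed` holds `characters` and `and`. Note that `Å` and `å` are counted separately here; lowercasing the string with `mb_strtolower` first (assuming the mbstring extension is available) would merge them.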

Mr Lister
MarZab

I managed to get rid of the � characters by adding ÅåÄäÖö to the character list àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ that is passed to str_word_count.
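
For reference, here is a sketch of that change applied to the code from the question. It assumes the source file is saved as UTF-8 and that PHP runs with the default C locale; `$charlist` and `$counts` are illustrative names. As far as I can tell, `str_word_count` is not multi-byte aware and treats the list as individual bytes, so this works only because every byte of the added letters ends up in the list, and it may not carry over to other characters or encodings.

```php
<?php
// Sketch only: assumes UTF-8 source encoding and the default C locale.
$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';

// Original character list from the question, extended with the Swedish letters.
$charlist = 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ' . 'ÅåÄäÖö';

// Format 1 returns the words as an array; the third argument lists
// additional characters to be treated as part of a word.
$counts = array_count_values(str_word_count($string, 1, $charlist));

echo '<pre>';
print_r($counts);
echo '</pre>';
```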

Airikr