Count occurrences of each word in UTF-8 string

Question

I am stuck on a problem assigned by my mentor, who wants me to write a PHP script that reads the content of a UTF-8 encoded file and returns the number of occurrences of each word as JSON.

The text in the file is Russian, written in Cyrillic.

Here is the sample text:

Он бледен. Мыслит страшный путь.
В его душе живут виденья.
Ударом жизни вбита грудь,
А щеки выпили сомненья.

Клоками сбиты волоса,
Чело высокое в морщинах,
Но ясных грез его краса
Горит в продуманных картинах.

Сидит он в тесном чердаке,
Огарок свечки режет взоры,
А карандаш в его руке
Ведет с ним тайно разговоры.

Он пишет песню грустных дум,
Он ловит сердцем тень былого.
И этот шум… душевный шум…
Снесет он завтра за целковый.

From my research, PHP's built-in string functions handle this problem seamlessly only when the text is ASCII encoded. Are there third-party libraries or APIs that can handle UTF-8 encoded strings in languages other than English?

"PHP's predefined string functions can seamlessly handle this problem only under a condition that it must be ASCII encoded" - There're functions that support Unicode (custom encodings, to be precise), it's just that not all of PHP functions do (they were written before UTF-8 got popular). — Álvaro González, Nov 25 '22 at 19:09

Andrey Bessonov · Answer 1 · 2022-11-25T19:25:42.963

Use the mbstring (mb_*) functions:

<?php

$str = "Он бледен. Мыслит страшный путь.
        В его душе живут виденья.
        Ударом жизни вбита грудь,
        А щеки выпили сомненья.

        Клоками сбиты волоса,
        Чело высокое в морщинах,
        Но ясных грез его краса
        Горит в продуманных картинах.

        Сидит он в тесном чердаке,
        Огарок свечки режет взоры,
        А карандаш в его руке
        Ведет с ним тайно разговоры.

        Он пишет песню грустных дум,
        Он ловит сердцем тень былого.
        И этот шум… душевный шум…
        Снесет он завтра за целковый.";


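// Split on runs of anything that is not a Unicode letter or digit. The /u flag
// makes PCRE treat pattern and subject as UTF-8, so "…" also acts as a separator.
$words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($str, 'UTF-8'));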

$mp = [];

foreach ($words as $word) {
    if (!mb_strlen($word)) continue;
    
    if (!isset($mp[$word])) {
        $mp[$word] = 0;
    }
    $mp[$word]++;
}

var_dump($mp);
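
Since the question asks for a file as input and JSON as output, here is a minimal sketch of that variant. The helper name count_words_in_file and the temporary demo file are my own assumptions for illustration, and it assumes the mbstring extension is loaded:

```php
<?php
// Hypothetical helper (not part of the original answer): counts the words
// in a UTF-8 encoded text file, ignoring letter case.
function count_words_in_file(string $path): array
{
    $text = file_get_contents($path);
    if ($text === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Lower-case with an explicit encoding, then split on non-letters/non-digits.
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        throw new RuntimeException("Input is not valid UTF-8");
    }

    $counts = array_count_values($words);
    arsort($counts); // most frequent words first
    return $counts;
}

// Demo: write the last stanza's first three lines to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "Он пишет песню грустных дум,\nОн ловит сердцем тень былого.\nИ этот шум… душевный шум…");
$counts = count_words_in_file($path);
unlink($path);

// JSON_UNESCAPED_UNICODE keeps the Cyrillic keys readable instead of \uXXXX escapes.
echo json_encode($counts, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT), PHP_EOL;
```

Without JSON_UNESCAPED_UNICODE the output is still valid JSON, but every Cyrillic letter is written as an escape sequence.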

jspit · Answer 2 · 2022-11-26T10:17:56.240

This solution splits the text into an array of words. The regular expression is only a starting point and needs refinement (for example, it does not treat "…", "!" or "?" as separators). The counting is then done with array_count_values.

$result = array_count_values(preg_split("~[ ,.;\r\n\t]+~u",$str, -1, PREG_SPLIT_NO_EMPTY));

This version is case-sensitive: upper-case and lower-case spellings of a word are counted separately.
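
To illustrate the effect of letter case, here is a small sketch with a made-up sample string (not the poem from the question). It compares the one-liner as written with a variant that lower-cases the text first using mb_strtolower, which assumes the mbstring extension is available:

```php
<?php
// Made-up sample text for illustration only.
$str = "Он пишет. Сидит он. Он ловит.";
$pattern = "~[ ,.;\r\n\t]+~u";

// As in the one-liner: "Он" and "он" are different array keys.
$caseSensitive = array_count_values(preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY));

// Lower-casing first merges both spellings into one key.
$caseFolded = array_count_values(preg_split($pattern, mb_strtolower($str, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY));

echo json_encode($caseSensitive, JSON_UNESCAPED_UNICODE), PHP_EOL;
echo json_encode($caseFolded, JSON_UNESCAPED_UNICODE), PHP_EOL;
```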

Demo: https://3v4l.org/5iVnX

If the case of the first letter should be ignored, or if hyphenated words such as make-up should be recognized as one word, the code becomes somewhat longer.

function mb_word_count($string, $mode = MB_CASE_TITLE, $characters = null){
  $string = mb_convert_case($string, $mode, "UTF-8");
  $addChars = $characters ? preg_quote($characters, '~') : "";
  $regEx = "~[^\p{L}0-9".$addChars."]+~u";
  return array_count_values(preg_split($regEx,$string, -1, PREG_SPLIT_NO_EMPTY));
}

The mb_word_count function returns an array with each word as a key and its count as the value. Because digits are treated like letters, made-up tokens such as str2hex are recognized as one word. By default (MB_CASE_TITLE) the text is converted to title case before counting, so spellings that differ only in letter case, such as он and Он, are counted together. Other modes can be chosen from those accepted by mb_convert_case. Additional characters can be passed in $characters; they are then treated like letters.
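
A short usage sketch of the $characters parameter, assuming the function exactly as defined above (it is repeated here only so the snippet runs on its own). The sample sentence is made up for illustration:

```php
<?php
// The function from the answer, reproduced unchanged.
function mb_word_count($string, $mode = MB_CASE_TITLE, $characters = null){
  $string = mb_convert_case($string, $mode, "UTF-8");
  $addChars = $characters ? preg_quote($characters, '~') : "";
  $regEx = "~[^\p{L}0-9".$addChars."]+~u";
  return array_count_values(preg_split($regEx,$string, -1, PREG_SPLIT_NO_EMPTY));
}

// Illustrative input, not taken from the question.
$text = "Кто-то пришёл, кто-то ушёл";

// Without extra characters the hyphen is a separator: "кто" and "то" are counted apart.
$split = mb_word_count($text, MB_CASE_LOWER);

// With "-" passed as $characters the hyphenated word stays in one piece.
$joined = mb_word_count($text, MB_CASE_LOWER, "-");

echo json_encode($split, JSON_UNESCAPED_UNICODE), PHP_EOL;
echo json_encode($joined, JSON_UNESCAPED_UNICODE), PHP_EOL;
```

MB_CASE_LOWER is used here instead of the default so that the keys come out in plain lower case.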

I changed my mind about stealing =)) — Andrey Bessonov, Nov 25 '22 at 19:33
I've also edited and expanded my answer again. — jspit, Nov 26 '22 at 10:21
