php - count number of instances of a word in an array supporting UTF8

Question

I'm creating a jQuery tag cloud on a PHP site. In my MySQL database I have a 'tags' field that holds a comma-separated list of words. I want to produce an array of the words together with the frequency with which each one appears. Just to complicate things, the text will all be in Hebrew (UTF-8 encoded).

In English, this solution works perfectly:

$words = array_count_values(str_word_count($str, 1));
print_r($words);

(taken from the question "php: sort and count instances of words in a given string")

With Hebrew text, the array is not filled.

I found the post "str_word_count() function doesn't display Arabic language properly". Its solution works, but it only gives a total count of the words and doesn't build an array of results like the previous function does.

I'd like the results to look something like this:

Array
(
    [happy] => 4
    [beautiful] => 1
    [lines] => 3
    [pear] => 2
    [gin] => 1
    [rock] => 1
)

Any suggestions?

Instead of `str_word_count`, why not `explode` on the comma? — deceze, Jun 18 '13 at 14:12
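
A minimal sketch of the `explode` approach suggested in the comment above. The sample string and variable names are assumptions for illustration, not taken from the asker's database. Because the tags are split on commas rather than on word boundaries, no Unicode-aware word matching is needed, and multi-word tags stay intact.

```php
<?php
// Hypothetical sample value of the 'tags' field (UTF-8 Hebrew, comma separated).
$str = 'שמח, יפה, שמח,קווים , שמח';

// Split on commas, then strip ASCII whitespace around each tag.
$tags = array_map('trim', explode(',', $str));

// Drop empty entries left behind by stray or trailing commas.
$tags = array_filter($tags, 'strlen');

// Count how often each tag occurs; the tags become the array keys.
$counts = array_count_values($tags);

print_r($counts);
```

`trim` only removes single-byte ASCII whitespace here, which is safe for UTF-8 input because those bytes never occur inside a multi-byte character.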

score 2 · Answer 1 · answered Jun 18 '13 at 13:58

Although this is not exactly the answer you are hoping for, I would encourage you to first rethink your database design. Saving several comma-separated tags in one field is not very clever. You should build a separate table for the tags with only two columns:

tag
id of corresponding object/post or whatever your application is about

There are many advantages:

It's easier to remove or add tags.
You can get the array you're looking for without some crappy PHP code, using a single SQL query like "select tag, count(id) from tags group by tag".
That's easier and MUCH faster when you have many tags.
Last but not least, I would bet (without being sure) that MySQL won't have the problems with different alphabets that you are evidently running into in PHP.

That's a fair point... I'm working with an existing CRM and I think it'll be easier for users to simply add a comma-separated list to a text box than to add "items" to each record... Having said that, it would actually be fairly simple for me to implement your suggestion... If none of the PHP solutions work, I'll try out yours :) — Dog, Jun 19 '13 at 06:31

Jon · Accepted Answer · 2013-06-18T15:25:26.803

It is possible to make a UTF-8 (only!) version using the Unicode mode of PHP's PCRE functions.

function utf8_str_word_count($string, $format = 0, $charlist = null) {
    if ($charlist === null) {
        $regex = '/\\pL[\\pL\\p{Mn}\'-]*/u';
    }
    else {
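        // Escape each character for PCRE, including the "/" delimiter,
        // so that a "/" in $charlist cannot terminate the pattern early.
        $split = array_map(function ($char) { return preg_quote($char, '/'); },
                           preg_split('//u', $charlist, -1, PREG_SPLIT_NO_EMPTY));
        $regex = sprintf('/(\\pL|%1$s)([\\pL\\p{Mn}\'-]|%1$s)*/u',
                         implode('|', $split));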
    }

    switch ($format) {
        default:
        case 0:
            // For PHP >= 5.4.0 this is fine:
            return preg_match_all($regex, $string);

            // For PHP < 5.4 it's necessary to do this:
            // $results = null;
            // return preg_match_all($regex, $string, $results);
        case 1:
            $results = null;
            preg_match_all($regex, $string, $results);
            return $results[0];
        case 2:
            $results = null;
            preg_match_all($regex, $string, $results, PREG_OFFSET_CAPTURE);
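            // Build byte offset => word. A plain loop avoids passing the
            // by-reference functions end() and reset() to array_map().
            $words = array();
            foreach ($results[0] as $match) {
                $words[$match[1]] = $match[0];
            }
            return $words;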
    }
}

This function follows the semantics of str_word_count as closely as possible; in particular, if you replace "locale dependent" with "UTF-8" in the following note for str_word_count, the result holds true for this function:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

Additionally, the characters ' and - are considered part of a word but cannot start one; however, any characters specified in the $charlist parameter can start a word which means that specifying ' and/or - slightly changes the way the function works. This behavior also matches the original str_word_count.
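
As a usage sketch, the word list can be fed to `array_count_values` to get the frequency array the question asks for. To keep the example self-contained it uses a condensed helper, `utf8_words` (a hypothetical name), which corresponds only to format 1 of the function above with the default character list; the sample strings are also assumptions.

```php
<?php
// Condensed, hypothetical helper: format 1 of the function above,
// default charlist only (letters, combining marks, "'" and "-").
function utf8_words($string) {
    $matches = array();
    preg_match_all('/\pL[\pL\p{Mn}\'-]*/u', $string, $matches);
    return $matches[0];
}

// Hypothetical comma-separated Hebrew tag list.
$str = 'שמח, יפה, שמח, אגס, שמח';
$words = array_count_values(utf8_words($str));
print_r($words);

// A leading apostrophe is skipped, while an inner hyphen stays in the word.
$english = utf8_words("'tis well-known");
print_r($english);
```

Note that this splits on word boundaries, so a tag made of two words would be counted as two separate entries.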

It is also interesting to note that you could make the function recognize only some subset of Unicode scripts by appropriately replacing \pL with character properties such as \p{Greek} -- see the PCRE Unicode reference.
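
A minimal sketch of that substitution, assuming only Hebrew-script words should be counted. The helper name `hebrew_words` and the sample text are hypothetical; the pattern is the default one from the function above with `\pL` swapped for `\p{Hebrew}`.

```php
<?php
// Hypothetical variant: \pL is replaced with \p{Hebrew}, so words written
// in other scripts (here: Latin) are ignored.
function hebrew_words($string) {
    $matches = array();
    preg_match_all('/\p{Hebrew}[\p{Hebrew}\p{Mn}\'-]*/u', $string, $matches);
    return $matches[0];
}

$found = hebrew_words('שלום hello עולם world');
print_r($found);
```

One caveat: a script property such as `\p{Hebrew}` matches every code point assigned to that script, not only letters, so script-specific punctuation may also be accepted as part of a word.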

I'll note that this fails for languages which have no specific word separators, like Chinese or Japanese... — deceze, Jun 18 '13 at 15:15
@deceze: True, but there's nothing that can be done about that. — Jon, Jun 18 '13 at 15:25
