7

I wrote the following function to return a specific number of words from a text:

function brief_text($text, $num_words = 50) {
    $words = str_word_count($text, 1);
    $required_words = array_slice($words, 0, $num_words);
    return implode(" ", $required_words);
}

It works well with English text, but with Arabic text it fails and doesn't return the words as expected. For example:

$text_en = "Cairo is the capital of Egypt and Paris is the capital of France";
echo brief_text($text_en, 10);

will output `Cairo is the capital of Egypt and Paris is the`, while

$text_ar = "القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا";
echo brief_text($text_ar, 10); 

will output � � � � � � � � � �.

I know that the problem is with the str_word_count function but I don't know how to fix it.
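
A small illustration of the likely cause: `str_word_count()` is not multibyte-aware and examines the string one byte at a time, while each Arabic letter takes two bytes in UTF-8. This is only a sketch, assuming the string is UTF-8 encoded and the mbstring extension is available:

```php
<?php
// Assumes UTF-8 input and the mbstring extension.
$word = "مصر"; // three Arabic letters

$byte_length = strlen($word);             // counts bytes
$char_length = mb_strlen($word, 'UTF-8'); // counts characters

echo $byte_length, PHP_EOL; // 6
echo $char_length, PHP_EOL; // 3
```

Any function that walks the string byte by byte sees six unrelated bytes here rather than three letters, which fits the broken output above.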

UPDATE

I have already written another function that works well with both English and Arabic, but I was looking for a fix for the problem that str_word_count() causes when it is used with Arabic. Anyway, here is my other function:

    function brief_text($string, $number_of_required_words = 50) {
        $string = trim(preg_replace('/\s+/', ' ', $string));
        $words = explode(" ", $string);
        $required_words = array_slice($words, 0, $number_of_required_words); // get a specific number of elements from the array
        return implode(" ", $required_words);
    }
Amr
  • 4
    Please vote here for a `mb_str_word_count()` function: https://bugs.php.net/bug.php?id=63671 – ComFreek Dec 14 '12 at 18:20
  • @Amr I only speak Spanish and a little English. Could you publish a list of Arabic words that have a space in them? – rkmax Dec 14 '12 at 20:32
  • Try to use this function, I have tried it and it works great. https://stackoverflow.com/a/64319676/3604226 – Mo'men Mohamed Oct 12 '20 at 14:21

6 Answers

3

Try this function for counting words:

// You can name the function whatever you like
if (!function_exists('mb_str_word_count'))
{
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        mb_internal_encoding( 'UTF-8');
        mb_regex_encoding( 'UTF-8');

        $words = mb_split('[^\x{0600}-\x{06FF}]', $string);
        switch ($format) {
            case 0:
                return count($words);
                break;
            case 1:
            case 2:
                return $words;
                break;
            default:
                return $words;
                break;
        }
    }
}



echo mb_str_word_count("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا") . PHP_EOL;

Resources

Recommendations

  • Use the tag <meta charset="UTF-8"/> in HTML files
  • Always send a Content-Type: text/html; charset=utf-8 header when serving pages
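
As a rough illustration of these recommendations in PHP, the sketch below sends the charset header and checks that input is valid UTF-8 before any word counting is attempted. The helper names `send_utf8_header()` and `is_valid_utf8()` are made up for this example; `header()`, `headers_sent()` and `mb_check_encoding()` are standard PHP functions.

```php
<?php
// Hypothetical helpers for illustration; assumes the mbstring extension.

// Send the charset header, but only if output has not started yet.
function send_utf8_header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Return true when $text is a well-formed UTF-8 string.
function is_valid_utf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

send_utf8_header();
var_dump(is_valid_utf8("القاهرة")); // bool(true)
var_dump(is_valid_utf8("\xD9"));     // bool(false): truncated two-byte sequence
```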
rkmax
  • AFAIK that word count will not work for Arabic text. In Arabic-like languages there are some words that have spaces in them. – Arash Milani Dec 14 '12 at 18:34
  • 2
    You can make a list of characters that act as separators. Additionally, you can build exceptions; I suppose these words with spaces inside must follow some rule that can be programmed. – rkmax Dec 14 '12 at 18:49
  • 1
    1+ for the Resources on the "A Rule-Based Arabic Stemming Algorithm" :) – Arash Milani Dec 14 '12 at 18:56
  • 1
    I wouldn't call that function `mb_str_word_count()` because it could break the whole system if the native function ever gets implemented. – ComFreek Dec 14 '12 at 19:12
  • OK, go ahead and rename it to something else. The answer has been updated; indeed a better approach :-) 1+ for that `function_exists` part. – Arash Milani Dec 14 '12 at 19:14
  • 1
    I've just added the $format and $charlist parameters so nobody gets confused if they're only using the polyfill. – ComFreek Dec 14 '12 at 20:15
  • Doesn't work for some languages (e.g. Eastern European Unicode text) – T.Todua Jun 03 '15 at 08:56
  • @rkmax Thanks, it works for Persian (Farsi). All Arabic, Persian and other language users, please come here (https://bugs.php.net/bug.php?id=63671) and vote for `mb_str_word_count` – Mostafa Jul 05 '16 at 09:41
2

To accept ASCII characters too:

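if (!function_exists('mb_str_word_count'))
{
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        // Split on runs of characters that are not letters, digits or apostrophes.
        // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading or
        // trailing separators, so an empty string yields an empty array.
        $words = preg_split('~[^\p{L}\p{N}\']+~u', trim($string), -1, PREG_SPLIT_NO_EMPTY);
        if ($words === false)
            $words = array(); // e.g. invalid UTF-8 input
        switch ($format) {
            case 0:
                return count($words);
            case 1:
            case 2:
            default:
                return $words;
        }
    }
}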
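
One way the same Unicode-aware split could be plugged into the asker's use case (returning the first N words) is sketched below. The name `brief_text_unicode()` is hypothetical, the input is assumed to be UTF-8, and `\p{M}` is added to the pattern on the assumption that Arabic diacritics (combining marks) should stay attached to their word.

```php
<?php
// Sketch only: brief_text_unicode() is a hypothetical name, not part of PHP.
// Assumes UTF-8 input; needs PCRE with Unicode support (the "u" modifier).
function brief_text_unicode(string $text, int $num_words = 50): string
{
    // Split on runs of anything that is not a letter, combining mark,
    // digit or apostrophe; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('~[^\p{L}\p{M}\p{N}\']+~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return ''; // e.g. the input was not valid UTF-8
    }
    return implode(' ', array_slice($words, 0, $num_words));
}

echo brief_text_unicode("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا", 3), PHP_EOL;
// القاهرة هى عاصمة
```

Note that this approach discards punctuation, because punctuation is treated as a separator. If the excerpt should keep it, splitting on whitespace only is probably the better fit.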
ahoo
1

If you want to count the words in Farsi or Arabic text, you can use the code below:

public function customWordCount($content_text)
{
    $resultArray = explode(' ',trim($content_text));
    foreach ($resultArray as $key => $item)
    {
        if (in_array($item,["|",";",".","-","=",":","{","}","[","]","(",")"]))
        {
            $resultArray[$key] = '';
        }
    }

    $resultArray = array_filter($resultArray);
    return count($resultArray);
}
1

I would replace every letter with an arbitrary English letter and then count the words:

str_word_count(preg_replace("/[\x{0600}-\x{06FF}a-zA-Z]/u", "a", "أشهد أن لا إله إلا الله"))
Sarout
0

A while ago I wanted to calculate the reading time of a paragraph and ran into the same issue, so I simply counted the spaces in the paragraph :) (note that it won't be perfectly accurate, but it suits me)

Like this:

substr_count($text, ' ') + 1;
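
A slightly more robust variant of the same idea collapses runs of whitespace first and handles the empty string. This is only a sketch: `count_words_by_spaces()` is a hypothetical name and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper; assumes UTF-8 input.
function count_words_by_spaces(string $text): int
{
    // Collapse every run of whitespace into a single space, then trim the ends.
    $normalized = trim(preg_replace('/\s+/u', ' ', $text) ?? '');
    if ($normalized === '') {
        return 0; // avoid reporting one word for an empty string
    }
    return substr_count($normalized, ' ') + 1;
}

echo count_words_by_spaces("  أشهد   أن لا  "), PHP_EOL; // 3
```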
Erfan Paslar
  • 1
    Good idea. And you can make it more accurate by replacing any multiple spaces with only one space before using the `substr_count` method. – Amr Nov 29 '21 at 13:43
  • Yeah, that's right, it will give the exact word count, but in my case it didn't matter that much. – Erfan Paslar Nov 29 '21 at 14:04
0

My PHP 8.1 solution:

if (!function_exists('mb_str_word_count')) {
    function mb_str_word_count($string, $format = 0): array|bool|int
    {
        return match ($format) {
            1 => get_words($string),
            2 => get_words($string, count_word_order_as_index: true),
            default => count(get_words($string)),
        };
    }


}


function get_words(string $string, bool $count_word_order_as_index = false): array
{
    $words = [];
    $word = '';
    $word_start = 0; // character offset at which the current word begins

    // Store a finished word, keyed by its start offset when requested.
    // Note: str_word_count() format 2 uses byte offsets; these are character offsets.
    $store = function () use (&$words, &$word, &$word_start, $count_word_order_as_index): void {
        if ($word === '') {
            return; // skip empty words produced by consecutive spaces
        }
        if ($count_word_order_as_index) {
            $words[$word_start] = $word;
        } else {
            $words[] = $word;
        }
        $word = '';
    };

    foreach (mb_str_split($string) as $position => $letter) {
        if ($letter === ' ') {
            $store();
            continue;
        }
        if ($word === '') {
            $word_start = $position;
        }
        $word .= $letter;
    }
    $store(); // the last word has no trailing space

    return $words;
}