I want to count the number of words in a non-English sentence with PHP. I tried str_word_count, but it does not give the desired result, and I don't want to use mb_strlen because it returns the length of the string rather than the number of words. Can someone help me with this?

So far I have done this:

function count_words($string) {
    $string = html_entity_decode($string);
    $string= str_replace("'", "'", $string);
    $t= array(' ', "t", '=', '+', '-', '*', '/', '', ',', '.', ';', ':', '[', ']', '{', '}', '(', ')', '<', '>', '&', '%', '$', '@', '#', '^', '!', '?', '~'); // separators
    $string= str_replace($t, " ", $string);
    $string= trim(preg_replace("/s+/", " ", $string));
    $num= 0;
    if (my_strlen($string)>0) {
        $word_array= explode(" ", $string);
        $num= count($word_array);
    }
    return $num;
}

$string = "আমি 'আমার' দেশ, ভারতকে ভালবাসি";
echo count_words($string);

It should output 5, but it outputs 6. I found that the problem occurs whenever the string contains a comma or quotation marks. How can I correct that? I also want to show only the first 3 words of the string. How can I do that?

Baby Babai
  • The reason you get back 6 is that the comma in `দেশ,` is replaced by a space, which leaves two consecutive spaces. My guess is that you would rather remove that character by replacing it with an empty string `''`. Alternatively, you could filter empty strings out of `$word_array` – Remy Jan 31 '21 at 08:16
  • @Remy yes can you show me how to do that – Baby Babai Jan 31 '21 at 08:17
  • Does this answer your question? [(grep) Regex to match non-ASCII characters?](https://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters) – RïshïKêsh Kümar Jan 31 '21 at 08:31
  • @RïshïKêshKümar No mate – Baby Babai Jan 31 '21 at 08:33

2 Answers

Your array $t contains symbols that get replaced with a space, which introduces additional whitespace into the string. Since the space (' ') is also the delimiter you pass to explode, $word_array will contain an empty string for each such extra space.

To get rid of those empty strings, which are certainly not words, you can simply filter the array at the end, as the code below does.

Finally, if you want to work with the words of your string, the function can return the array of words instead of the count. You can then take the first 3 words, or however many you want, from the array.

$string = "আমি 'আমার' দেশ, ভারতকে ভালবাসি";

function words($string)
{
    $string = html_entity_decode($string);
    // html_entity_decode() may leave &#039; encoded, depending on its flags.
    $string = str_replace("&#039;", "'", $string);
    $t = array(' ', "\t", '=', '+', '-', '*', '/', '\\', ',', '.', ';', ':', '[', ']', '{', '}', '(', ')', '<', '>', '&', '%', '$', '@', '#', '^', '!', '?', '~'); // separators
    $string = str_replace($t, " ", $string);
    // Collapse runs of whitespace; the /u flag treats the string as UTF-8.
    $string = trim(preg_replace("/\s+/u", " ", $string));

    $word_array = [];
    if ($string !== '') {
        $word_array = explode(" ", $string);

        // Filter out those redundant empty strings that might be a by-product
        // of replacing characters from $t with a whitespace ' ' and explode.
        $word_array = array_filter($word_array, function ($word) {
            return $word !== '';
        });
        // PHP 7.4
        // $word_array = array_filter($word_array, fn ($word) => $word !== '');
    }

    return $word_array;
}

$words = words($string);
echo count($words) . PHP_EOL;

// Additionally you could output the first 3 words
echo implode(' ', array_slice($words, 0, 3));

Result:

5
আমি 'আমার' দেশ

Whether your original function is reliable in counting words is of course something you would have to double check yourself.
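As an alternative to explode followed by array_filter, the splitting and the filtering can be done in one step with preg_split and its PREG_SPLIT_NO_EMPTY flag. This is only a minimal sketch: split_words is a hypothetical helper name, and the character class is my assumption of which separators matter, modelled on the $t list above.

```php
<?php
// Hypothetical helper: split on runs of whitespace and ASCII punctuation.
function split_words(string $text): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that explode() would keep,
    // so "দেশ, ভারতকে" (comma plus space) yields no empty entry.
    $parts = preg_split(
        '/[\s=+\-*\/\\\\,.;:\[\]{}()<>&%$@#^!?~]+/u',
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? [] : $parts;
}

$words = split_words("আমি 'আমার' দেশ, ভারতকে ভালবাসি");
echo count($words), PHP_EOL;                            // 5
echo implode(' ', array_slice($words, 0, 3)), PHP_EOL;  // আমি 'আমার' দেশ
```

The quotation marks are deliberately not in the character class, so `'আমার'` keeps its quotes, just as in the function above.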

Remy

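You can use a regex that matches every run of non-ASCII characters; preg_match_all returns the number of matches, which is the word count here: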

<?php
$string = "আমি 'আমার' দেশ, ভারতকে ভালবাসি";
$pattern = '/[^\x00-\x7F]+/';
echo preg_match_all($pattern, $string);
?>

EDIT: Answer to your whole question:

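<?php
$string = "আমি 'আমার' দেশ, ভারতকে ভালবাসি";
$pattern = '/[^\x00-\x7F]+/';
// $res[0] holds every matched word; the return value is the match count.
$count = preg_match_all($pattern, $string, $res);
// Print the first three words (fewer if the sentence is shorter).
echo implode(" ", array_slice($res[0], 0, 3));
?>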
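
One caveat with the pattern above is that it skips words written in ASCII, so a sentence that mixes scripts would be undercounted. A possible variation, sketched here under the assumption that a "word" is a run of letters, combining marks, and digits, uses Unicode property classes with the /u flag. The helper name unicode_words is hypothetical.

```php
<?php
// Hypothetical helper: returns the words found in $text as an array.
function unicode_words(string $text): array
{
    // \p{L} = letters, \p{M} = combining marks (e.g. Bengali vowel signs),
    // \p{N} = digits; the /u flag makes PCRE read the input as UTF-8.
    preg_match_all('/[\p{L}\p{M}\p{N}]+/u', $text, $matches);

    // If matching fails (e.g. invalid UTF-8), fall back to an empty list.
    return $matches[0] ?? [];
}

$words = unicode_words("আমি 'আমার' দেশ, ভারতকে ভালবাসি India");
echo count($words), PHP_EOL;                            // 6
echo implode(' ', array_slice($words, 0, 3)), PHP_EOL;  // আমি আমার দেশ
```

With this input the ASCII-range pattern would count 5, because it ignores "India". Text that contains zero-width joiners inside words may need those characters added to the class.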
Abolfazl Mohajeri