How to do word counts for a mixture of English and Chinese in Javascript

Question

I want to count the number of words in a passage that contains both English and Chinese. For English, it's simple. Each word is a word. For Chinese, we count each character as a word. Therefore, 香港人 is three words here.

So for example, "I am a 香港人" should have a word count of 6.

Any idea how I can count it in JavaScript/jQuery?

Thanks!

score 11 · Accepted Answer · answered Dec 05 '13 at 12:01

Try a regex like this:

/[\u00ff-\uffff]|\S+/g

For example, "I am a 香港人".match(/[\u00ff-\uffff]|\S+/g) gives:

["I", "am", "a", "香", "港", "人"]

Then you can just check the length of the resulting array.

The \u00ff-\uffff part of the regex is a Unicode character range; you probably want to narrow this down to just the characters you want to count as words. For example, CJK Unified Ideographs would be \u4e00-\u9fcc.

function countWords(str) {
    var matches = str.match(/[\u00ff-\uffff]|\S+/g);
    return matches ? matches.length : 0;
}
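
As a rough sketch of the narrowing suggested above (the name countWordsCjk is made up for illustration, and the range is only the CJK Unified block mentioned in the answer), the same function could look like this. With the wide \u00ff-\uffff range every Cyrillic letter would be counted as its own word; with the narrow range a Cyrillic word falls through to \S+ and counts once.

```javascript
// Hypothetical variant of countWords: only characters in \u4e00-\u9fcc
// are counted one by one; everything else is counted per whitespace-separated run.
function countWordsCjk(str) {
    var matches = str.match(/[\u4e00-\u9fcc]|\S+/g);
    return matches ? matches.length : 0;
}

console.log(countWordsCjk("I am a 香港人")); // 6
console.log(countWordsCjk("Привет мир"));    // 2
```

Note that a run such as "computing都不錯的" is still swallowed whole by \S+, so this only helps when Chinese and non-Chinese text are separated by spaces.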

Please test with this string: "I am a 香港人 * * * * * * * | ] }". It returns 10. How can I count words without the special characters? — Duc Manh Nguyen, Mar 25 '21 at 05:24
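
One possible way to skip stand-alone symbols, sketched under the assumption that "English" means ASCII letters and digits only (the function name and the exact character classes are illustrative, not part of the accepted answer): match either a single CJK character or a run of ASCII letters/digits, and ignore everything else.

```javascript
// Hypothetical helper: counts CJK characters one by one and ASCII words once each.
// Symbols such as * | ] } match neither alternative, so they are not counted.
function countWordsIgnoringSymbols(str) {
    // An inner apostrophe or hyphen is allowed so that "don't" stays one word.
    var matches = str.match(/[\u4e00-\u9fcc]|[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/g);
    return matches ? matches.length : 0;
}

console.log(countWordsIgnoringSymbols("I am a 香港人 * * * * * * * | ] }")); // 6
console.log(countWordsIgnoringSymbols("don't stop"));                        // 2
```

Accented letters are not in the ASCII class, so a word like "naïve" would be split in two; widen the class if that matters for your text.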

score 1 · Answer 2 · edited May 23 '17 at 11:54

It can't be 6, because the length of a string includes the spaces too:

var d = "I am a 香港人";
d.length //returns 10
d.replace(/\s+/g, "").length  //returns 7, excluding spaces

FYI: Your site should be properly encoded.

I think I found what you need. "I am a 香港人" contains the letter "a" twice, so with the help of @PSL's answer I found a way that counts each distinct character only once.

var d = "I am a 香港人";
var uniqueList=d.replace(/\s+/g, '').split('').filter(function(item,i,allItems){
    return i==allItems.indexOf(item);
}).join('');
console.log(uniqueList.length);  //returns 6

Following your comments, I assume the sentence has a space between each word, as in "I am a 香 港 人". I have altered the code accordingly:

var d = "I am a 香 港 人";

var uniqueList=d.split(' ').filter(function(item,i,allItems){
    return i==allItems.indexOf(item);
});
console.log(uniqueList.length);  //returns 6

answered Dec 05 '13 at 10:12 · Praveen

`\s+` is probably better, seeing as `[SPACE][SPACE]` shouldn't get misinterpreted as a word. – h2ooooooo Dec 05 '13 at 10:14
@h2ooooooo I got your point, but I have used the global modifier, which is working in http://jsfiddle.net/TuGHm/2/. Please correct me if I am wrong. – Praveen Dec 05 '13 at 10:33
It seems that the uniqueList created is "Iam香港人", and "a" is missing here. If you try some longer words, it will all go wrong. For example, "I am a good 香港人" will generate uniqueList variable as "Iamgod香港人", returning word count of 9 instead of 7. – user2335065 Dec 05 '13 at 11:18
To clarify, I want to count words, but not characters. So "This is a sentence" contains 4 words. But in Chinese, each character is considered as a word. – user2335065 Dec 05 '13 at 11:22
@user2335065 then word count will be 4 (I,am,a,香港人 ) – Praveen Dec 05 '13 at 11:35
That is not what user2335065 wants, but it is not difficult to combine the methods provided by Dagg Nabbit and Praveen, i.e. check whether the words are English or Chinese by matching the range \u00ff-\uffff and then count them in two different ways. – fmchan Nov 18 '14 at 04:08
@Praveen, Chinese language is not a space-delimited language like English. To understand what is a `word` in Chinese, one can view it like this: while `character` (e.g.: a, b, c, d, e, ...) in English has no meaning, `word` (e.g.: hello, world, ...) has its own dictionary meaning. In the meanwhile, "`character`" (e.g.: 香, 港, 人, ...) in Chinese has its own dictionary meaning, thus it's intuitive to use `word` to represent a single Chinese character. Furthermore, "香港人" is considered a `phrase` which means "Hongkonger(s)". – cychoi Feb 13 '15 at 11:54
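
The combination described in the comments above (split on whitespace, then treat in-range characters and other runs differently) could be sketched roughly as follows. The function name is invented for illustration, and the \u00ff-\uffff range is taken from the comment; it can be narrowed as discussed in the accepted answer.

```javascript
// Hypothetical sketch: each character in the range counts as one word,
// each run of other characters inside a token counts as one word.
function countMixedWords(str) {
    var inRange = /[\u00ff-\uffff]/; // no "g" flag, so test() keeps no state
    var total = 0;
    var tokens = str.split(/\s+/);
    for (var i = 0; i < tokens.length; i++) {
        var inRun = false; // true while inside a run of out-of-range characters
        for (var j = 0; j < tokens[i].length; j++) {
            if (inRange.test(tokens[i].charAt(j))) {
                total++;
                inRun = false;
            } else if (!inRun) {
                total++;
                inRun = true;
            }
        }
    }
    return total;
}

console.log(countMixedWords("I am a good 香港人"));  // 7
console.log(countMixedWords("This is a sentence")); // 4
```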

Ken Lee · Answer 3 · 2019-10-02T21:02:53.827

I have tried the script from the accepted answer, but it sometimes counts the words wrongly. For example, some people will type "香港人computing都不錯的" with no spaces. The script counts this as 4 words, because \S+ matches "computing都不錯的" as a single token:

<script>
var str = "香港人computing都不錯的";

var matches = str.match(/[\u00ff-\uffff]|\S+/g);
var x = matches ? matches.length : 0;
alert(x); // 4
</script>

To fix the problem, I have changed the code to:

<script>
var str = "香港人computing都不錯的";

// Replace special characters such as the middle dot (U+007F to U+00FE) with spaces.
str = str.replace(/[\u007F-\u00FE]/g, ' ');

// Keep only the ASCII text: every run of other characters (e.g. Chinese)
// becomes a space, so the English words can be counted on their own.
var str1 = str.replace(/[^!-~\d\s]+/g, ' ');

// Keep only the non-ASCII text: remove all ASCII characters and whitespace,
// so the Chinese characters can be counted on their own.
var str2 = str.replace(/[!-~\d\s]+/g, '');

var matches1 = str1.match(/[\u00ff-\uffff]|\S+/g);
var matches2 = str2.match(/[\u00ff-\uffff]|\S+/g);

var count1 = matches1 ? matches1.length : 0; // English words
var count2 = matches2 ? matches2.length : 0; // Chinese characters

// Total for the mixed string.
var total = count1 + count2;

alert(total); // 8
</script>

Now the script counts the words in a mixture of Chinese and English correctly. Enjoy.
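
For comparison, a single-regex alternative is sketched below. It is not the method used in this answer: it assumes only the CJK Unified range \u4e00-\u9fcc should be counted per character, and it does not do the middle-dot clean-up shown above. The idea is to exclude that range from the "run" alternative so that a run stops as soon as a Chinese character begins.

```javascript
// Hypothetical one-pass variant: a CJK character is one word; any run of
// characters that are neither whitespace nor CJK is one word.
function countWordsOnePass(str) {
    var matches = str.match(/[\u4e00-\u9fcc]|[^\s\u4e00-\u9fcc]+/g);
    return matches ? matches.length : 0;
}

console.log(countWordsOnePass("香港人computing都不錯的")); // 8
console.log(countWordsOnePass("I am a 香港人"));          // 6
```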
