I want to get the word count of a multi-language text.

For example, take a text that contains both English and Chinese: "The last Olympics was held in 北京". The count should be 8, because there are six English words and two Chinese characters. This matches the word count in Microsoft Word.

What's the best way to do that in Ruby and in JavaScript?

Jonathan Eustace
larryzhao
  • Why is 北京 two words? Do you really have in mind multi-language text in general, or just English and Chinese? – sawa Sep 19 '12 at 05:24
  • @sawa `北京` should be counted as two words in Chinese, even though it is also two Chinese characters. I am sure about this since I am Chinese; it's different from English. Ideally the solution would cover all languages, but Chinese and English could be a first step. – larryzhao Sep 19 '12 at 05:33
  • Can I ask why this is downvoted? – larryzhao Sep 19 '12 at 05:33
  • Somebody's having a bad day, I guess; it's a legit question. Check my answer, I think that's the result you're looking for. – elclanrs Sep 19 '12 at 05:34

2 Answers

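I have a solution based on "How can I detect CJK characters in a string in Ruby?". It splits the text on whitespace, counts each token without CJK characters as one word, and counts every character of a token that contains CJK characters as a word.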

s = 'The last Olympics was held in 北京'
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end
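total = s.split.inject(0) do |sum, word|
  if word.contains_cjk?
    # Count every character of a CJK token as one word. This relies on
    # String#length counting characters, which holds only in Ruby 1.9+;
    # on 1.8 it returns bytes, so another method is needed there.
    sum + word.length
  else
    sum + 1
  end
end
# total => 8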
halfelf

You could try this in JavaScript. It collects the non-English symbols by excluding every character that can appear in English text. I might have forgotten some characters, and it may not work with other languages that have extra special characters, but give it a try. I'm using jQuery's $.trim function for brevity, but you could also use one of the approaches from "How do I trim a string in javascript?".

Demo: http://jsbin.com/otusuv/7/edit

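var str = 'The last Olympics 隶草 was held in 北京';
var words = '', symbols = '';
// Group 1 collects runs of word characters and whitespace; group 2 collects
// everything else that is not in the listed ASCII punctuation (the CJK symbols).
str.replace(/([\w\s]*)([^\w;,.'"{}\[\]+_)(*&\^%$#@!~\/?]*)/g, function(a,b,c) {
    words += b;
    symbols += c;
});
words = $.trim(words);
words = words ? words.split(/\s+/) : [];          // avoid counting '' as a word
symbols = symbols.replace(/\s/g, '').split('');   // strip all whitespace, not just the first space

var total_words = words.length + symbols.length;  // 6 + 4 = 10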

You may also want to try XRegExp. It's a JavaScript library that extends regular expressions, including Unicode script matching such as \p{Han} through its Unicode addons.
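If you can target a newer engine, Unicode property escapes (the `u` flag, ES2018 and later) let you match CJK scripts directly, without an extra library. Here is a minimal sketch under that assumption. The name `countWords` and the rule that an apostrophe or hyphen inside a word does not split it are my own choices for illustration, not part of the code above.

```javascript
// Matches a single character from the scripts the Ruby answer checks for.
// Requires an engine that supports Unicode property escapes (ES2018+).
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;

// Runs of letters/digits in any script, optionally joined by ' ’ or -.
const WORD = /[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/gu;

function countWords(text) {
  // Every CJK character counts as one word.
  const cjkCount = (text.match(CJK) || []).length;
  // Replace CJK characters with spaces so they separate neighbouring words
  // instead of being counted again as part of them.
  const rest = text.replace(CJK, ' ');
  // Each remaining run of letters/digits counts as one word.
  const otherCount = (rest.match(WORD) || []).length;
  return cjkCount + otherCount;
}
```

With this, `countWords('The last Olympics was held in 北京')` returns 8, and punctuation is ignored rather than having to be listed in the pattern. Whether one character per word is the right rule for Japanese or Korean text is a separate question; the sketch just applies the same rule to all four scripts.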

elclanrs