3

I need to count the number of words in a UTF-8 string, i.e. I need to write a Python function that takes "एक बार,एक कौआ, बहुत प्यासा, था" as input and returns 7 (the number of words).

I tried the regular expression "\b" as shown below, but the results are inconsistent.

import re
wordCntExp = re.compile(ur'\b', re.UNICODE)
sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
# Each word should contribute two boundaries, hence the shift right by 1 (divide by 2)
print len(wordCntExp.findall(sen.decode('utf-8'))) >> 1
12

Any interpretation of the above result, or any other approach to solving this problem, would be appreciated.

mnagel
user2586432

3 Answers

5

Try this:

import re
# Decode to unicode first so \s is matched per character, not per UTF-8 byte
words = re.split(ur"[\s,]+", sen.decode('utf-8'), flags=re.UNICODE)
count = len(words)

It splits the sentence on runs of whitespace and commas. You can add any other characters that should not count as part of a word to the character class in the first argument.
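For readers on Python 3, the same idea might look like the sketch below. It assumes Python 3, where `str` is already Unicode, so neither `decode()` nor `re.UNICODE` is needed. The function name `count_words` and its `separators` parameter are hypothetical, introduced only for illustration. It also drops the empty strings that `re.split` returns when the text starts or ends with a separator.

```python
import re

def count_words(sentence, separators=r"\s,"):
    """Count tokens separated by runs of whitespace or commas (hypothetical helper)."""
    # Build a character class from the separators and split on runs of them.
    tokens = re.split("[" + separators + "]+", sentence)
    # re.split yields empty strings at the edges if the text starts
    # or ends with a separator, so filter those out before counting.
    return len([t for t in tokens if t])

sen = "एक बार,एक कौआ, बहुत प्यासा, था"
print(count_words(sen))
```

To treat more punctuation as separators, extend the character class, for example `count_words(text, separators=r"\s,;")`.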

inspired by this

python re documentation

Community
nio
0

I don't know anything about your language's structure, but can't you simply count the spaces?

>>> len(sen.split()) + 1
7

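Note the + 1: it was originally there because n words are separated by n - 1 spaces. With split(), which already returns the whitespace-separated tokens (6 here), the + 1 only compensates for "बार,एक", where the comma has no space after it, so it will not be correct in general. [edited to split on arbitrary-length whitespace - thanks @Martijn Pieters]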
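A small sketch of the difference, assuming Python 3 and the sample sentence from the question. The variable names `tokens` and `words` are illustrative only. Replacing commas with spaces before splitting is one simple way to avoid relying on the + 1.

```python
sen = "एक बार,एक कौआ, बहुत प्यासा, था"

# split() with no argument splits on runs of whitespace,
# so it returns the tokens themselves, not the gaps between them.
tokens = sen.split()
print(len(tokens))  # "बार,एक" stays glued together as one token

# Turning commas into spaces first separates such pairs.
words = sen.replace(",", " ").split()
print(len(words))
```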

danodonovan
  • In which case you'd use `.split()` to split on arbitrary-width whitespace. But that won't work because in the sample sentence is one comma without trailing whitespace, which may be permissible in Hindi for all we know. – Martijn Pieters Jul 16 '13 at 08:46
  • I can't depend only on spaces because, between words there can be multiple spaces. >>> sen='एक बार,एक कौआ, बहुत प्यासा, था'; >>> len(sen.split(" "))+1 8 – user2586432 Jul 16 '13 at 08:47
0

Using the third-party regex module (not the built-in re):

>>> import regex
>>> sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
>>> regex.findall(ur'\w+', sen.decode('utf-8'))
[u'\u090f\u0915', u'\u092c\u093e\u0930', u'\u090f\u0915', u'\u0915\u094c\u0906', u'\u092c\u0939\u0941\u0924', u'\u092a\u094d\u092f\u093e\u0938\u093e', u'\u0925\u093e']
>>> len(regex.findall(ur'\w+', sen.decode('utf-8')))
7
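If installing the third-party regex module is not an option, a standard-library sketch along the following lines may work. It assumes Python 3. It rests on the observation that the built-in re module's `\w` does not cover Devanagari vowel signs or the virama (Unicode categories Mc and Mn), which is a likely reason the `\b` count in the question came out higher than expected. The helpers `is_word_char` and `count_words` are hypothetical names, and treating every letter, mark, and number as a word character is a simplifying assumption.

```python
import unicodedata

def is_word_char(ch):
    # Treat letters (L*), combining marks (M*) and numbers (N*) as word characters.
    return unicodedata.category(ch)[0] in "LMN"

def count_words(sentence):
    """Count maximal runs of word characters (hypothetical helper)."""
    count = 0
    in_word = False
    for ch in sentence:
        if is_word_char(ch):
            if not in_word:
                count += 1  # a new run of word characters starts here
            in_word = True
        else:
            in_word = False
    return count

sen = "एक बार,एक कौआ, बहुत प्यासा, था"
print(count_words(sen))
```

Because combining marks count as part of a word here, a vowel sign no longer splits a word in two.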
falsetru