The sentence may include non-English characters, e.g. Chinese:
你好,hello world
The expected value for the length is 5
(2 Chinese characters, 2 English words, and 1 comma).
You can use the fact that most Chinese characters are located in the Unicode range 0x4e00 - 0x9fcc.
# -*- coding: utf-8 -*-
import re

s = u'你好 hello, world'

# First find all 'normal' (ASCII) words and punctuation marks.
# '[\x21-\x2f]' includes most punctuation; change it to ',' if you only need to match a comma.
# The word class is spelled out instead of \w so it stays ASCII-only on Python 2 and 3.
count = len(re.findall(r'[A-Za-z0-9_]+|[\x21-\x2f]', s))

# Then count every character in the Chinese range as one word.
# See https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed.
for ch in s:
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print(count)
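If more than one range is needed, the check above can be pulled into a small helper. This is only a sketch in Python 3: the names `CJK_RANGES`, `is_cjk` and `count_cjk` are made up for illustration, and the choice of blocks (the main CJK Unified Ideographs block, Extension A, and the compatibility ideographs) is an assumption you should adjust to the text you expect.

```python
# Hypothetical helper; the list of blocks is an assumption, extend it as needed.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(ch):
    """Return True if the single character ch lies in one of CJK_RANGES."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in CJK_RANGES)


def count_cjk(text):
    """Count the characters of text that is_cjk accepts."""
    return sum(1 for ch in text if is_cjk(ch))
```

With this, `count_cjk("你好 hello, world")` gives the number to add to the regex count of ASCII words and punctuation.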
If you're happy to treat each Chinese character as a separate word, even though that isn't always the case, you could do something like this by examining the Unicode general category of each character, using the unicodedata module.
For example, if you run this code on your example text:
# -*- coding: utf-8 -*-
import unicodedata

s = u"你好,hello world"
for c in s:
    # Print the two-letter Unicode general category of each character
    print(unicodedata.category(c))
You'll see that the Chinese characters are reported as Lo (letter, other), which is different from Latin characters, which would typically be reported as Ll or Lu.
Knowing that, you could consider anything that is Lo to be an individual word, even if it isn't separated by whitespace or punctuation.
Now this almost definitely won't work in all cases for all languages, but it may be good enough for your needs.
Update
Here is a more complete example of how you could do it:
# -*- coding: utf-8 -*-
import unicodedata

s = u"你好,hello world"
wordcount = 0
start = True  # True when the next ordinary character would begin a new word
for c in s:
    cat = unicodedata.category(c)
    if cat == 'Lo':          # Letter, other
        wordcount += 1       # each letter counted as a word
        start = True
    elif cat[0] == 'P':      # Some kind of punctuation
        wordcount += 1       # each punctuation mark counted as a word
        start = True
    elif cat[0] == 'Z':      # Some kind of separator
        start = True
    else:                    # Everything else
        if start:
            wordcount += 1   # Only count at the start
        start = False
print(wordcount)
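To make the earlier caveat concrete, here is the same loop wrapped in a function and run on two inputs. This is a sketch in Python 3, and the name `count_words` is made up for illustration. Lo is not specific to Chinese: Hangul syllables are also Lo, so a Korean word gets counted one syllable at a time.

```python
import unicodedata


def count_words(text):
    """Count words with the category rule from the loop above (hypothetical helper)."""
    count = 0
    start = True  # True when the next ordinary character would begin a new word
    for c in text:
        cat = unicodedata.category(c)
        if cat == 'Lo' or cat[0] == 'P':
            # 'Letter, other' and punctuation: each character is its own word
            count += 1
            start = True
        elif cat[0] == 'Z':
            # Separators end the current word
            start = True
        else:
            if start:
                count += 1
            start = False
    return count


print(count_words("你好,hello world"))  # 5
print(count_words("안녕 world"))         # 3: two Hangul syllables plus "world"
```

Whether the second result is acceptable depends on what you want to call a word in that script.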
There is a problem with the logic here:
你好
,
These are all characters, not words. For the Chinese characters you will probably need something like a regex.
The problem is that Chinese characters can be either parts of words or whole words.
大好
To a regex, is that one word or two? Each character is a word on its own, but together they also form one word.
hello world
If you split this on spaces, you get 2 words, but your Chinese regex might then not work.
I think the only way you can make this work for "words" is to work out the Chinese and English separately.
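A rough sketch of that separation in Python 3 might look like the following. The names `CJK_RUN` and `split_scripts` are made up for illustration, and using the range \u4e00-\u9fff to mean "Chinese" is an assumption. The function only pulls the two scripts apart; deciding where the word boundaries fall inside a Chinese run is left to whatever segmentation method you choose.

```python
import re

# Assumption: a "Chinese run" is one or more characters in \u4e00-\u9fff.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_scripts(text):
    """Return (chinese_runs, english_words) for a mixed string."""
    chinese_runs = CJK_RUN.findall(text)
    # Blank out the Chinese runs so they cannot glue English words together
    rest = CJK_RUN.sub(" ", text)
    english_words = re.findall(r"[A-Za-z0-9_]+", rest)
    return chinese_runs, english_words


chinese, english = split_scripts("大好 hello world")
print(chinese)  # ['大好']
print(english)  # ['hello', 'world']
```

The English side can be counted directly with `len(english)`, while each entry in `chinese` still needs a decision on whether it is one word or several.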