
The sentence may include non-English characters, e.g. Chinese:

你好,hello world

The expected value for the length is 5 (2 Chinese characters, 2 English words, and 1 comma).

liuzhijun
    Is there no space between the Chinese characters? Because then it could be quite impossible to distinguish between normal letters and Chinese characters. – BrtH May 13 '13 at 17:47
  • Sounds like you need to do quite a bit of NLP. Unfortunately, I'm not very aware of NLP libraries in python that support any Chinese language. So unless you have some pretty accurate heuristics about figuring out which Chinese characters are to be considered separate words, this quickly becomes impossible to do, given the current technology that I am aware of – inspectorG4dget May 13 '13 at 17:57
  • Yes, as the title said, I just want to find the length of an article. The article may include English words and Chinese characters. I think @James Holderness's answer can help me. – liuzhijun May 14 '13 at 04:19

3 Answers


You can use the fact that most Chinese characters are located in the Unicode range 0x4E00 - 0x9FCC.

# -*- coding: utf-8 -*-
import re

s = '你好 hello, world'
s = s.decode('utf-8')

# First find all 'normal' words and punctuation.
# '[\x21-\x2f]' covers most ASCII punctuation; change it to ',' if you
# only need to match a comma.
count = len(re.findall(r'\w+|[\x21-\x2f]', s))

# Iterating a string yields its characters directly.
for ch in s:
    # see https://stackoverflow.com/a/11415841/1248554 for additional ranges if needed
    if 0x4e00 <= ord(ch) <= 0x9fcc:
        count += 1

print count
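Under Python 3 the same idea needs a small adjustment: strings are already Unicode (no decode step), and `\w` also matches CJK characters, so the word regex must be restricted to ASCII to avoid double-counting. A minimal sketch (the function name `count_words` is mine; the 0x4E00-0x9FCC range is the one used above):

```python
import re

def count_words(s):
    # ASCII words and single ASCII punctuation marks. [A-Za-z0-9_]+ is used
    # instead of \w+ because in Python 3 \w would also match CJK characters.
    n = len(re.findall(r'[A-Za-z0-9_]+|[\x21-\x2f]', s))
    # Each character in the basic CJK Unified Ideographs block counts as one word.
    n += sum(1 for ch in s if 0x4e00 <= ord(ch) <= 0x9fcc)
    return n

print(count_words('你好 hello, world'))  # → 5
```

Note that, as discussed in the comments below, full-width punctuation such as , (U+FF0C) falls outside both ranges and is not counted by this sketch.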
BrtH
  • I quite like this answer, but I suspect you'll miss out on a fair amount of punctuation with that regex. If there are Chinese characters in the input, I think it's fair to expect punctuation to use the full width variants, at least some of the time. – James Holderness May 13 '13 at 18:38
  • @JamesHolderness: Luckily, full width punctuation falls in the range I use, it is [0xff00 - 0xffef](http://www.fileformat.info/info/unicode/block/halfwidth_and_fullwidth_forms/index.htm). But yes, I'm sure there are some things that are missing, but it depends on what liuzhijun needs. – BrtH May 13 '13 at 18:49
  • Surely not? You are matching `0x4e00` to `0x9fcc` - how does `0xff00` fall in that range? Am I missing something? That said, I do agree it depends on the OP's needs. – James Holderness May 13 '13 at 18:58
  • @JamesHolderness: Right, don't know why I thought that, you are right of course. – BrtH May 13 '13 at 19:03

If you're happy to consider each Chinese character as a separate word even though that isn't always the case, you could possibly accomplish something like this by examining the Unicode character property of each character, using the unicodedata module.

For example, if you run this code on your example text:

# -*- coding: utf-8 -*-

import unicodedata

s = u"你好,hello world"
for c in s:
  print unicodedata.category(c)

You'll see the Chinese characters are reported as Lo (Letter, other), which is different from Latin characters, which would typically be reported as Ll or Lu.

Knowing that, you could consider anything that is Lo to be an individual word, even if it isn't separated by whitespace/punctuation.

Now this almost definitely won't work in all cases for all languages, but it may be good enough for your needs.

Update

Here is a more complete example of how you could do it:

# -*- coding: utf-8 -*-

import unicodedata

s = u"你好,hello world"

wordcount = 0
start = True
for c in s:
  cat = unicodedata.category(c)
  if cat == 'Lo':        # Letter, other
    wordcount += 1       # each letter counted as a word
    start = True
  elif cat[0] == 'P':    # Some kind of punctuation
    wordcount += 1       # each punctuation mark counted as a word
    start = True
  elif cat[0] == 'Z':    # Some kind of separator
    start = True
  else:                  # Everything else
    if start:
      wordcount += 1     # Only count at the start
    start = False

print wordcount
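The same counter ports directly to Python 3, where strings are Unicode by default; only the print call changes. Wrapped here in a function (`word_count` is my name, not from the answer) so it can be reused:

```python
import unicodedata

def word_count(s):
    wordcount = 0
    start = True
    for c in s:
        cat = unicodedata.category(c)
        if cat == 'Lo':          # Letter, other: each CJK character is a word
            wordcount += 1
            start = True
        elif cat[0] == 'P':      # any punctuation mark counts as a word
            wordcount += 1
            start = True
        elif cat[0] == 'Z':      # any separator ends the current word
            start = True
        else:                    # everything else (Latin letters, digits, ...)
            if start:
                wordcount += 1   # only count at the start of a run
            start = False
    return wordcount

print(word_count('你好,hello world'))  # → 5
```

This handles the full-width comma in the question's example, since U+FF0C has category Po.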
James Holderness

There is a problem with the logic here:

你好
,

These are all characters, not words. For the Chinese characters you will probably need to do something with a regex.

The problem here is that Chinese characters might be whole words or parts of words.

大好

To a regex, is that one word or two? Each character alone is a word, but together they also form one word.

hello world

If you count this on spaces, then you get 2 words, but also your Chinese regex might not work.

I think the only way you can make this work for "words" is to work out the Chinese and English separately.
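A minimal Python 3 sketch of that split (the function name is mine; it treats every character in the basic CJK block as one word, which deliberately ignores the segmentation ambiguity described above — proper word segmentation would need an NLP library):

```python
import re

def split_counts(s):
    # Chinese: count each character in the basic CJK Unified Ideographs
    # block individually (the one-character-per-word simplification).
    cjk = re.findall(r'[\u4e00-\u9fff]', s)
    # English: count runs of Latin letters as words.
    words = re.findall(r'[A-Za-z]+', s)
    return len(cjk), len(words)

print(split_counts('大好 hello world'))  # → (2, 2)
```

The two counts can then be combined or reported separately, depending on what the OP needs.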

Austin T French
  • I think the OP, being located in China, likely knows the difference between Chinese characters and words. The question does refer to Chinese characters and English words. – Fred Larson May 13 '13 at 18:03
  • @FredLarson Being a Q&A site, answers are for posterity and not just the OP, correct? Also I was just spelling it out, because we all just need it a little more clearly sometimes to see where we went off track. – Austin T French May 13 '13 at 18:05