Chinese is different from English, so how can we split a Chinese paragraph into sentences in Python? A sample Chinese paragraph is given below:
我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。
(Roughly: "I am a Chinese paragraph; how should I be split into sentences? My principle is 'Google first, then ask'; I have already searched, but did not find a good answer.")
To the best of my knowledge,
from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")
does not work, because tokenize.sent_tokenize() does not support Chinese (NLTK's Punkt models do not include a Chinese one).
All the methods I found through a Google search rely on regular expressions, such as
re.split('(。|！|\!|\.|？|\?)', paragraph_variable)
Those methods are not complete enough. It seems that no single regular-expression pattern can split a Chinese paragraph into sentences correctly. I guess some learned patterns are needed to accomplish this task, but I can't find them.
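For illustration, the most complete regex-based splitter I could piece together looks something like the sketch below (the helper name split_chinese_sentences and the exact punctuation set are my own choices, not from any library). It keeps runs of terminal punctuation and any closing quotes such as ” attached to the sentence they end, but it still mishandles periods inside English abbreviations, numbers, or URLs embedded in the text.

import re

# Sketch of a regex-based splitter: capture runs of sentence-ending punctuation
# (。！？ and ASCII .!?…) plus any closing quotes/brackets, so they stay attached
# to the sentence they terminate. Still incomplete for abbreviations, numbers, etc.
_SENT_END = re.compile(r'([。！？!?\.…]+[”’」』）]*)')

def split_chinese_sentences(paragraph):
    parts = _SENT_END.split(paragraph)
    # re.split keeps the captured delimiters as separate list items;
    # glue each delimiter back onto the text that precedes it.
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1].strip():  # trailing text with no end punctuation
        sentences.append(parts[-1])
    return [s.strip() for s in sentences if s.strip()]

print(split_chinese_sentences(
    '我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。'))

Even so, this is just a pile of hand-written rules rather than anything learned from data, which is why I am asking whether a trained sentence splitter for Chinese already exists.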