
Since Chinese is different from English, how can we split a Chinese paragraph into sentences (in Python)? A sample Chinese paragraph is given below:

我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。

(Roughly: "I am a Chinese paragraph; how should I be split into sentences? My motto is 'Google first, then ask'; I have searched, but found no good answer.")

To the best of my knowledge,

from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")  # raises LookupError: no "chinese" Punkt model

does not work because tokenize.sent_tokenize() doesn't support Chinese.

All the methods I found through a Google search rely on regular expressions, such as

re.split(r'([。！？!?\.])', paragraph_variable)

Those methods are not complete enough. It seems that no single regular-expression pattern can split a Chinese paragraph into sentences correctly in all cases. I guess there should be some learned models that accomplish this task, but I can't find them.
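For reference, here is a minimal regex-based sketch along those lines; the function name split_chinese_sentences and the exact punctuation set are ad hoc choices, and it will still mis-split text such as decimal numbers that contain a half-width period:

import re

def split_chinese_sentences(paragraph):
    # Match runs of text up to a sentence terminator (full-width 。！？ or
    # half-width .!?), keeping the terminator and any closing quote with
    # its sentence; the final alternative catches a trailing fragment
    # that lacks a terminator.
    pattern = re.compile(r'[^。！？!?\.]*[。！？!?\.]+[”」』]?|[^。！？!?\.]+$')
    return [m.group().strip() for m in pattern.finditer(paragraph) if m.group().strip()]

paragraph = '我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。'
print(split_chinese_sentences(paragraph))
# ['我是中文段落,如何为我分句呢?', '我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。']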

Ian
  • What about converting to English and then doing it? – Rahul Agarwal Nov 14 '18 at 11:06
  • Can you post more info on Chinese paragraphs? How can they be detected? Are there always special signs? – user8408080 Nov 14 '18 at 11:08
  • related: https://stackoverflow.com/questions/27441191/splitting-chinese-document-into-sentences – EdChum Nov 14 '18 at 11:08
  • @RahulAgarwal, that's a possibility. But, the mapping between Chinese punctuation marks and English ones seems to be another problem. – Ian Nov 14 '18 at 11:23
  • I am not sure about the nature of the project, but if you are not doing this for educational purposes, then I suggest the Google Translate API, which will help you convert this... then apply Python to break it into sentences – Rahul Agarwal Nov 14 '18 at 11:29

0 Answers