
Since Chinese is different from English, how can we split a Chinese paragraph into sentences (in Python)? A sample Chinese paragraph is given below:

我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。

(Roughly: "I am a Chinese paragraph; how should I be split into sentences? My motto is 'Google first, then ask'; I have searched, but found no good answer.")

To the best of my knowledge,

from nltk import tokenize
tokenize.sent_tokenize(paragraph, "chinese")  # raises LookupError: no "chinese" Punkt model

does not work because tokenize.sent_tokenize() doesn't support Chinese.

All the methods I found through a Google search rely on regular expressions, such as

re.split(r'([。！？!?\.])', paragraph_variable)

Those methods are not complete enough. It seems that no single regular-expression pattern can split a Chinese paragraph into sentences correctly in all cases. I guess there should be some learned models that accomplish this task, but I can't find them.
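For reference, here is a minimal regex-based sketch along those lines; the function name split_chinese_sentences and the exact punctuation set are ad hoc choices, and it will still mis-split text such as decimal numbers that contain a half-width period:

import re

def split_chinese_sentences(paragraph):
    # Match runs of text up to a sentence terminator (full-width 。！？ or
    # half-width .!?), keeping the terminator and any closing quote with
    # its sentence; the final alternative catches a trailing fragment
    # that lacks a terminator.
    pattern = re.compile(r'[^。！？!?\.]*[。！？!?\.]+[”」』]?|[^。！？!?\.]+$')
    return [m.group().strip() for m in pattern.finditer(paragraph) if m.group().strip()]

paragraph = '我是中文段落,如何为我分句呢?我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。'
print(split_chinese_sentences(paragraph))
# ['我是中文段落,如何为我分句呢?', '我的宗旨是“先谷歌搜索,再来问问题”,我已经搜索了,但是没找到好的答案。']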

Ian
  • What about converting to English and then doing it? – Rahul Agarwal Nov 14 '18 at 11:06
  • Can you post more info on Chinese paragraphs? How can they be detected? Are there always special signs? – user8408080 Nov 14 '18 at 11:08
  • related: https://stackoverflow.com/questions/27441191/splitting-chinese-document-into-sentences – EdChum Nov 14 '18 at 11:08
  • @RahulAgarwal, that's a possibility. But, the mapping between Chinese punctuation marks and English ones seems to be another problem. – Ian Nov 14 '18 at 11:23
  • I am not sure about the nature of the project, but if you are not doing this for educational purposes, then I suggest the Google Translate API, which will help you convert this... then apply Python to break it into sentences – Rahul Agarwal Nov 14 '18 at 11:29

0 Answers