As the title says, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I understand it is a very difficult problem, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. it reveals little about what it is actually doing.
4 Answers
The keyword for Chinese text segmentation is 中文分词 in Chinese.
Good and active open-source text-segmentation libraries:
- 盘古分词 (Pan Gu Segment): C# (Snapshot)
- ik-analyzer: Java
- ICTCLAS: C/C++, Java, C# (Demo)
- NlpBamboo: C, PHP, PostgreSQL
- HTTPCWS: based on ICTCLAS (Demo)
- mmseg4j: Java
- fudannlp: Java (Demo)
- smallseg: Python, Java (Demo)
- nseg: Node.js
- mini-segmenter: Python
Other
Sample
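Google Chrome (Chromium): src, cc_cedict.txt (73,145 Chinese words/phrases)
In a text field or textarea of Google Chrome containing Chinese sentences, press Ctrl+← or Ctrl+→, or double-click on a sentence such as 中文分词指的是将一个汉字序列切分成一个一个单独的词 ("Chinese word segmentation means splitting a sequence of Chinese characters into individual words").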
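To give a feel for how a dictionary-based segmenter can work, here is a minimal sketch of forward maximum matching in Python. It is only an illustration: the function name, the `max_len` parameter, and the toy dictionary are my own assumptions, and I am not claiming this is what Chrome or any of the libraries above actually do.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match-first segmentation against a set of known words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (an assumption for this example); a real word list
# would contain tens of thousands of entries.
TOY_DICT = {"中文", "分词", "指的是", "一个", "汉字", "序列", "切分", "单独"}

sentence = "中文分词指的是将一个汉字序列切分成一个一个单独的词"
print("/".join(forward_max_match(sentence, TOY_DICT)))
```

Greedy matching like this is fast but picks the wrong split whenever a longer dictionary word overlaps the correct boundary, which is one reason the statistical approaches exist.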
- Good list. How about [smallseg](http://code.google.com/p/smallseg/), does it qualify as good and active? – Wang Dingwei May 19 '11 at 10:11
- Best `Chinese text-segmentation` library for Python? – lschin May 19 '11 at 10:29
- http://ictclas.org/index.html looks fantastic, even with part of speech – Sebastian May 19 '11 at 11:38
- What does Chrome use for segmenting text? – tofutim Aug 05 '12 at 00:43
The Stanford segmenter uses a CRF algorithm.
It is licensed under the GPL.
Link: http://nlp.stanford.edu/software/segmenter.shtml
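CRF-based segmenters are usually described as labeling each character and then reading the words off the labels. The sketch below shows only that last step, assuming a hypothetical two-label scheme ("B" starts a word, "I" continues it). The labels are hard-coded here; in a real system they would come from the trained model. This is not the Stanford tool's API or its actual label set.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character labels: 'B' begins a word, 'I' continues it."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # Start a new word (also covers a stray leading 'I').
            words.append(ch)
        else:
            # Extend the word currently being built.
            words[-1] += ch
    return words


# Hand-written labels standing in for a model's prediction (assumption).
print(tags_to_words("中文分词", "BIBI"))
```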

ICU has details on universal text segmentation - http://userguide.icu-project.org/boundaryanalysis

Cursory Googling for "text segmentation chinese open source" reveals this library, which may or may not be what you're looking for...:
http://sourceforge.net/projects/ktdictseg/
The results hint at a few alternative avenues for finding an open-source library, too:
- Searching for an open-source search implementation that might work with Chinese.
- Searching for an open-source plagiarism detection implementation that might work with Chinese.
