29

As phrased in the question, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I understand that this is a very difficult task, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. it reveals little about what it is actually doing.

madth3
Sebastian

4 Answers

30

The Chinese search keyword for Chinese text segmentation is 中文分词 ("Chinese word segmentation").

Good, actively maintained open-source text-segmentation projects:

  1. 盘古分词(Pan Gu Segment) : C#, Snapshot
  2. ik-analyzer : Java
  3. ICTCLAS : C/C++, Java, C#, Demo
  4. NlpBamboo : C, PHP, PostgreSQL
  5. HTTPCWS : based on ICTCLAS, Demo
  6. mmseg4j : Java
  7. fudannlp : Java, Demo
  8. smallseg : Python, Java, Demo
  9. nseg : NodeJS
  10. mini-segmenter : Python

Other

  1. Google Code : http://code.google.com/query/#q=中文分词
  2. OSChina (Open Source China)

Sample

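  1. Google Chrome (Chromium) : src, cc_cedict.txt (73,145 Chinese words/phrases)

    • In a text field or textarea of Google Chrome containing Chinese sentences, press Ctrl+← or Ctrl+→ to move the cursor one word at a time

    • Double-click inside 中文分词指的是将一个汉字序列切分成一个一个单独的词 ("Chinese word segmentation means splitting a sequence of Chinese characters into individual words") to select a single segmented word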
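To make the idea of dictionary-based segmentation concrete, here is a minimal sketch of forward maximum matching, one of the simplest dictionary approaches. It is not the code Chrome or any project listed above actually runs; the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary `TOY_DICT` are all made up for illustration. A real segmenter would load a large word list (such as the cc_cedict.txt mentioned above) and add ambiguity resolution.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that is in the dictionary; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, just large enough for the sample sentence.
TOY_DICT = {"中文", "分词", "指的是", "一个", "汉字", "序列", "切分", "单独"}
SAMPLE = "中文分词指的是将一个汉字序列切分成一个一个单独的词"

if __name__ == "__main__":
    print(" / ".join(forward_max_match(SAMPLE, TOY_DICT)))
```

Greedy matching is fast but gets ambiguous strings wrong whenever the longest match at a position is not the intended word, which is why the more serious tools combine dictionaries with statistical models.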

alvas
lschin
8

The Stanford Word Segmenter uses a CRF (conditional random field) model.

It is licensed under the GPL.

Project page: http://nlp.stanford.edu/software/segmenter.shtml
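For readers unfamiliar with how a sequence model such as a CRF can segment text: a common formulation treats segmentation as per-character tagging, where each character is labeled B (begins a word), M (middle), E (ends a word), or S (single-character word). The sketch below only shows that encoding and its decoding; it does not train or run a CRF, and the B/M/E/S scheme and the helper names `words_to_tags` and `tags_to_words` are assumptions for illustration, not necessarily the label set or API the Stanford segmenter uses.

```python
def words_to_tags(words):
    """Encode a segmented sentence as one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Decode per-character B/M/E/S tags back into a list of words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:
        words.append(current)  # tolerate a tag sequence that ends mid-word
    return words


if __name__ == "__main__":
    gold = ["中文", "分词", "指的是", "将"]
    tags = words_to_tags(gold)          # hand-made tags, not model output
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

In a real system the tags would come from the trained model's prediction for each character rather than from a known segmentation.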

ShanJay
1

ICU has details on universal text segmentation - http://userguide.icu-project.org/boundaryanalysis

Phyxx
0

Cursory Googling for "text segmentation chinese open source" turns up this library, which may or may not be what you're looking for:

http://sourceforge.net/projects/ktdictseg/

The results also hint at a few alternative avenues for finding an open-source library:

  • Searching for an open-source search implementation that might work with Chinese.
  • Searching for an open-source plagiarism detection implementation that might work with Chinese.
Denis de Bernardy