
I'm working on a little hobby Python project that involves creating dictionaries for various languages from large bodies of text written in those languages. For most languages this is relatively straightforward, because I can use the space delimiter between words to tokenize a paragraph into words for the dictionary, but Chinese, for example, does not use a space character between words. How can I tokenize a paragraph of Chinese text into words?
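To illustrate the gap, here's a minimal Python sketch (the example strings are made up for illustration):

```python
# Whitespace splitting works for languages like English...
english = "the quick brown fox"
print(english.split())   # ['the', 'quick', 'brown', 'fox']

# ...but Chinese text has no spaces between words, so split() returns
# the whole sentence as a single token.
chinese = "我喜欢学习语言"   # roughly "I like studying languages"
print(chinese.split())   # ['我喜欢学习语言']
```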

My searching has found that this is a somewhat complex problem, so I'm wondering if there are off-the-shelf solutions for this in Python, or elsewhere via an API, or in any other language. This must be a common problem, because any search engine built for Asian languages would need to overcome this issue in order to return relevant results.

I tried searching around with Google, but I'm not even sure what this type of tokenization is called, so my searches aren't turning up anything useful. Maybe just a nudge in the right direction would help.

asked by Dan Rice (question edited by David)
  • possible duplicate of [How to do a Python split() on languages (like Chinese) that don't use whtespace as word separator?](http://stackoverflow.com/questions/3797746/how-to-do-a-python-split-on-languages-like-chinese-that-dont-use-whtespace) – Niklas B. May 19 '12 at 21:49
  • 1
    Also check the link provided in a deleted answer to that question: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html – Niklas B. May 19 '12 at 21:52
  • @NiklasB.: I don't think so. The OP of the question you posted was looking for a way to split a string into characters. Mark Bryer's answer in that post seems like it may help, however. – Joel Cornett May 19 '12 at 21:57
  • 1
    @Joel: Hm I'm not sure. Quote: "I want to split a sentence into a list of words." You are right though that OP's own solution doesn't really solve the specific problem he asked about. He just uses the terms "word" and "character" as synonyms, which doesn't seem to be applicable to the Chinese language. Anyways, the answers there might be interesting. – Niklas B. May 19 '12 at 21:59
  • 2
    Considering that I don't speak a single language (wait a moment, latin should count theoretically!), that's guessing but that seems too ambiguous to solve with a hard and fast rule. I assume some NLP library is in order. Well or the simple solution with a dictionary in suffix tree form - that should be easy, although no idea how good it will work in practice – Voo May 19 '12 at 22:15
  • @Niklas B: OP only uses 'character' once, in 'space character'. What makes you say that he uses character as a synonym for word? – Junuxx May 19 '12 at 22:16
  • 1
    @Junuxx: In the question: "Each Chinese word/character has a corresponding unicode and is displayed on screen as an separate word/character.", "So obviously Python has no problem telling the word/character boundaries. I just need those words/characters in a list.". It becomes clearer if you look at OP's own answer, which suggests to just use `list` on the string. – Niklas B. May 19 '12 at 22:22
  • @NiklasB.: Misunderstanding on my part, I thought you meant OP of this question. – Junuxx May 19 '12 at 22:30
  • Can I suggest you narrow the scope of your question to just one Asian language? (E.g. it sounds like you are interested in Mandarin Chinese, not "character based languages?"). E.g. The correct answer for each of Japanese, Chinese and Korean is going to be different. (In fact even that is still too vague a question: what do you mean by "word"? Are you just interested in nouns? Do you want grammar words too? Do you want to normalize verbs into some kind of infinitive form?) – Darren Cook May 23 '12 at 23:09

1 Answer


Language tokenization is a key aspect of Natural Language Processing (NLP). This is a huge topic for major corporations and universities and has been the subject of numerous PhD theses.

I just submitted an edit to your question to add the 'nlp' tag. I suggest you take a look at the "about" page for the 'nlp' tag. You'll find links to sites such as the Natural Language Toolkit (NLTK), which includes a Python-based tokenizer.
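As a rough illustration of the kind of tokenizer NLTK ships with (this is a sketch, assuming the `nltk` package is installed and its `punkt` tokenizer model has been downloaded; it is English-oriented and will not segment Chinese by itself):

```python
# Sketch of NLTK's word tokenizer (assumes: pip install nltk).
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer model

print(word_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
# Note: this relies on whitespace and punctuation rules, so Chinese text
# would still need a language-specific segmenter.
```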

You can also search Google for terms like: "language tokenization" AND NLP.
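The specific task in the question is usually called Chinese word segmentation, which is a useful search term. As one concrete, hedged example of what an off-the-shelf segmenter looks like (the `jieba` package is my own suggestion here, not something mentioned above, and it is only one of several options):

```python
# Sketch using the third-party jieba library (pip install jieba),
# a dictionary-based Chinese word segmenter.
import jieba

paragraph = "我喜欢学习语言"          # roughly "I like studying languages"
words = jieba.lcut(paragraph)        # lcut() returns a plain list of tokens
print(words)                         # typically something like ['我', '喜欢', '学习', '语言']
```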

David