18
ഇതുഒരുസ്ടലംമാണ്  

itu oru stalam anu

This is a Unicode string meaning "this is a place".

import nltk
nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))

is not working for me.

nltk.word_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))

is also not working. Other examples:

"കണ്ടില്ല "  = കണ്ടു +ഇല്ല,
"വലിയൊരു"  = വലിയ + ഒരു

Right split:

ഇത്  ഒരു സ്ഥാലം ആണ് 

output:

[u'\u0d07\u0d24\u0d4d\u0d12\u0d30\u0d41\u0d38\u0d4d\u0d25\u0d32\u0d02\u0d06\u0d23\u0d4d']

I just need to split the words as shown in the other examples; the other-examples section is for testing. The problem is not with Unicode, it is with the morphology of the language. For this you need to use a morphological analyzer.
Have a look at this paper: http://link.springer.com/chapter/10.1007%2F978-3-642-27872-3_38

Andrew Barber
  • 39,603
  • 20
  • 94
  • 123
  • Is your source file in unicode? Try declaring this at the top of the file `# -*- coding: utf-8 -*- ` – StoryTeller - Unslander Monica Oct 22 '13 at 10:55
  • @StoryTeller let me check –  Oct 22 '13 at 10:56
  • Also, if this is python 2.* a unicode string should be prefixed with `u`, like this `u'ഇത്ഒരുസ്ഥലംആണ്'` – StoryTeller - Unslander Monica Oct 22 '13 at 10:57
  • @StoryTeller it is not required. the value is printing in IDLE –  Oct 22 '13 at 10:57
  • 2
    Explain "not working" – R. Martinho Fernandes Oct 22 '13 at 11:35
  • 3
    Even without the `u`, this works fine when the source encoding is UTF-8. – Fred Foo Oct 22 '13 at 12:13
  • @StoryTeller, you should be careful about intermixing "Unicode" and "UTF-8" - Unicode != UTF-8. `u'€'` and `'€'` are very different. One does not need to use Unicode Python types to hold UTF-8/UTF-16 types – Alastair McCormack Oct 22 '13 at 14:22
  • 1
    @karu, we can see that your string is being correctly decoded from UTF-8: ഇ = \u0d07. I've tested the same thing on my Ubuntu 13.04 box and get a list of multiple items from `wordpunct_tokenize`. `nltk.__version__ = '2.0b9'`. What version are you using? – Alastair McCormack Oct 22 '13 at 14:44
  • And what about nltk? (`nltk.__version__`) – Alastair McCormack Oct 22 '13 at 15:29
  • Can you also paste the result of: print re.search("(\w+)", "ഇത്ഒരുസ്ഥലംആണ്".decode("utf8"), re.U).groups() – Alastair McCormack Oct 22 '13 at 17:42
  • @karu, the result I get from `wordpunct_tokenize` is `[u'\u0d07\u0d24', u'\u0d4d', u'\u0d12\u0d30', u'\u0d41', u'\u0d38', u'\u0d4d', u'\u0d25\u0d32', u'\u0d02', u'\u0d06\u0d23', u'\u0d4d']`. I don't know if this is valid sentence structure in Malayalam. `wordpunct_tokenize` seems to just rely on the Unicode definitions for word spacing. – Alastair McCormack Oct 24 '13 at 11:46
  • What is your _expected_ output? – Games Brainiac Oct 25 '13 at 08:36
  • 1
    You don't have spaces in the input string? There should be spaces for this input to make sense - It should be `ഇത് ഒരു സ്ഥലം ആണ്` (I know the language). – Hari Menon Oct 25 '13 at 08:38
  • So what you are asking, equivalently in english is something to produce `['this', 'is', 'a', 'place']` from `thisisaplace`? – Hari Menon Oct 25 '13 at 08:43
  • 5
    @karu, Ok I get it now. Tokenizing is the wrong word. You need a morphological processor to do this, and I think morphological processors for Indian languages are an area under active research. You can try searching 'malayalam language morphology nlp' on Google to get started. You might also try rephrasing the question to focus on the morphology side rather than the Unicode, because Unicode is not the problem here, and people are getting distracted by that bit. – Hari Menon Oct 25 '13 at 09:03
  • It would be a pity if that 100 bounty is gone just like that because there is apparently no answer. Probably @HariShankar can convert the last comment into an answer so the OP can give the bounty to you? – justhalf Oct 25 '13 at 09:11

6 Answers

21

After a crash course on the language from Wikipedia (http://en.wikipedia.org/wiki/Malayalam), I see some issues in your question and in the tools you've requested for your desired output.

Conflated Task

Firstly, the OP conflated the tasks of morphological analysis, segmentation and tokenization. Often there is only a fine distinction between them, especially for agglutinative languages such as Turkish and Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).

Agglutinative NLP and best practices

Next, I don't think a tokenizer is appropriate for Malayalam, an agglutinative language. For Turkish, one of the most studied agglutinative languages in NLP, researchers adopted a different strategy when it comes to "tokenization": they found that a full-blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf).

Word Boundaries

Tokenization is defined as the identification of linguistically meaningful units (LMUs) from the surface text (see Why do I need a tokenizer for each language?), and different languages require different tokenizers to identify their word boundaries. People have approached the word-boundary problem in different ways, but in summary the NLP community has subscribed to the following:

  1. Agglutinative languages require a full-blown morphological analyzer trained with some sort of language model. There is often only a single tier when identifying what a token is, and that is at the morphemic level, hence the NLP community has developed different language models for their respective morphological analysis tools.

  2. Polysynthetic languages with specified word boundaries have the option of two-tier tokenization, where the system can first identify isolated words and then, if necessary, perform morphological analysis to obtain finer-grained tokens. A coarse-grained tokenizer can split a string on certain delimiters (e.g. NLTK's word_tokenize or wordpunct_tokenize, which use whitespace/punctuation for English); a quick illustration follows this list. Then, for finer-grained analysis at the morphemic level, people usually use finite-state machines to split words into morphemes (e.g. for German: http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html).

  3. Polysynthetic languages without specified word boundaries often require a segmenter first, to add whitespace between the tokens, because the orthography doesn't differentiate word boundaries (e.g. for Chinese: https://code.google.com/p/mini-segmenter/). Then, from the delimited tokens, morphemic analysis can be done if necessary to produce finer-grained tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often these finer-grained tokens are tied to POS tags.
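
For a language that marks word boundaries with whitespace, the coarse-grained tier in point 2 is exactly what NLTK's stock tokenizers provide; here is a quick illustration (wordpunct_tokenize is purely regex-based, so it needs no extra data files):

import nltk

# English marks word boundaries with spaces, so a coarse-grained split is enough
print(nltk.wordpunct_tokenize("This is a place."))
# ['This', 'is', 'a', 'place', '.']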

In brief, the answer to the OP's request/question is that the OP has used the wrong tools for the task:

  • To output tokens for Malayalam, a morphological analyzer is necessary; a simple coarse-grained tokenizer in NLTK will not work.
  • NLTK's tokenizers are meant to tokenize languages with specified word boundaries (e.g. English/European languages), so it is not that the tokenizer is not working for Malayalam, it just wasn't meant to tokenize agglutinative languages.
  • To achieve the desired output, a full-blown morphological analyzer needs to be built for the language, and someone has built one (aclweb.org/anthology//O/O12/O12-1028.pdf); the OP should contact the author of the paper if he/she is interested in the tool.
  • Short of building a morphological analyzer with a language model, I encourage the OP to first spot common delimiters that split words into morphemes in the language and then perform a simple re.split() to obtain a baseline tokenizer (a rough sketch follows).
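
A rough sketch of such a baseline tokenizer, assuming a hand-picked delimiter list (the single delimiter below is just a placeholder lifted from the example string, not a curated set; choosing real morpheme boundaries needs knowledge of Malayalam):

import re

# Placeholder delimiters, NOT a real inventory for Malayalam.
DELIMS = [u'\u0d41']
PATTERN = u'(%s)' % u'|'.join(map(re.escape, DELIMS))

def baseline_tokenize(text):
    tokens = []
    for chunk in text.split():                     # coarse split on whitespace
        # finer split on the delimiter characters, keeping them as tokens
        tokens.extend(t for t in re.split(PATTERN, chunk) if t)
    return tokens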
alvas
  • 115,346
  • 109
  • 446
  • 738
  • Nice explication, but I wouldn't diagnose "wrong tools for the task" without knowing what task the OP is actually after. A "token" is a unit of whatever granularity one is interested in; the nltk even "tokenizes" text into sentences. The OP said he/she is interested in word boundaries. – alexis Oct 31 '13 at 13:07
  • 1
    `str.split()` doesn't work with multiple delimiters, so I would suggest `re.split()` – Ramchandra Apte Nov 01 '13 at 07:28
4

A tokenizer is indeed the right tool; certainly this is what the NLTK calls them. A morphological analyzer (as in the article you link to) is for breaking words into smaller parts (morphemes). But in your example code, you tried to use a tokenizer that is appropriate for English: It recognizes space-delimited words and punctuation tokens. Since Malayalam evidently doesn't indicate word boundaries with spaces, or with anything else, you need a different approach.

So the NLTK doesn't provide anything that detects word boundaries for Malayalam. It might provide the tools to build a decent one fairly easily, though.

The obvious approach would be to try dictionary lookup: Try to break up your input into strings that are in the dictionary. But it would be harder than it sounds: You'd need a very large dictionary, you'd still have to deal with unknown words somehow, and since Malayalam has non-trivial morphology, you may need a morphological analyzer to match inflected words to the dictionary. Assuming you can store or generate every word form with your dictionary, you can use an algorithm like the one described here (and already mentioned by @amp) to divide your input into a sequence of words.
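
A crude greedy version of that dictionary lookup, just to make the idea concrete (the toy English lexicon stands in for a real Malayalam dictionary; the linked algorithm handles ambiguity more carefully than longest-match-first does):

def segment(text, lexicon):
    # Greedy longest-match-first segmentation: not robust, but shows the idea.
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:                                    # no dictionary word starts here
            words.append(text[i])                # emit one character and move on
            i += 1
    return words

print(segment('thisisaplace', {'this', 'is', 'a', 'place'}))
# ['this', 'is', 'a', 'place']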

A better alternative would be to use a statistical algorithm that can guess where the word boundaries are. I don't know of such a module in the NLTK, but there has been quite a bit of work on this for Chinese. If it's worth your trouble, you can find a suitable algorithm and train it to work on Malayalam.

In short: The NLTK tokenizers only work for the typographical style of English. You can train a suitable tool to work on Malayalam, but the NLTK does not include such a tool as far as I know.

PS. The NLTK does come with several statistical tokenization tools; the PunktSentenceTokenizer can be trained to recognize sentence boundaries using an unsupervised learning algorithm (meaning you don't need to mark the boundaries in the training data). Unfortunately, the algorithm specifically targets the issue of abbreviations, and so it cannot be adapted to word boundary detection.
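
For what it's worth, training Punkt on raw text looks roughly like this (a minimal sketch; 'my_corpus.txt' is a placeholder for your own unannotated training text):

from nltk.tokenize.punkt import PunktSentenceTokenizer

raw_text = open('my_corpus.txt').read()        # unannotated text, no boundary marks
sent_tok = PunktSentenceTokenizer(raw_text)    # unsupervised training happens here
print(sent_tok.tokenize("Dr. Smith went home. He was tired."))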

alexis
  • 48,685
  • 16
  • 101
  • 161
3

Maybe the Viterbi algorithm could help?

This answer to another SO question (and the other high-vote answer) could help: https://stackoverflow.com/a/481773/583834
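
The idea behind that answer, sketched very roughly: score every possible split by unigram word probabilities and pick the best one with dynamic programming. The probabilities below are invented for illustration; a real system would estimate them from a Malayalam corpus.

import math

word_probs = {'this': 0.05, 'is': 0.06, 'a': 0.07, 'place': 0.01}   # toy numbers

def viterbi_segment(text, max_len=20):
    # best[i] = (log-probability, start index) of the best segmentation of text[:i]
    best = [(0.0, 0)] + [(float('-inf'), 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_probs:
                score = best[j][0] + math.log(word_probs[word])
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[len(text)][0] == float('-inf'):
        return None                               # no full segmentation found
    words, i = [], len(text)
    while i > 0:                                  # walk back through split points
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(viterbi_segment('thisisaplace'))            # ['this', 'is', 'a', 'place']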

arturomp
  • 28,790
  • 10
  • 43
  • 72
3

It seems like your space is the Unicode character u'\u0d41', so you can split on it with str.split().

import sys
reload(sys)
sys.setdefaultencoding("utf-8")   # Python 2-only hack so the print below can handle Unicode

x = 'ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8')
y = x.split(u'\u0d41')            # split on the vowel sign U+0D41
print " ".join(y)

[out]:

ഇത ഒര സ്ഥാലമാണ്
alvas
  • 115,346
  • 109
  • 446
  • 738
  • This method is called rule-based. It will not work, as the language has rich morphology –  Oct 30 '13 at 08:55
1

I tried the following:

# encoding=utf-8

import nltk
cheese = nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8'))
for var in cheese:
    print var.encode('utf8'),

And as output, I got the following:

ഇത ു ഒര ു സ ് ഥ ാ ലമ ാ ണ ്

Is this anywhere close to the output that you want? I'm a little in the dark here, since it's difficult to get this right without understanding the language.

Games Brainiac
  • 80,178
  • 33
  • 141
  • 199
  • The [module documentation](http://nltk.org/api/nltk.tokenize.html) explicitly warns against passing utf8: You must use real unicode strings, not utf8 bytes. (See second "Caution"). You are getting nonsense results. – alexis Oct 29 '13 at 11:32
0

Morphological analysis example

from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")

Gives

[('കേരളം<np><genitive>', 179)]

URL: https://gitlab.com/smc/mlmorph

If you are using Anaconda, install git from the Anaconda prompt:

conda install -c anaconda git

then clone the repository using the following command:

git clone https://gitlab.com/smc/mlmorph.git
Abhijith M
  • 743
  • 5
  • 5