Training nltk tagger with Indian POS data

Question

I am trying to train a tagger with nltk-trainer on the Indian corpus. I am mainly targeting telugu.pos.

I followed http://nltk-trainer.readthedocs.io/en/latest/train_tagger.html and trained. Here is the snapshot

When I tried to test it with the Telugu text నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ. (in English: My name is Karim. I love Indian food.), it gave this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

Am I doing something wrong?
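The byte 0xe0 at position 0 in the error matches the first byte of the UTF-8 encoding of the first Telugu character in the sentence, which suggests the input was a byte string being decoded with the default ASCII codec. A minimal sketch of that behaviour, using only the first word of the sentence (the names `word`, `raw`, `ascii_error`, and `decoded` are illustrative, not taken from the question):

```python
# -*- coding: utf-8 -*-
# First word of the test sentence, written with escapes so that the
# encoding of this source file does not matter.
word = u"\u0c28\u0c3e"

# UTF-8 encodes these two code points as six bytes; the first is 0xe0.
raw = word.encode("utf-8")

# Decoding the bytes as ASCII fails on the very first byte.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding the same bytes as UTF-8 recovers the original text.
decoded = raw.decode("utf-8")

print(ascii_error)
print(decoded == word)
```

If this is indeed the cause, decoding the input as UTF-8 before tokenizing should avoid the error.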

Edit

I edited the text

sent = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')

Now it gives a result like this:

>>> text = nltk.word_tokenize(sent)
>>> text
[u'\u0c28\u0c3e', u'\u0c2a\u0c47\u0c30\u0c41', u'\u0c15\u0c30\u0c40\u0c02', u'\u0c09\u0c02\u0c26\u0c3f', u'.', u'\u0c28\u0c47\u0c28\u0c41', u'\u0c2d\u0c3e\u0c30\u0c24', u'\u0c06\u0c39\u0c3e\u0c30', u'\u0c2a\u0c4d\u0c30\u0c47\u0c2e', u'.']
>>> nltk.pos_tag(text)
[(u'\u0c28\u0c3e', 'JJ'), (u'\u0c2a\u0c47\u0c30\u0c41', 'NNP'), (u'\u0c15\u0c30\u0c40\u0c02', 'NNP'), (u'\u0c09\u0c02\u0c26\u0c3f', 'NNP'), (u'.', '.'), (u'\u0c28\u0c47\u0c28\u0c41', 'VB'), (u'\u0c2d\u0c3e\u0c30\u0c24', 'JJ'), (u'\u0c06\u0c39\u0c3e\u0c30', 'NNP'), (u'\u0c2a\u0c4d\u0c30\u0c47\u0c2e', 'NNP'), (u'.', '.')]

How can I print this content in the original language?

It seems like Wiktor Stribiżew is right; try changing your terminal $LANG to UTF-8. — Nathan McCoy, Nov 17 '16 at 12:46
Thanks, it helped. I updated my question. The problem still persists in the result. I appreciate your help. @WiktorStribiżew — user123, Nov 17 '16 at 12:49
Encode each result as UTF8. Play with `.encode()`/`decode()` to see the difference. — Wiktor Stribiżew, Nov 17 '16 at 12:59
Something like `s = nltk.pos_tag(text)` and `print([(x.encode('utf-8'), y) for x,y in s])` — Wiktor Stribiżew, Nov 17 '16 at 13:40
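The escapes in the output appear because printing a list shows the repr of each element, and in Python 2 the repr of a unicode string escapes non-ASCII characters. Printing a string built from the pairs shows the characters themselves, provided the terminal encoding is UTF-8. A minimal sketch, assuming the tagged pairs from the question (the helper `format_tagged` is hypothetical, not part of NLTK, and only the first sentence is reproduced):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Tagged output copied from the question (first sentence only),
# written with the same escapes as in the interpreter output.
tagged = [
    ('\u0c28\u0c3e', 'JJ'),
    ('\u0c2a\u0c47\u0c30\u0c41', 'NNP'),
    ('\u0c15\u0c30\u0c40\u0c02', 'NNP'),
    ('\u0c09\u0c02\u0c26\u0c3f', 'NNP'),
    ('.', '.'),
]


def format_tagged(pairs):
    """Join (word, tag) pairs into one 'word/TAG word/TAG' string."""
    return ' '.join('{0}/{1}'.format(word, tag) for word, tag in pairs)


line = format_tagged(tagged)

# Printing the string (not the list) displays the Telugu text itself.
print(line)
```

The same idea applies to the full result of `nltk.pos_tag(text)`: pass it to the helper in place of `tagged`.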
