I am trying to train a tagger with nltk-trainer on the Indian corpus. I am mainly targeting telugu.pos.
I followed http://nltk-trainer.readthedocs.io/en/latest/train_tagger.html and trained the tagger. Here is the snapshot.
When I tried to test it with this Telugu text:
నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.
(in English: My name is Karim. I love Indian food.)
it gives this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Am I doing something wrong?
Edit:
I changed the text to a decoded unicode string:
sent = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')
Now it gives a result like this:
>>> text = nltk.word_tokenize(sent)
>>> text
[u'\u0c28\u0c3e', u'\u0c2a\u0c47\u0c30\u0c41', u'\u0c15\u0c30\u0c40\u0c02', u'\u0c09\u0c02\u0c26\u0c3f', u'.', u'\u0c28\u0c47\u0c28\u0c41', u'\u0c2d\u0c3e\u0c30\u0c24', u'\u0c06\u0c39\u0c3e\u0c30', u'\u0c2a\u0c4d\u0c30\u0c47\u0c2e', u'.']
>>> nltk.pos_tag(text)
[(u'\u0c28\u0c3e', 'JJ'), (u'\u0c2a\u0c47\u0c30\u0c41', 'NNP'), (u'\u0c15\u0c30\u0c40\u0c02', 'NNP'), (u'\u0c09\u0c02\u0c26\u0c3f', 'NNP'), (u'.', '.'), (u'\u0c28\u0c47\u0c28\u0c41', 'VB'), (u'\u0c2d\u0c3e\u0c30\u0c24', 'JJ'), (u'\u0c06\u0c39\u0c3e\u0c30', 'NNP'), (u'\u0c2a\u0c4d\u0c30\u0c47\u0c2e', 'NNP'), (u'.', '.')]
How can I print this content in the original language?
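For context, the interpreter echoes a list by showing the repr of each element, which is why the tokens appear as \u escapes; printing each string individually should show the Telugu text. Below is a minimal sketch of what I mean, assuming the terminal uses UTF-8. The helper name format_tagged is hypothetical (not part of NLTK), and the sample pairs are copied from the output above.

```python
# -*- coding: utf-8 -*-
# Sample (word, tag) pairs copied from the pos_tag output above.
tagged = [
    (u'\u0c28\u0c3e', 'JJ'),
    (u'\u0c2a\u0c47\u0c30\u0c41', 'NNP'),
    (u'.', '.'),
]


def format_tagged(pairs):
    """Join (word, tag) pairs into a single 'word/tag word/tag' unicode string.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    return u' '.join(u'{0}/{1}'.format(word, tag) for word, tag in pairs)


line = format_tagged(tagged)

# Printing the string itself (not the list that contains it) writes the
# actual characters instead of the escaped repr. This assumes the output
# stream can encode Telugu, for example a UTF-8 terminal.
print(line)
```

If the output is piped or the terminal is not UTF-8, Python 2 may still raise a UnicodeEncodeError here, so the terminal encoding is an assumption worth checking.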