I have recently started a project on Hindi text processing. I tried running the code below, but I did not get the output I expected.
# -*- coding: utf-8 -*-
import nltk

e = u"पूर्ण प्रतिबंध हटाओ : इराक"
tokens = nltk.word_tokenize(e)  # split the sentence into tokens
print tokens
tag = nltk.pos_tag(tokens)  # tag with NLTK's default tagger
print tag
The output I have obtained is shown below:
[u'\u092a\u0942\u0930\u094d\u0923', u'\u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927', u'\u0939\u091f\u093e\u0913', u':', u'\u0907\u0930\u093e\u0915']
[(u'\u092a\u0942\u0930\u094d\u0923', 'NN'), (u'\u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927', '``'), (u'\u0939\u091f\u093e\u0913', ':'), (u':', ':'), (u'\u0907\u0930\u093e\u0915', ':')]
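The \u escapes in the output are only how Python 2 displays Unicode strings inside a list, so the tokenizer itself appears to split the sentence correctly. Below is a minimal sketch of how I checked that. It uses only the standard library, not NLTK, and the helper name first_char_name is my own.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# Token list copied from the first line of the output above.
tokens = [
    u'\u092a\u0942\u0930\u094d\u0923',
    u'\u092a\u094d\u0930\u0924\u093f\u092c\u0902\u0927',
    u'\u0939\u091f\u093e\u0913',
    u':',
    u'\u0907\u0930\u093e\u0915',
]


def first_char_name(token):
    """Return the Unicode name of the token's first character."""
    return unicodedata.name(token[0])


for token in tokens:
    # ASCII-only output, so this prints safely on any terminal.
    print(len(token), first_char_name(token))

# On a UTF-8 terminal, this shows the original sentence again:
# print(u" ".join(tokens))
```

Every token except the colon starts with a Devanagari letter, so the problem seems to be in the tagging step, not in tokenization or encoding.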
My code tags the first word of the input as a noun (NN), but the remaining tokens are tagged incorrectly: the Hindi words get punctuation tags such as `` and :. The same code gives correct output for English text.
What am I doing wrong? Is there a specific function that I have to use for tagging Hindi data?
Thank you for your help.