I'm tagging some Unicode text with Python NLTK. The issue is that the text comes from data sources that are badly encoded and do not specify the encoding. After some messing around, I figured out that the text must be in UTF-8. Given the input string:
s = u"The problem isn’t getting to Huancavelica from Huancayo to the north."
I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:
The/DT problem/NN isn’t/NN getting/VBG
Instead of:
The/DT problem/NN isn't/VBG getting/VBG
How do I clean these special characters out of the text?
Thanks for any feedback,
Mulone
UPDATE: If I run HTMLParser().unescape(s), I get:
u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'
In other cases, I still get things like &amp; and &nbsp; in the text.
What do I need to do to translate this into something that NLTK will understand?
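
To make the goal concrete, here is a minimal sketch of the kind of cleanup being asked about. It rests on several assumptions: Python 3, where html.unescape stands in for HTMLParser().unescape; input in which the apostrophe arrives as the numeric entity &#8217;; and a hypothetical helper clean_text with a hand-picked PUNCT_MAP table, neither of which is part of NLTK. It is one possible approach, not necessarily the right one for every data source.

```python
import html

# Hypothetical mapping from typographic punctuation to plain ASCII.
# Extend it with whatever characters actually occur in the data.
PUNCT_MAP = {
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark (curly apostrophe)
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x2013: "-",  # en dash
    0x2014: "-",  # em dash
    0x00A0: " ",  # no-break space
}


def clean_text(text):
    """Unescape HTML entities, then map typographic punctuation to ASCII."""
    previous = None
    # Unescape until nothing changes, on the assumption that some input
    # is escaped more than once (for example "&amp;amp;").
    while text != previous:
        previous = text
        text = html.unescape(text)
    return text.translate(PUNCT_MAP)


# Assumed form of the raw input: the apostrophe as a numeric entity.
s = "The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."
print(clean_text(s))
```

The cleaned string contains a plain ASCII apostrophe, so it could then be passed on to the tokenizer and tagger. Mapping characters by hand keeps the replacement explicit, at the cost of having to list every character that matters.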