I'm tagging some Unicode text with Python NLTK. The issue is that the text comes from badly encoded data sources that do not specify their encoding. After some experimenting, I figured out that the text must be in UTF-8. Given the input string:

 s = u"The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."

I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:

The/DT problem/NN isn&#8217;t/NN getting/VBG

Instead of:

The/DT problem/NN isn't/VBG getting/VBG

How do I clean these special characters out of the text?

Thanks for any feedback,

Mulone

UPDATE: If I run HTMLParser().unescape(s), I get:

 u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'

In other cases, I still get things like &amp; and &#13; in the text. What do I need to do to translate this into something that NLTK will understand?


1 Answer

This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character references, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.

If you're not bound to any library, see Decode HTML entities in Python string?

The resulting string includes a typographic apostrophe (U+2019) instead of an ASCII single quote. You can just replace it in the result:

In [5]: import HTMLParser

In [6]: s = u"isn&#8217;t"

In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t

In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't
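The session above is Python 2. On Python 3, the same two steps might look like the sketch below, which assumes the standard-library `html.unescape` in place of `HTMLParser().unescape`; the helper name `to_ascii_apostrophe` is made up for illustration.

```python
import html


def to_ascii_apostrophe(text):
    """Decode HTML character references, then swap U+2019 for an ASCII apostrophe."""
    decoded = html.unescape(text)  # "&#8217;" becomes the character U+2019
    return decoded.replace("\u2019", "'")


print(to_ascii_apostrophe("isn&#8217;t"))
```

Whether replacing the typographic apostrophe is desirable depends on the tagger; here it is done only because the question expects `isn't` as a token.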

Unescape will take care of the rest of the characters. For example, &amp; is the & symbol itself. &#13; is a CR symbol (\r) and can be either ignored or converted into a newline, depending on where the original text comes from (old Macs used it for newlines).
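As a rough illustration of the two options for the CR character, here is a Python 3 sketch. The function name `unescape_and_normalize` and its `cr_as_newline` flag are hypothetical, and `html.unescape` is assumed as the decoding step.

```python
import html


def unescape_and_normalize(text, cr_as_newline=True):
    """Decode character references, then either convert or drop carriage returns."""
    decoded = html.unescape(text)  # "&amp;" -> "&", "&#13;" -> "\r"
    # Collapse CRLF first so a Windows line ending does not turn into two newlines.
    decoded = decoded.replace("\r\n", "\n")
    return decoded.replace("\r", "\n" if cr_as_newline else "")


print(repr(unescape_and_normalize("a &amp; b&#13;c")))
print(repr(unescape_and_normalize("a &amp; b&#13;c", cr_as_newline=False)))
```

Which branch is right depends on the data source: converting suits text where CR marked line ends, dropping suits text where it is stray noise.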

  • If I use `HTMLParser().unescape(s)`, I get: `u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'` – Mulone Apr 11 '13 at 12:00
  • And that's fine - that's exactly what the text is. If you print it rather than show the variable in the REPL, you will see "isn’t". That isn't the typical ASCII apostrophe, but you can replace it with one if needed. – viraptor Apr 11 '13 at 12:04
  • Check `print HTMLParser.HTMLParser().unescape(s)` -vs- `print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")` – viraptor Apr 11 '13 at 12:05