Yet another unicode mess in Python

Question

I'm tagging some Unicode text with Python NLTK. The text comes from data sources that are badly encoded and do not specify their encoding. After some experimenting, I figured out that the text must be UTF-8. Given the input string:

 s = u"The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."

I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:

The/DT problem/NN isn&#8217;t/NN getting/VBG

Instead of:

The/DT problem/NN isn't/VBG getting/VBG

How do I clean these special characters out of the text?

Thanks for any feedback,

Mulone

UPDATE: If I run HTMLParser().unescape(s), I get:

 u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'

In other cases, I still get things like &amp; and &#13; in the text. What do I need to do to translate this into something that NLTK will understand?

Nope, your example input text is transformed to Unicode fully by your code. I don't see any `&#....;` escapes left. Is your example text what is *returned* by your method? — Martijn Pieters, Apr 11 '13 at 11:02
Actually I'm storing that text in a file, writing in an XML file, and then reading it again, all of which using lxml. — Mulone, Apr 11 '13 at 11:06
Try something like `txt = lec.decode('utf8').encode('latin9')` — f p, Apr 11 '13 at 11:26

score 4 · Accepted Answer · edited May 23 '17 at 11:57

This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character references, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.
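As a rough illustration of a parser dereferencing such a reference, here is a minimal sketch. It assumes Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml; the one-element document and the variable names are hypothetical, not the asker's actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the stored XML file.
doc = "<text>The problem isn&#8217;t getting to Huancavelica.</text>"

# The parser resolves numeric character references while reading the markup,
# so the element text already contains the real character U+2019.
element = ET.fromstring(doc)
parsed = element.text

print(repr(parsed))
```

If references still show up in text obtained this way, the data was probably escaped more than once before it was stored, in which case an explicit unescape step is still needed.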

If you're not bound to any library, see Decode HTML entities in Python string?

The resulting string contains a typographic apostrophe (U+2019, right single quotation mark) instead of an ASCII single quote. You can just replace it in the result:

In [5]: import HTMLParser  # Python 2 standard library module

In [6]: s = u"isn&#8217;t"

In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t

In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't

unescape will take care of the rest of the character references. For example, &amp; is the & symbol itself. &#13; is a CR character (\r), which can be either ignored or converted into a newline, depending on where the original text comes from (old Macs used it for newlines).
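Putting these steps together, a minimal sketch of a cleanup helper might look like the following. It assumes Python 3, where `html.unescape` (available since Python 3.4) plays the role of `HTMLParser.HTMLParser().unescape`; the function name `clean_text` and the sample string are made up for illustration, and turning CR into a newline is just one of the two options mentioned above.

```python
import html  # Python 3.4+; stands in for Python 2's HTMLParser().unescape


def clean_text(raw):
    """Resolve character references, then normalize a few characters."""
    # Turn references such as &#8217;, &amp; and &#13; into real characters.
    text = html.unescape(raw)
    # Swap the typographic apostrophe (U+2019) for an ASCII single quote.
    text = text.replace("\u2019", "'")
    # Assumption: treat CR and CRLF as line breaks rather than dropping them.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text


# Hypothetical sample combining the three references discussed above.
sample = "The problem isn&#8217;t Huancavelica &amp; Huancayo.&#13;Next line"
print(clean_text(sample))
```

The cleaned string can then be handed to the tokenizer and tagger as usual. Whether to keep U+2019 or map it to an ASCII quote depends on what the downstream tokenizer handles best.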

answered Apr 11 '13 at 11:55

viraptor

If I use `HTMLParser().unescape(s)`, I get: `u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'` – Mulone Apr 11 '13 at 12:00
And that's fine - that's exactly what the text is. If you print it rather than display the variable in the REPL, you will see "isn’t". That isn't the typical ASCII apostrophe, but you can replace it with one if needed. – viraptor Apr 11 '13 at 12:04
Check `print HTMLParser.HTMLParser().unescape(s)` -vs- `print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")` – viraptor Apr 11 '13 at 12:05
