I am working with lemmatizers in Python, and it is interesting. I am testing how the lemmatizer handles words like 'does', 'did', 'are', and 'is'. I get the right base word, but I noticed that "u'" is stuck in front of some of those words.

from nltk.stem.wordnet import WordNetLemmatizer
tokens = ['did', 'does', 'do', "doesn't",'are', 'is', 'splendid']

lemm = WordNetLemmatizer()
tokens2 = [lemm.lemmatize(i, 'v') for i in tokens]

print tokens
print tokens2

Output:

['did', 'does', 'do', "doesn't", 'are', 'is', 'splendid']

[u'do', u'do', 'do', "doesn't", u'be', u'be', 'splendid']

How can I get rid of "u'" so that it reads:

['do', 'do', 'do', "doesn't", 'be', 'be', 'splendid']

Thank you very much

Makio21
    The 'u' denotes that it is a `unicode` string. Here is a link to a fuller answer to this question: http://stackoverflow.com/questions/11279331/what-does-the-u-symbol-mean-in-front-of-string-values – JAB Jan 27 '15 at 04:38

1 Answer

The OP asks: how can I get rid of "u'"?

Answer: switch to Python 3. These are Unicode strings (for obvious reasons of internationalization!). In Python 2, default strings were byte strings (usually ASCII, very US-only), so the repr of actual text strings (known as unicode objects back then) used a u prefix to clarify what they were.
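
Since the prefix belongs to the repr rather than to the string's contents, a small sketch may help show that the data itself is already what the OP wants. The list below is a hand-written stand-in for the lemmatizer output in the question (an assumption, so that the snippet runs without NLTK installed), and the name `joined` is purely illustrative. It should behave the same on Python 2 and Python 3.

```python
# Hypothetical stand-in for the lemmatizer output shown in the question.
tokens2 = [u'do', u'do', 'do', "doesn't", u'be', u'be', 'splendid']

# The u prefix is not stored in the string: the value is just two characters.
first_is_plain = (tokens2[0] == 'do')
first_length = len(tokens2[0])

# Printing a list shows the repr of each element; joining the elements
# into one string and printing that shows only their contents.
joined = ', '.join(tokens2)
print(joined)
```

So on Python 2 the prefix only appears when a whole list (or a repr) is displayed; formatting the elements yourself avoids it without changing the data.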

In Python 3, priorities are at last the right way 'round. "Normal" strings, texts, are known to be unicode ones, so no u is needed and none is displayed; it's byte strings, should you use any, that get a clarifying prefix (b, standing of course for "bytes").
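
To make the Python 3 behaviour described above concrete, here is a minimal sketch, assuming Python 3.3 or later (where the u prefix is still accepted on literals for compatibility). The variable names are illustrative only.

```python
# Run under Python 3.3+.
text = 'do'      # a normal (Unicode) string
legacy = u'do'   # the u prefix is still accepted and means the same thing
raw = b'do'      # a byte string

text_repr = repr(text)  # no prefix for text
raw_repr = repr(raw)    # bytes carry the clarifying b prefix

print(text_repr, raw_repr)
print(text == legacy)   # the same string with or without the u prefix
print(type(text).__name__, type(raw).__name__)
```

In other words, the list from the question would display without any prefixes under Python 3, and only genuine byte strings would be marked.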

Alex Martelli
  • Does that mean I have the right answer and I just need to switch to Python 3 to get rid of "u'"? – Makio21 Jan 27 '15 at 04:58
  • @Makio21, yep, though I believe you may need to upgrade to NLTK 3 for Python 3 support. It's worth it for other improvements too. I taught about the original NLTK in a guest seminar at Stanford but had to apologize for several dubious design choices... I'd be much prouder teaching NLTK 3 even though I had no part in its making (my background in computational linguistics is ancient and I haven't published in the field in decades!-). – Alex Martelli Jan 27 '15 at 05:04