
I'm crawling a far-right website for my research on hate and racism detection, so the content of my example may be offensive.

I'm trying to remove some stopwords and punctuation in Python using NLTK, but I've run into an encoding problem. I'm using Python 2.7, and the data come from a file that I filled with articles from the website I crawled:

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break

And the output looks like this (I added ... to shorten the sample):

<type 'str'>   A Negress Bernie ... they’re not going to take it anymore.

['a', 'negress', 'bernie', ... , 'they\u2019re', 'not', 'going', 'to', 'take', 'it', 'anymore', '.']

I don't understand why this '\u2019' is there; it shouldn't be. Can someone tell me how to get rid of it? I tried to encode in UTF-8 but I still get the same problem.

mel
  • `\u2019` is the unicode symbol [RIGHT SINGLE QUOTATION MARK](http://unicode.org/cldr/utility/character.jsp?a=2019). If you don't have too many different problem characters, you can simply [fix your strings](http://stackoverflow.com/questions/24358361/removing-u2018-and-u2019-character) – alexis Dec 01 '16 at 01:07

1 Answer

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    # strip non-ASCII characters by encoding with the 'ignore' error handler
    value = value.encode('ascii', 'ignore')
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break
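
The `'ignore'` handler silently drops the character, which fuses words like "they're" into "theyre". An alternative is to map typographic punctuation back to its ASCII equivalent before tokenizing. A minimal Python 3 sketch (where all strings are unicode; the replacement table below is my own assumption, not part of the original answer):

```python
# Map common typographic punctuation to ASCII equivalents
# before tokenizing, instead of dropping the characters.
QUOTE_MAP = {
    u'\u2019': "'",   # RIGHT SINGLE QUOTATION MARK
    u'\u2018': "'",   # LEFT SINGLE QUOTATION MARK
    u'\u201c': '"',   # LEFT DOUBLE QUOTATION MARK
    u'\u201d': '"',   # RIGHT DOUBLE QUOTATION MARK
}

def to_ascii_quotes(text):
    # str.translate takes a mapping of code points to replacements
    return text.translate({ord(k): v for k, v in QUOTE_MAP.items()})

value = u"they\u2019re not going to take it anymore."
print(to_ascii_quotes(value))  # they're not going to take it anymore.
```

This keeps the contraction intact, so the tokenizer sees "they're" rather than "theyre" or "they?re".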
Ari Gold
  • Thanks :) I switched 'ignore' to 'replace'; otherwise I would get 'theyre'. Then I can remove the '?' with string.punctuation – mel Nov 30 '16 at 17:18
  • I like your task topic, go ahead – Ari Gold Nov 30 '16 at 17:26
  • This is not good advice. Even before processing the text, you should have explicitly determined the site's encoding while crawling, and set the crawler to decode with that encoding. If the pages are all UTF-8, then comparing strings in Python 3 makes more sense and causes you less pain. – alvas Dec 01 '16 at 06:54
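
Following that comment, a minimal Python 3 sketch of the decode-first approach (the raw byte string and the UTF-8 charset here are assumptions for illustration; in practice, read the charset from the HTTP `Content-Type` header or the page's `<meta charset=...>` tag):

```python
# Decode the crawled bytes with an explicit, known charset
# instead of stripping characters later in the pipeline.
raw = b"they\xe2\x80\x99re not going to take it anymore."
text = raw.decode("utf-8")

# The text is now proper unicode end to end: \u2019 prints and
# compares as an ordinary character, so no ASCII round-trip
# (encode('ascii', 'ignore')) is needed before tokenizing.
assert "\u2019" in text
print(text)
```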