I'm crawling a far-right website for my research on hate speech and racism detection, so the content of my examples may be offensive.
I'm trying to remove some stopwords and punctuation in Python using NLTK, but I've run into an encoding problem. I'm using Python 2.7, and the data come from a file that I filled with articles from the website I crawled:
import nltk

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break
The output looks like this (I added ... to shorten the sample):
<type 'str'> A Negress Bernie ... they’re not going to take it anymore.
['a', 'negress', 'bernie', ... , 'they\u2019re', 'not', 'going', 'to', 'take', 'it', 'anymore', '.']
I don't understand why this '\u2019' is there; it shouldn't be. Can someone tell me how to get rid of it? I tried encoding to UTF-8, but I still get the same problem.
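To show what I mean, here is a minimal sketch of the decode-then-normalize approach I've been attempting (the function name and sample bytes are mine, not from my real data): the article arrives as raw UTF-8 bytes, and U+2019 is the typographic right single quotation mark that many websites use instead of a plain ASCII apostrophe.

```python
# -*- coding: utf-8 -*-
# Sketch: decode the crawled UTF-8 bytes to unicode, then map the
# typographic apostrophe U+2019 to a plain ASCII "'" before tokenizing.

def to_unicode(raw_bytes):
    """Decode crawled UTF-8 bytes and normalize curly apostrophes."""
    text = raw_bytes.decode("utf-8")      # bytes -> unicode
    return text.replace(u"\u2019", u"'")  # replace ' with '

# b"\xe2\x80\x99" is the UTF-8 encoding of U+2019
sample = b"they\xe2\x80\x99re not going to take it anymore."
print(to_unicode(sample))  # -> they're not going to take it anymore.
```

Is this the right way to handle it, or is there something I should be doing at tokenization time instead?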