I'm trying to count the most common words in a body of text using NLTK. I read in the text file, convert it to lowercase, delete punctuation, and tokenize it; then I remove stop words and count the most common words. However, I'm getting the following warning:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Here's my code:
import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

def get_tokens():
    with open('/Users/user/Code/abstract/data/Training(3500)/3500_Response_Tweets.txt', 'r') as r_tweets:
        text = r_tweets.read()
        lowers = text.lower()
        #remove the punctuation using the character deletion step of translate
        no_punctuation = lowers.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(no_punctuation)
        return tokens

tokens = get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(100)
As well as the warning, my output looks like this:
[('so', 268), ('\xe2\x80\x8e\xe2\x80\x8fi', 231), ('like', 192), ('know', 157), ('dont', 137), ('get', 125), ('im', 122), ('would', 118), ('\xe2\x80\x8e\xe2\x80\x8fbut', 118), ('\xe2\x80\x8e\xe2\x80\x8foh', 114), ('right', 113), ('good', 105), ('\xe2\x80\x8e\xe2\x80\x8fyeah', 95), ('sure', 94), ('one', 92),
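For reference, the escaped prefix on some of those tokens can be decoded to see which characters it contains. This is only a minimal sketch, assuming the file is UTF-8 encoded; the names prefix and decoded are illustrative and not part of my script:

```python
import unicodedata

# Byte sequence copied from the output above, assumed to be UTF-8.
prefix = b'\xe2\x80\x8e\xe2\x80\x8f'
decoded = prefix.decode('utf-8')

# Print the code point and Unicode name of each decoded character.
for ch in decoded:
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))
```

Under that assumption the bytes are two invisible directional marks, not part of the words themselves, which would explain why 'i' and '\xe2\x80\x8e\xe2\x80\x8fi' are counted separately.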
When I use codecs.open to read the file instead, I get this traceback:
Traceback (most recent call last):
  File "tfidf.py", line 16, in <module>
    tokens = get_tokens()
  File "tfidf.py", line 12, in get_tokens
    no_punctuation = lowers.translate(None, string.punctuation)
TypeError: translate() takes exactly one argument (2 given)
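One thing that may be relevant to the traceback: in Python 2, str.translate accepts (table, deletechars), but unicode.translate accepts only a single mapping argument, and codecs.open returns unicode text. Below is a rough sketch of the same cleaning steps on decoded text. The sample string, the tiny stop word set, and the whitespace split are stand-ins I made up for the real file, stopwords.words('english'), and nltk.word_tokenize, so that the snippet runs without any NLTK data; clean is a hypothetical helper name.

```python
import string
import unicodedata
from collections import Counter

# Hypothetical sample standing in for the decoded file contents.
text = u'\u200e\u200fOh, I know! \u200e\u200fBut I dont know... so, like, I know.'

# Illustrative stop word set; the real script would use
# set(stopwords.words('english')) from NLTK instead.
stop_words = {u'i', u'so', u'but'}

def clean(text):
    lowers = text.lower()
    # On unicode text, translate() takes one mapping:
    # code point -> None deletes that character.
    delete_punct = {ord(c): None for c in string.punctuation}
    no_punct = lowers.translate(delete_punct)
    # Drop invisible format characters (category Cf), e.g. U+200E / U+200F.
    return u''.join(ch for ch in no_punct if unicodedata.category(ch) != 'Cf')

# Whitespace split is a stand-in for nltk.word_tokenize.
tokens = clean(text).split()
filtered = [w for w in tokens if w not in stop_words]
counts = Counter(filtered)
print(counts.most_common(3))
```

Building the stop word collection once as a set, as in the sketch, also avoids calling stopwords.words('english') for every token in the list comprehension.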