I'm trying to count the words in a body of text using NLTK. I read in the text file, convert it to lowercase, delete punctuation, and tokenize. Then I remove stop words and count the most common words. However, I'm getting the following warning:

UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Here's my code:

import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

def get_tokens():
    with open('/Users/user/Code/abstract/data/Training(3500)/3500_Response_Tweets.txt', 'r') as r_tweets:
        text = r_tweets.read()
        lowers = text.lower()
        # remove the punctuation using the character deletion step of translate
        no_punctuation = lowers.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(no_punctuation)
        return tokens

tokens = get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(100)

As well as the warning, my output looks like:

[('so', 268), ('\xe2\x80\x8e\xe2\x80\x8fi', 231), ('like', 192), ('know', 157), ('dont', 137), ('get', 125), ('im', 122), ('would', 118), ('\xe2\x80\x8e\xe2\x80\x8fbut', 118), ('\xe2\x80\x8e\xe2\x80\x8foh', 114), ('right', 113), ('good', 105), ('\xe2\x80\x8e\xe2\x80\x8fyeah', 95), ('sure', 94), ('one', 92),

Traceback error when using codecs.open:

Traceback (most recent call last):
  File "tfidf.py", line 16, in <module>
    tokens = get_tokens()
  File "tfidf.py", line 12, in get_tokens
    no_punctuation = lowers.translate(None, string.punctuation)
TypeError: translate() takes exactly one argument (2 given)
dizzle
  • It appears those tokens start with an LTR mark (`u'\u200e'`) followed by an RTL mark (`u'\u200f'`), encoded as UTF-8. (I don't know why, but it seems all you need to do for this task is to take them out.) You should check what encoding your file is in (like I said, it looks like it might be UTF-8) and decode it appropriately, then strip those characters if need be. – Dan Getz Mar 23 '16 at 14:38
  • I answered the first question (UnicodeWarning). I would suggest you open a second question for this: "TypeError: translate() takes exactly one argument (2 given)" – guettli Mar 24 '16 at 06:08
  • Possible duplicate of [Python unicode equal comparison failed](https://stackoverflow.com/questions/18193305/python-unicode-equal-comparison-failed) – Alastair Irvine Jul 26 '17 at 10:15
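Dan Getz's suggestion above (decode the bytes as UTF-8, then drop the two direction marks) could look roughly like the sketch below. The helper name `strip_direction_marks` and the sample bytes are made up for illustration, and the sketch assumes the file really is UTF-8, which the question does not confirm.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: decode UTF-8 bytes, then remove the LTR/RTL marks.
# `strip_direction_marks` is a hypothetical helper, not an NLTK function.

LRM = u'\u200e'  # LEFT-TO-RIGHT MARK, UTF-8 bytes e2 80 8e
RLM = u'\u200f'  # RIGHT-TO-LEFT MARK, UTF-8 bytes e2 80 8f


def strip_direction_marks(text):
    """Return `text` (a unicode string) without LTR/RTL marks."""
    return text.replace(LRM, u'').replace(RLM, u'')


# Sample bytes shaped like the odd tokens in the question's output.
raw = b'\xe2\x80\x8e\xe2\x80\x8fi like it'
decoded = raw.decode('utf-8')      # assumes the data is UTF-8
cleaned = strip_direction_marks(decoded)
print(repr(cleaned))
```

After this step a token such as `'\xe2\x80\x8e\xe2\x80\x8fi'` becomes plain `i`, so it can match the stop word list.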

1 Answer

My advice: use `io.open('filename.txt', 'r', encoding='utf8')`. Then you get nice unicode objects and not ugly byte objects.

This works for Python2 and Python3. See: https://stackoverflow.com/a/22288895/633961
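As a rough illustration of the difference, the sketch below writes a small sample file and reads it back both ways. The temporary file, its name, and its contents are invented for the example; only `io.open` with `encoding='utf8'` comes from the advice above.

```python
# -*- coding: utf-8 -*-
# Sketch: text-mode io.open decodes for you; binary mode returns raw bytes.
import io
import os
import tempfile

# Hypothetical sample line, starting with the LTR and RTL marks.
sample = u'\u200e\u200fOh, so good!\n'
path = os.path.join(tempfile.mkdtemp(), 'sample_tweets.txt')

with io.open(path, 'w', encoding='utf8') as f:
    f.write(sample)

# Text mode with an explicit encoding: read() returns decoded text.
with io.open(path, 'r', encoding='utf8') as f:
    text = f.read()

# Binary mode: read() returns the undecoded UTF-8 bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```

In Python 2, comparing a non-ASCII byte string like `raw` against a unicode string is what triggers the `UnicodeWarning`; working with decoded text throughout avoids that comparison.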

guettli
  • Use `io.open('filename.txt', 'r', encoding='utf8')` for `python2`, and `open('filename.txt', 'r', encoding='utf8')` for `python3`. – alvas Mar 23 '16 at 14:43
  • after using `codecs.open('filename', 'r', encoding="utf-8")`, i get the following error: `LookupError: unknown encoding: unicode users-MBP:tfidf user$ python tfidf.py Traceback (most recent call last): File "tfidf.py", line 16, in tokens = get_tokens() File "tfidf.py", line 12, in get_tokens no_punctuation = lowers.translate(None, string.punctuation) TypeError: translate() takes exactly one argument (2 given)` I'm not very well versed on encoding in python – dizzle Mar 23 '16 at 14:51
  • @dizzle please post the traceback into the question. It looks like there are two errors: LookupError and TypeError. I can't understand the above traceback inside the comment. – guettli Mar 23 '16 at 15:29
  • I've added the traceback. The LookupError was a mistake when I was copying and pasting, sorry. – dizzle Mar 23 '16 at 15:48
  • @guettli https://mail.python.org/pipermail/python-list/2015-March/687124.html and http://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python – alvas Mar 23 '16 at 16:29
  • @alvas thank you for both links. This was new to me. I updated my answer. – guettli Mar 23 '16 at 20:34
  • @dizzle: Your traceback is because you're running on Python 3, but calling the `translate` function the way you would on Python 2; the Python 3 `str.translate` has a completely different interface matching Py2's `unicode.translate`, while `bytes.translate` matches Py2's `str.translate` signature. Add a line at top level that does `removepunc = str.maketrans('', '', string.punctuation)` then change the `translate` call to `lowers.translate(removepunc)`. – ShadowRanger Mar 23 '16 at 20:44
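ShadowRanger's fix can be sketched for Python 3 as follows. To keep the example runnable without NLTK data, it uses `str.split` in place of `nltk.word_tokenize` and a tiny made-up stop word set in place of `stopwords.words('english')`; both are stand-ins, not the question's actual setup.

```python
# Sketch (Python 3): delete punctuation with str.maketrans + str.translate,
# then count the remaining non-stop-word tokens.
import string
from collections import Counter

# Translation table whose third argument lists the characters to delete.
REMOVE_PUNCT = str.maketrans('', '', string.punctuation)

# Hypothetical stand-in for stopwords.words('english'); a set built once
# gives fast membership tests.
STOP_WORDS = {'i', 'the', 'a', 'but'}


def count_words(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(REMOVE_PUNCT)
    tokens = cleaned.split()  # stand-in for nltk.word_tokenize
    return Counter(t for t in tokens if t not in STOP_WORDS)


counts = count_words("So, I like it. But I don't know... so what?")
print(counts.most_common(1))  # [('so', 2)]
```

Note that `string.punctuation` covers ASCII punctuation only, so invisible characters such as the LTR/RTL marks are not removed by this table and need to be stripped separately.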