I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.

from nltk.corpus import stopwords
import nltk
import json

with open("/Users/titus/Desktop/trumptweets.json",'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])

    type(tokens)

    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in 
                    stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)

print(tweets)

I'm trying to remove words like "I" and "the", but for some reason it doesn't work and I can't figure out why. When I look at the tweets after they've gone through the loop, the word "the" still occurs.

Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?

titusAdam

3 Answers

Beware of the processing order.

Here are two test strings for you:

THIS THE REMAINS.

this the is removed

The first one keeps its "THE" because "THE" is not equal to "the": the stopword list is lowercase, and you lowercase only after filtering. Lowercase first, then filter.
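A minimal sketch of the two orderings, run on the first test string. The small STOPWORDS set is a hypothetical stand-in for stopwords.words('english') so that the snippet runs without downloading the NLTK corpus, and the function names are made up for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'), which is all lowercase.
STOPWORDS = {"i", "the", "is", "this"}


def clean_wrong_order(tokens):
    """Filter on the original casing, lowercase afterwards (the order in the question)."""
    return [t.lower() for t in tokens if t.isalpha() and t not in STOPWORDS]


def clean(tokens):
    """Lowercase each token first, then drop non-alphabetic tokens and stopwords."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in STOPWORDS]


tokens = "THIS THE REMAINS .".split()
print(clean_wrong_order(tokens))  # uppercase stopwords slip through the filter
print(clean(tokens))              # stopwords are matched after lowercasing
```

With the real NLTK list, converting it to a set once before the loop should also avoid re-reading the stopword list for every token.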

The bad news for you: k-means works very badly on noisy short texts such as tweets, because it is sensitive to noise and TF-IDF vectors need fairly long texts to be reliable. So verify your results carefully; they are probably not as good as they may seem at first.

Has QUIT--Anony-Mousse
  • Thanks for your reply! What do you exactly mean? Do you mean that wordpunct_tokenize needs to be performed after lower? – titusAdam Mar 10 '18 at 20:49
  • And do you have any recommendations that would be better suited for clustering analysis for twitter?:) – titusAdam Mar 10 '18 at 20:55

Have you tried lowercasing w in the check?

words = [w.lower() for w in text if w.isalpha() and w.lower() not in
                    stopwords.words('english') and w.lower() != 'the']
Piotr Banaś

is (and is not) is the (reference) identity check. It tests whether two names point to the same object in memory, not whether their values are equal. Typically it is only used to compare with None, or in a few other special cases.

In your case, use the != operator or the negation of == to compare with the string "the".
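As a small illustration of the difference, the sketch below builds the string "the" at run time so that it is a separate object from the literal. The names word, literal and keep are made up for this example, and the result of the identity check is a CPython implementation detail, so no particular output is promised for it.

```python
# Build "the" at run time so it is not the same object as the literal below.
word = "".join(["t", "h", "e"])
literal = "the"

print(word == literal)  # value comparison: the characters are equal
print(word is literal)  # identity comparison: usually False in CPython, not guaranteed


def keep(token):
    """Keep a token unless its value equals "the"."""
    return token != "the"


print([t for t in ["the", word, "tweet"] if keep(t)])
```

Because != compares values, both copies of "the" are dropped regardless of how the strings were created.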

See also: Is there a difference between `==` and `is` in Python?

Jeronimo