Linked Removing escaped entities from a String in Python

My code reads a large CSV file of tweets and parses it into two dictionaries, one for each sentiment. I then loop over both, unescape each tweet using HTMLParser, and use the translate() method to remove all punctuation from the text.
Finally, I am trying to keep only words that are at least three characters long.
This is my code:

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    text = HTMLParser.HTMLParser().unescape(text.decode('ascii'))
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    shortenedText = [e.lower() and e.translate(remove_punctuation_map) for e in text.split() if len(e) >= 3 and not e.startswith(('http', '@')) ]
    print shortenedText

What I'm finding, however, is that although most of what I want is being done, I still get words of length two (though none of length one), and I get a few blank entries in my output.
For example, this input:

(: !!!!!! - so I wrote something last week
* enough said *
.... Do I need to say it?

Produces:

[u'', u'wrote', u'something', u'last', u'week']
[u'enough', u'said']
[u'', u'need', u'even', u'say', u'it']

What's wrong with my code? How can I remove all words shorter than three characters, including blank entries?

Andrew Martin

1 Answer

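I think your problem is that when you test whether len(e) >= 3, e still contains punctuation, so "it?" (three characters) passes the test and only afterwards becomes "it". In the same way, a token made up entirely of punctuation, such as "!!!!!!", passes the test and then becomes an empty string. Try doing it in two steps: clean e of punctuation first, then filter by length.

Something like: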

cleanedText = [e.translate(remove_punctuation_map).lower() for e in text.split() if not e.startswith(('http', '@')) ]
shortenedText = [e for e in cleanedText if len(e) >= 3]
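
For anyone on Python 3, the same clean-then-filter idea can be sketched as below. This is only an illustration under a few assumptions: html.unescape and str.maketrans stand in for the Python 2 HTMLParser and dict-based map used in the question, and clean_tweet and min_length are hypothetical names that do not appear in the original code.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_tweet(text, min_length=3):
    """Return lowercased words of at least min_length characters.

    Hypothetical helper: the name and the min_length parameter are
    assumptions made for this sketch.
    """
    words = []
    for token in html.unescape(text).split():
        # Skip links and mentions before stripping punctuation,
        # because stripping would remove the leading '@'.
        if token.startswith(("http", "@")):
            continue
        word = token.translate(REMOVE_PUNCTUATION).lower()
        # Measure the length after cleaning, so "it?" counts as 2
        # characters and "!!!!!!" counts as 0.
        if len(word) >= min_length:
            words.append(word)
    return words


for tweet in ["(: !!!!!! - so I wrote something last week",
              "* enough said *",
              ".... Do I need to say it?"]:
    print(clean_tweet(tweet))
```

Because the length check runs on the cleaned word, both the two-letter leftovers and the empty strings are dropped in a single pass.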
Brionius