My code reads in a large CSV file of tweets and parses it into two dictionaries (depending on the sentiment of the tweets). I then create a new dictionary and unescape everything using HTMLParser before using the translate() method to remove all punctuation from the text.
Finally, I am trying to keep only words that are at least three characters long.
This is my code:
tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    text = HTMLParser.HTMLParser().unescape(text.decode('ascii'))
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    shortenedText = [e.lower() and e.translate(remove_punctuation_map) for e in text.split() if len(e) >= 3 and not e.startswith(('http', '@'))]
    print shortenedText
What I'm finding, however, is that while most of what I want is being done, I am still getting words of length two (though none of length one), and I'm getting a few blank entries in my output lists.
For example:
(: !!!!!! - so I wrote something last week
* enough said *
.... Do I need to say it?
Produces:
[u'', u'wrote', u'something', u'last', u'week']
[u'enough', u'said']
[u'', u'need', u'even', u'say', u'it']
What's wrong with my code? How can I remove all words shorter than three characters, including the blank entries?
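For reference, the likely cause seems to be the order of operations in the list comprehension: the len(e) >= 3 check runs on the raw token, before translate() strips the punctuation. A token like "!!!!!!" passes the check and then becomes an empty string, and "it?" passes and becomes "it". Separately, e.lower() and e.translate(...) evaluates to the translate() result of the original e, so the lowercasing is discarded. Below is a minimal sketch of one possible fix that strips punctuation first and filters by length afterwards. It assumes Python 3 (html.unescape instead of HTMLParser), and clean_tweet and min_length are hypothetical names introduced only for illustration.

```python
import html
import string

# Map every ASCII punctuation code point to None so translate() deletes it.
REMOVE_PUNCTUATION_MAP = {ord(char): None for char in string.punctuation}


def clean_tweet(text, min_length=3):
    """Hypothetical helper: strip punctuation first, then filter by length."""
    text = html.unescape(text)
    words = []
    for token in text.split():
        # Skip links and mentions before any punctuation is removed.
        if token.startswith(('http', '@')):
            continue
        # Clean the token, then measure the cleaned result.
        word = token.translate(REMOVE_PUNCTUATION_MAP).lower()
        if len(word) >= min_length:
            words.append(word)
    return words


for tweet in ["(: !!!!!! - so I wrote something last week",
              "* enough said *",
              ".... Do I need to say it?"]:
    print(clean_tweet(tweet))
```

Because the length test now sees the cleaned word, tokens made only of punctuation drop out instead of surviving as empty strings, and words such as "it" are filtered along with the other short ones.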