Code not removing desired values from dictionary

Question


My code is reading in a big csv file of tweets and parsing it to two dictionaries (depending on the sentiment of the tweets). I then create a new dictionary and unescape everything using HTML parser before using the translate() method to remove all punctuation from the text.
Finally, I am trying to keep only words that are at least three characters long.
This is my code:

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    text = HTMLParser.HTMLParser().unescape(text.decode('ascii'))
    remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
    shortenedText = [e.lower() and e.translate(remove_punctuation_map) for e in text.split() if len(e) >= 3 and not e.startswith(('http', '@')) ]
    print shortenedText

What I'm finding however is that whilst most of what I want is being done, I am still getting words that are of length two (not length one however) and I'm getting a few blank entries in my dictionary.
For example:

(: !!!!!! - so I wrote something last week
* enough said *
.... Do I need to say it?

Produces:

[u'', u'wrote', u'something', u'last', u'week']
[u'enough', u'said']
[u'', u'need', u'even', u'say', u'it']

What's wrong with my code? How can I remove all words shorter than three characters, including blank entries?

Note that `e.lower() and e.translate(remove_punctuation_map)` does not do what you think it does. You probably want `e.lower().translate(remove_punctuation_map)` instead. — Martijn Pieters, Aug 09 '13 at 14:28
`e.lower()` returns the changed string rather than altering it in place. Provided `e` is not empty, `e.lower()` is merely used as a boolean test and only `e.translate(...)` is returned. — Martijn Pieters, Aug 09 '13 at 14:29
Of course, makes sense. Thank you, have changed my code accordingly. — Andrew Martin, Aug 09 '13 at 14:30
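
A small sketch of the behaviour described in the comments above. It is written for Python 3 as an assumption (the question uses Python 2), so the translation table is built with `str.maketrans`, and the sample word is made up for illustration. `a and b` evaluates to `b` whenever `a` is truthy, so the lowercased string is discarded.

```python
import string

# Python 3 stand-in for the question's remove_punctuation_map (assumption:
# the original is Python 2, where the dict is built by hand).
remove_punctuation_map = str.maketrans("", "", string.punctuation)

word = "Week!"  # hypothetical sample token

# `and` returns its second operand when the first is truthy, so the result of
# word.lower() is only used as a truth test and then thrown away.
with_and = word.lower() and word.translate(remove_punctuation_map)

# Chaining the calls applies both transformations.
chained = word.lower().translate(remove_punctuation_map)

print(with_and)  # Week
print(chained)   # week
```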

Brionius · Accepted Answer · 2013-08-09T14:31:16.633

4

I think your problem is that when you test whether len(e) >= 3, e still contains punctuation, so "it?" is not filtered out. Maybe do it in two steps? Clean e of punctuation, then filter for size?

Something like

cleanedText = [e.translate(remove_punctuation_map).lower() for e in text.split() if not e.startswith(('http', '@')) ]
shortenedText = [e for e in cleanedText if len(e) >= 3]
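
For readers on Python 3, the same two-step idea might look roughly like the sketch below. This is an adaptation, not the code from the question: `html.unescape` and `str.maketrans` replace the Python 2 `HTMLParser` call and the hand-built map, and `clean_tweet` and `min_length` are hypothetical names introduced here for illustration.

```python
import html
import string

# Python 3 stand-in for the question's remove_punctuation_map.
remove_punctuation_map = str.maketrans("", "", string.punctuation)


def clean_tweet(text, min_length=3):
    """Hypothetical helper: strip punctuation first, then filter by length."""
    text = html.unescape(text)
    # Step 1: drop links and mentions, remove punctuation, lowercase.
    cleaned = [
        e.translate(remove_punctuation_map).lower()
        for e in text.split()
        if not e.startswith(("http", "@"))
    ]
    # Step 2: filter on the cleaned word, so "it?" is measured as "it".
    return [e for e in cleaned if len(e) >= min_length]


for line in [
    "(: !!!!!! - so I wrote something last week",
    "* enough said *",
    ".... Do I need to say it?",
]:
    print(clean_tweet(line))
```

Because the length test runs after the punctuation is removed, tokens that consist only of punctuation become empty strings and are dropped along with the short words.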

edited Aug 09 '13 at 14:31

answered Aug 09 '13 at 14:25


And I think `e.lower() and e.translate(remove_punctuation_map)` should be replaced with `e.translate(remove_punctuation_map).lower()`. – Ashwini Chaudhary Aug 09 '13 at 14:28
That's exactly it. As I posted it I did wonder about that, but couldn't figure out how to solve it. This is perfect, cheers. – Andrew Martin Aug 09 '13 at 14:29
@AshwiniChaudhary: Thanks for this as well, have changed my code accordingly – Andrew Martin Aug 09 '13 at 14:30
