Removing escaped entities from a String in Python

Question

I have a huge CSV file of tweets. I read them in and stored them in two separate dictionaries: one for negative tweets, one for positive. I wanted to parse the tweets into a list of words while removing any punctuation marks. I've used this code:

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    shortenedText = [e.lower() and e.translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
print shortenedText

It's all worked well barring one minor problem. The huge CSV file I downloaded has unfortunately had some of its punctuation changed. I'm not sure what this is called, so I can't really google it, but effectively some tweets might begin:

"ampampFightin"
"&quot;The truth is out there"
"&altThis is the way I feel"

Is there a way to get rid of all these? I notice the latter two begin with an ampersand. Will a simple search for that get rid of them? (The only reason I'm asking rather than trying is that there are too many tweets for me to check manually.)

`&quot;` is an HTML escaped entity. You are looking to un-escape these. – Martijn Pieters, Aug 09 '13 at 12:24
Anything that is missing the `&` or `;` characters is malformed and is not likely to be recoverable. – Martijn Pieters, Aug 09 '13 at 12:25
http://www.htmlhelp.com/reference/html40/entities/special.html Here is a list of all of them in HTML 4.0. – rlms, Aug 09 '13 at 12:25
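
To illustrate the comments above, here is a minimal sketch that runs the three sample strings from the question through an unescaper. It assumes Python 3.4 or later, where `html.unescape` is available (the question itself uses Python 2), and the variable names are illustrative only.

```python
import html

# The three sample strings from the question.
samples = [
    "ampampFightin",                 # no '&' or ';' left, so nothing to unescape
    "&quot;The truth is out there",  # well-formed entity
    "&altThis is the way I feel",    # '&alt' is not a known entity name
]

# html.unescape converts only the references it recognises
# and leaves everything else untouched.
cleaned = [html.unescape(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Only the second string changes. The malformed ones pass through unmodified, so they would need separate handling if they matter.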

score 4 · Accepted Answer · edited May 23 '17 at 11:43

First, unescape HTML entities, then remove punctuation chars:

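import HTMLParser
import string

parser = HTMLParser.HTMLParser()  # create the parser once and reuse it

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Convert entities such as &quot; back into the characters they stand for
    text = parser.unescape(text)
    # Lower-case each word and strip punctuation; skip short words and links
    shortenedText = [e.lower().translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
print shortenedText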

Here's an example of how unescape works:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape("&quot;The truth is out there")
u'"The truth is out there'

UPD: to fix the UnicodeDecodeError, decode the text first with text.decode('utf8'), so that the parser receives a unicode string rather than raw bytes. Here's a good explanation of why you need to do this.
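
For readers on Python 3, where the `HTMLParser` module name and the `print` statement used above do not exist, the same idea can be sketched as follows. This is an illustration under assumptions, not the answerer's code: `html.unescape` (Python 3.4+) stands in for `HTMLParser().unescape`, the raw tweet is assumed to be UTF-8 bytes, and `clean_tweet` is a hypothetical helper name.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(raw):
    """Decode, unescape, and tokenise one tweet (hypothetical helper)."""
    # Assumption: the CSV holds UTF-8 bytes; decode before touching the text.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Turn entities such as &quot; back into the characters they stand for.
    text = html.unescape(text)
    # Keep words of three or more characters that are not links,
    # then lower-case them and strip punctuation.
    return [
        word.lower().translate(PUNCT_TABLE)
        for word in text.split()
        if len(word) >= 3 and not word.startswith("http")
    ]


words = clean_tweet(b"&quot;The truth is out there&quot; caf\xc3\xa9 http://example.com")
print(words)
```

Unescaping happens before punctuation removal on purpose: stripping punctuation first would delete the `&` and `;` that mark an entity and leave fragments such as `quot` behind.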

answered Aug 09 '13 at 12:26

alecxe

And to unescape them, should I do a search for anything beginning with an ampersand? – Andrew Martin Aug 09 '13 at 12:26
Nope, just give it a text and it'll unescape entities that it will find in the text. – alecxe Aug 09 '13 at 12:28
Thanks for this, but when I run it I get this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 28: ordinal not in range(128) – Andrew Martin Aug 09 '13 at 12:30
`import html.parser; html.parser.HTMLParser().unescape(text)` for Python 3. – rlms Aug 09 '13 at 12:31
@Andrew: you can force a string into a particular encoding with str.encode() -- in your case maybe text.encode('us-ascii') ? – Mayur Patel Aug 09 '13 at 12:33
@AndrewMartin I've updated the answer, please check. – alecxe Aug 09 '13 at 12:37
I couldn't get it working with the lower-case html.parser or with text.encode('us-ascii'). I DID seem to make some progress using text.decode('utf-8'), but now I get a different error: TypeError: translate() takes exactly one argument (2 given). – Andrew Martin Aug 09 '13 at 12:40
I haven't changed the original shortenedText= line so I'm not sure why this is happening now. – Andrew Martin Aug 09 '13 at 12:41
I fixed this problem by following: http://stackoverflow.com/questions/11692199/string-translate-with-unicode-data-in-python – Andrew Martin Aug 09 '13 at 12:59
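
The `TypeError` in the comments arises because, in Python 2, `unicode.translate` takes a single mapping argument rather than the `(table, deletechars)` pair accepted by `str.translate`. Below is a minimal sketch of one common one-argument form, which is also how `str.translate` works in Python 3 (the version assumed here). `strip_punctuation` is a hypothetical helper name, and this is not necessarily the exact code from the linked question.

```python
import string

# Map each punctuation code point to None, which tells translate() to delete it.
DELETE_PUNCT = dict.fromkeys(map(ord, string.punctuation))


def strip_punctuation(text):
    """Remove ASCII punctuation from a text (unicode) string."""
    return text.translate(DELETE_PUNCT)


result = strip_punctuation('"The truth, is out there!"')
print(result)
```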
