How do I get rid of characters like &#x27; that appear instead of apostrophes?

Question

Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.

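import re
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2

# url is the address of the page being scraped
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
responseString = str(soup)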

coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
coarseResult = coarseExpression.findall(responseString)

fineExpression = re.compile('<[^>]*>')
fineResult = []

for coarse in coarseResult:
    fine = fineExpression.sub('', coarse) 
    #print(fine)
    fineResult.append(fine)

Unfortunately, characters like apostrophes come out looking corrupted, for example as &#x27;. Is there a way to avoid this, or an easy way to replace them?

That's not corrupted, that's the HTML/XML character entity for an apostrophe (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). You could always decode such entities back to their ASCII equivalents. (http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python) — dreynold, Dec 22 '11 at 18:00
You are loading a page in BeautifulSoup **JUST TO REGEX IT!?** Why are you doing this awful, awful thing?! — Francis Avila, Dec 22 '11 at 18:12
@FrancisAvila, I'm still feeling my way around Python. Could you tell me a better way? — nindalf, Dec 24 '11 at 14:15
Use BeautifulSoup to search or walk through the HTML tree and get what you need. That's why it exists in the first place! Read the BeautifulSoup documentation. — Francis Avila, Dec 24 '11 at 15:43
Open a new question with what you are trying to do as a whole. I suspect you are fixated on using regexes when in reality that is exactly the *wrong* tool to accomplish your task. — Francis Avila, Dec 25 '11 at 02:31
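
The advice in these comments, walking the parsed tree instead of regexing the serialized page, can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3 and the standard library's `html.parser.HTMLParser` as a dependency-free stand-in for BeautifulSoup, and the names `SodaTextExtractor` and `extract_sodatext` are hypothetical. With `convert_charrefs=True` the parser decodes references such as `&#x27;` before the text reaches `handle_data`, so no separate replacement step is needed.

```python
from html.parser import HTMLParser


class SodaTextExtractor(HTMLParser):
    """Collect the text inside every <div class="sodatext"> element."""

    def __init__(self):
        # convert_charrefs=True decodes character references such as
        # &#x27; before the text is passed to handle_data().
        super().__init__(convert_charrefs=True)
        self.depth = 0     # > 0 while inside a matching div
        self.chunks = []   # text pieces of the div currently being read
        self.results = []  # one string per matching div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested divs so the matching close tag is found.
            if tag == "div":
                self.depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "sodatext" in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_sodatext(html_text):
    """Return the decoded text of each sodatext div in html_text."""
    parser = SodaTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.results
```

With BeautifulSoup itself the same idea would be a search for `div` elements whose class is `sodatext`, followed by reading each element's text, rather than calling `str(soup)` and applying regular expressions.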

score 5 · Answer 1 · answered Dec 22 '11 at 18:08

The following BeautifulSoup documentation on entity conversion should be what you're looking for:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
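
If changing the parser options is not convenient, the entities can also be decoded after the fact. A minimal sketch, assuming Python 3.4 or later (the question itself uses Python 2), where the standard library provides `html.unescape`; the wrapper name `decode_entities` is hypothetical:

```python
import html


def decode_entities(text):
    """Turn HTML character references into the characters they stand for."""
    # html.unescape handles named (&amp;), decimal (&#39;) and
    # hexadecimal (&#x27;) references alike.
    return html.unescape(text)


print(decode_entities("It&#x27;s not corrupted, it&#39;s escaped &amp; valid."))
# prints: It's not corrupted, it's escaped & valid.
```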

Andrew Clark

Just to point out that BS can't decode hex-encoded entities (`&#x27;`), but it works well with decimal-encoded entities (`&#39;`). So the OP needs to convert them beforehand. – Avaris Dec 22 '11 at 18:20
@Avaris That sounds like a bug, or at least a missing feature, in BS, TBH. – Karl Knechtel Dec 22 '11 at 21:06
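
The "convert them beforehand" step from the comment above could look roughly like this. It is a sketch, not something taken from the BeautifulSoup documentation: the function name `hex_entities_to_decimal` is hypothetical, and it only rewrites hexadecimal numeric references into their decimal form, leaving all other text untouched.

```python
import re

# Matches hexadecimal numeric character references such as &#x27;
HEX_ENTITY = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def hex_entities_to_decimal(text):
    """Rewrite &#xHH; references as the equivalent decimal &#DD; form."""
    return HEX_ENTITY.sub(lambda m: "&#%d;" % int(m.group(1), 16), text)


print(hex_entities_to_decimal("It&#x27;s"))
# prints: It&#39;s
```

The converted string can then be handed to a parser that only understands decimal references.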
