Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.

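import re
import urllib2

# BeautifulSoup 3 import path (assumed; matches the documentation linked in the answer)
from BeautifulSoup import BeautifulSoup

# url is assumed to hold the address of the page being scraped
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
responseString = str(soup)

# Grab each <div class="sodatext"> block from the serialized page
coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
coarseResult = coarseExpression.findall(responseString)

# Strip any remaining tags from each match
fineExpression = re.compile('<[^>]*>')
fineResult = []

for coarse in coarseResult:
    fine = fineExpression.sub('', coarse)
    #print(fine)
    fineResult.append(fine)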

Unfortunately, characters such as apostrophes come out garbled, like this: &#x27;. Is there a way to avoid this, or an easy way to replace them?

nindalf
  • That's not corrupted; that's the HTML/XML character entity for an apostrophe (http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). You could always decode such entities back to their ASCII equivalents (http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python). – dreynold Dec 22 '11 at 18:00
  • You are loading a page in BeautifulSoup **JUST TO REGEX IT!?** Why are you doing this awful, awful thing?! – Francis Avila Dec 22 '11 at 18:12
  • @FrancisAvila, I'm still feeling my way around Python. Could you tell me a better way? – nindalf Dec 24 '11 at 14:15
  • Use BeautifulSoup to search or walk through the HTML tree and get what you need. That's why it exists in the first place! Read the BeautifulSoup documentation. – Francis Avila Dec 24 '11 at 15:43
  • Open a new question with what you are trying to do as a whole. I suspect you are fixated on using regexes when in reality that is exactly the *wrong* tool to accomplish your task. – Francis Avila Dec 25 '11 at 02:31
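
The advice in the comments above (work on the parsed document rather than regex-matching its serialized form) can be sketched with the standard library alone. Note that this is not BeautifulSoup: it swaps in Python 3's `html.parser.HTMLParser` as a stand-in so the example runs without third-party packages. The class name `SodaTextParser` and the sample markup are hypothetical; only the `sodatext` class name comes from the question.

```python
from html.parser import HTMLParser


class SodaTextParser(HTMLParser):
    """Collect the text of every <div class="sodatext"> element."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &#x27; and &quot; before passing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.depth = 0      # > 0 while inside a target div
        self.chunks = []    # text pieces of the div currently being read
        self.results = []   # one string per finished target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a target div
        elif "sodatext" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical sample standing in for the downloaded page
sample = '<div class="sodatext">\n<a href="#">It&#x27;s a &quot;test&quot;</a>\n</div>'
parser = SodaTextParser()
parser.feed(sample)
parser.close()
print(parser.results)
```

Because the parser hands over text that is already decoded, the tag-stripping regex and the separate entity-replacement step both become unnecessary in this sketch.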

1 Answer

The following BeautifulSoup documentation on entity conversion should be what you're looking for:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity%20Conversion
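
As an alternative to the library's own entity conversion, and assuming Python 3 rather than the Python 2 setup in the question, the standard library's `html.unescape` decodes named, decimal and hexadecimal references in text that has already been extracted. A minimal sketch (the sample string is made up for illustration):

```python
import html

# Hypothetical scraped text containing hex, named and decimal references
raw = "It&#x27;s &lt;b&gt; &amp; it&#39;s fine"

decoded = html.unescape(raw)
print(decoded)
```

This only makes sense after the tags have been stripped or the text extracted; unescaping raw markup first would turn `&lt;b&gt;` into something that looks like a real tag.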

Andrew Clark
  • Just to point out that BS can't decode hex-encoded entities (`&#x27;`), but it works well with decimal-encoded entities (`&#39;`). So, OP needs to convert them beforehand. – Avaris Dec 22 '11 at 18:20
  • @Avaris That sounds like a bug, or at least a missing feature, in BS, TBH. – Karl Knechtel Dec 22 '11 at 21:06
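
One way to do the "convert them beforehand" step from Avaris's comment is to rewrite hexadecimal references as decimal ones before the markup reaches the parser. The helper name `hex_refs_to_decimal` is hypothetical, and this is a sketch under the assumption that the references are well formed (terminated by a semicolon), not a general-purpose fix:

```python
import re

# Matches hexadecimal character references such as &#x27; or &#X3C;
HEX_REF = re.compile(r"&#[xX]([0-9a-fA-F]+);")


def hex_refs_to_decimal(markup):
    """Rewrite &#xHH; references as their decimal &#DD; equivalents."""
    return HEX_REF.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)


converted = hex_refs_to_decimal("It&#x27;s here &#39; &amp;")
print(converted)
```

Decimal and named references are left untouched, so the result can be passed on to whatever entity conversion the parser already supports.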