Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
responseString = str(soup)
coarseExpression = re.compile('<div class="sodatext">[\n]*.*[\n]*</div>')
coarseResult = coarseExpression.findall(responseString)
fineExpression = re.compile('<[^>]*>')
fineResult = []
for coarse in coarseResult:
    fine = fineExpression.sub('', coarse)
    #print(fine)
    fineResult.append(fine)
Unfortunately, characters such as apostrophes come out corrupted: instead of a plain ', the output contains what appears to be an HTML character entity (something like &#39;). Is there a way to avoid this, or an easy way to replace them?
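For illustration, here is a minimal sketch of one way the replacement could be done, assuming the corrupted characters really are HTML entities and that Python 3 is available (the code above uses urllib2, so it is Python 2, where this exact import does not exist). The helper name strip_tags_and_decode and the sample string are hypothetical and not part of my script; only html.unescape and re come from the standard library.

```python
import re
from html import unescape  # Python 3 standard library

# Same tag-stripping pattern as fineExpression in the question.
TAG_PATTERN = re.compile(r'<[^>]*>')

def strip_tags_and_decode(fragment):
    """Remove markup from an HTML fragment, then decode entities.

    Hypothetical helper: tags are stripped first so that a decoded
    '&lt;' is never mistaken for the start of a tag.
    """
    text = TAG_PATTERN.sub('', fragment)
    return unescape(text)

# Made-up sample input, standing in for one entry of coarseResult.
sample = '<div class="sodatext">It&#39;s a &quot;test&quot; &amp; more</div>'
print(strip_tags_and_decode(sample))
```

In the loop above, this would amount to passing each stripped string through unescape before appending it to fineResult.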