EDIT: I cannot believe that BeautifulSoup cannot parse HTML properly. Maybe I am doing something wrong, but if I am not, this is a really amateurish module.
I am trying to extract text from a web page, but most sentences come back with strange characters in them. I never get a sentence containing a word such as "isn't" correctly.
import urllib2
from bs4 import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()

# get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    # skip very long paragraphs
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)
print "the text is ", mytext
The result contains some strange characters:
The plural of “comedo� is comedomes�.</p>
Surprisingly, the visible black head isn’t caused by dirt
Obviously I want to get isn't instead of isn’t. What should I do?
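For reference, here is a minimal sketch that reproduces the symptom. It assumes (this is a guess, not something confirmed for the page in question) that the page is UTF-8 and that somewhere its bytes are being read as Windows-1252. The helper names `mojibake` and `repair` are made up for illustration and are not part of BeautifulSoup or urllib2.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def mojibake(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="cp1252"):
    """Undo mojibake(): recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# "isn't" written with a typographic apostrophe (U+2019), as web pages often do
original = "isn\u2019t"

# Assumption: UTF-8 bytes misread as Windows-1252 give the three-character junk
garbled = mojibake(original)

# Round trip: re-encoding with the wrong codec and decoding as UTF-8 restores it
print(repair(garbled) == original)
```

If this is indeed what happens, the three junk characters are simply the three UTF-8 bytes of the curly apostrophe shown one by one, which would point to an encoding mismatch when the text is converted or printed rather than to the HTML parsing itself.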
tags.
– Brana Feb 28 '14 at 11:01