I'm a Python beginner and I am having trouble scraping a webpage and displaying specific text from the page.
I suspect the problem lies in the encoding: I have been reading about the unicode type and have seen other beginners run into the exact same issue.
For example, let's say I want to scrape www.amazon.com. This is the code I have:
import pycurl
import cStringIO
from bs4 import BeautifulSoup
buf = cStringIO.StringIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, 'http://www.amazon.com')
curl.setopt(curl.WRITEFUNCTION, buf.write)
curl.perform()
result = buf.getvalue()
result = unicode(result, "ascii", errors="ignore")
buf.close()
soup = BeautifulSoup(result)
print soup.get_text()
This stores the Amazon web page in the result variable, but I get the following error when trying to use BeautifulSoup's get_text() method:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 25790: ordinal not in range(128)
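As far as I can tell, the failure does not depend on pycurl or on Amazon specifically. The sketch below is a minimal reproduction with no network access, under the assumption that something in my script is implicitly encoding the text as ASCII. The sample string and the try_encode helper are made up for illustration; only the character u'\u2026' comes from the actual error message.

```python
# Minimal reproduction, no network needed.
# Assumption: the offending character is U+2026 (horizontal ellipsis),
# as reported in the error message; the surrounding text is invented.
text = u"Read more\u2026"


def try_encode(value, codec):
    """Return the encoded bytes, or the exception if encoding fails."""
    try:
        return value.encode(codec)
    except UnicodeEncodeError as exc:
        return exc


# ASCII only covers code points 0-127, so U+2026 cannot be encoded.
ascii_result = try_encode(text, "ascii")

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_result = try_encode(text, "utf-8")

print(type(ascii_result).__name__)
print(repr(utf8_result))
```

The ASCII attempt fails with the same exception type I see in my script, while the UTF-8 attempt succeeds, which makes me think the text has to be explicitly encoded somewhere before it is output.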
How do I correctly decode the entire contents returned by my curl request?