Python website scraping and parsing data

Question

I'm a Python beginner and I am having trouble scraping a webpage and displaying specific text from the page.

I believe my problem lies in the encoding: I have been reading about the unicode type and have seen other beginners run into the exact same issue.

For example, let's say I want to scrape www.amazon.com. This is the code I have:

import pycurl
import cStringIO
from bs4 import BeautifulSoup

buf = cStringIO.StringIO()

curl = pycurl.Curl()
curl.setopt(curl.URL, 'http://www.amazon.com')
curl.setopt(curl.WRITEFUNCTION, buf.write)
curl.perform()

result = buf.getvalue()
result = unicode(result, "ascii", errors="ignore")
buf.close()

soup = BeautifulSoup(result)
print soup.get_text()

This stores the Amazon web page in the result variable, but I get the following error when I try to use the BeautifulSoup get_text() method:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 25790: ordinal not in range(128)

How do I correctly decode the entire contents returned by my curl request?

How is this Python 3 when you have `print` as a keyword? – Paulo Bu, Feb 13 '14 at 22:02

score 4 · Answer 1 · edited May 23 '17 at 12:04

You might want to use requests instead; it's simpler and cleaner and, AFAIK, avoids the encoding issue.

from bs4 import BeautifulSoup
import requests

resp = requests.get('http://www.amazon.com')

bsoup = BeautifulSoup(resp.text, 'html.parser')  # name the parser explicitly
print(bsoup.get_text())

There are reasons to use cURL, but requests is simpler in most cases, and based on what you describe, your situation doesn't look like an exception.
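
To illustrate why the decoding step matters, here is a minimal sketch, assuming Python 3 and a UTF-8 page. The sample bytes and the decode_body helper are made up for illustration; requests does the equivalent decoding for you when you read resp.text, using the charset it finds or guesses.

```python
# Hypothetical response body: UTF-8 bytes containing non-ASCII characters.
raw = "Price: 10\u20ac \u2026".encode("utf-8")


def decode_body(raw_bytes, charset="utf-8"):
    """Decode raw response bytes with an explicit charset (hypothetical helper)."""
    return raw_bytes.decode(charset)


# What the question's code does: decode as ASCII and silently drop
# every byte that is not ASCII.
lossy = raw.decode("ascii", errors="ignore")

# Decoding with the page's actual charset keeps every character.
exact = decode_body(raw)

# ascii() escapes non-ASCII characters, so these prints are safe on any terminal.
print(ascii(lossy))
print(ascii(exact))
```

The lossy version no longer contains the euro sign or the ellipsis, so decoding as ASCII with errors="ignore" hides the problem instead of solving it.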

EDIT: To resolve the Unicode error, try explicitly encoding the string as UTF-8 (as per this SO question):

encoded = resp.text.encode('utf-8')
bsoup = BeautifulSoup(encoded, 'html.parser')  # name the parser explicitly
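
For what it's worth, the traceback says the failure happens when text is encoded to ASCII, which cannot represent u'\u2026' (the ellipsis character). Here is a small sketch of the difference, assuming Python 3; to_output_bytes is a made-up helper, not part of requests or BeautifulSoup:

```python
def to_output_bytes(text, encoding="utf-8", errors="strict"):
    """Encode text explicitly instead of relying on a default codec (hypothetical helper)."""
    return text.encode(encoding, errors)


# U+2026 HORIZONTAL ELLIPSIS is the character named in the traceback.
sample = "Loading\u2026"

# ASCII cannot represent U+2026, so strict ASCII encoding raises
# the same error class as in the question.
try:
    to_output_bytes(sample, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can encode every Unicode character.
utf8_bytes = to_output_bytes(sample)

# If the output really must be ASCII, substitute a placeholder instead of failing.
ascii_fallback = to_output_bytes(sample, "ascii", "replace")

print(ascii_failed, utf8_bytes, ascii_fallback)
```

If the error only shows up on print, the terminal's encoding is the likely culprit (that is an assumption, since I can't see your setup). Writing out the UTF-8 bytes, or setting the PYTHONIOENCODING=utf-8 environment variable, should sidestep it.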


answered Feb 13 '14 at 22:16

Moritz


Unfortunately I still get the encoding error. `UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 25921: ordinal not in range(128)` – Ciaran Feb 14 '14 at 07:14
