I'm trying to follow this tutorial to learn about web scraping. Because I'm using Python 3, I've been playing around with urllib rather than urllib2 to try and request the URL correctly:
from urllib import request
# tried import urllib
# tried import urllib.request
url = "http://www.bloomberg.com/quote/SPX:IND"
raw_html = request.urlopen(url)
Nothing seemed to open the URL correctly, and I would get this error:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed.
I found a potential solution but nothing in the post mentions an error like that.
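To narrow the certificate error down, here is a minimal sketch of how I think the verification setup could be inspected using only the standard library. The helper names (build_verified_context, describe_ca_paths, fetch) are my own and purely illustrative, and I'm assuming the failure comes from Python not finding a usable CA bundle, which I haven't confirmed.

```python
import ssl
from urllib import request


def build_verified_context():
    """Return an SSL context that verifies certificates and hostnames."""
    # create_default_context() turns on certificate verification and
    # hostname checking, and loads the system's default CA certificates.
    return ssl.create_default_context()


def describe_ca_paths():
    """Return where this Python build looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {"cafile": paths.cafile, "capath": paths.capath}


def fetch(url, timeout=10):
    """Open the URL with an explicit verifying context and return raw bytes."""
    # Hypothetical helper: passes the context explicitly instead of
    # relying on urlopen's implicit default.
    context = build_verified_context()
    with request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

If describe_ca_paths() returns None for both entries, or paths that don't exist on disk, that would suggest the interpreter has no CA bundle to verify against. Calling fetch(url) would then presumably fail with the same CERTIFICATE_VERIFY_FAILED error.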
Ultimately, I really want to use the Python requests library.
import requests
from bs4 import BeautifulSoup
url = "http://www.bloomberg.com/quote/SPX:IND"
raw_html = requests.get(url)
# parse the response body with BeautifulSoup
processed_html = BeautifulSoup(raw_html.content, "html.parser")
# print('processed_html = ', processed_html)
# collect every h1 tag in the parsed document
h1 = processed_html.find_all("h1")
print('h1 = ', h1)
The problem is that I would only get the "Bloomberg" h1 tag back, but there are other h1 tags on the web page. When I look at processed_html, some of the tags and classes aren't there.
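To check whether the missing tags are in the raw response at all, here is a small sketch that counts tags with the standard library's html.parser, independent of BeautifulSoup. TagCounter and count_tags are names I made up for illustration, and SAMPLE_HTML is an invented page, not Bloomberg's actual markup. My assumption is that the page adds the extra h1 elements with JavaScript after loading, which requests would not run.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in the markup."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of <tag> start tags found in the html string."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Invented sample: one h1 in the markup, a second one that only exists
# after a browser runs the script.
SAMPLE_HTML = """
<html><body>
<h1>Bloomberg</h1>
<script>
var h = document.createElement("h1");
h.textContent = "SPX:IND";
document.body.appendChild(h);
</script>
</body></html>
"""

print(count_tags(SAMPLE_HTML, "h1"))  # prints 1
```

Running count_tags(raw_html.text, "h1") on the real response and comparing the result with what the browser's inspector shows would indicate whether the extra tags are ever in the HTML that requests receives.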
I would really love a solution to the requests library problem, but any help or direction is appreciated.