
I'm trying to follow this tutorial to learn about web scraping. Because I'm using Python 3, I've been working with urllib rather than urllib2 to try to request the URL correctly:

from urllib import request
# tried import urllib
# tried import urllib.request

url = "http://www.bloomberg.com/quote/SPX:IND"
raw_html = request.urlopen(url)

None of these opened the URL correctly, and I would get this error:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed. 

I found a potential solution, but nothing in that post mentions an error like this one.
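
For what it's worth, one workaround I've seen suggested for this error (a sketch, assuming the third-party certifi package is installed) is to hand urlopen an explicit SSL context that points at a known CA bundle:

import ssl
from urllib import request

import certifi  # assumption: certifi is installed (pip install certifi)

url = "http://www.bloomberg.com/quote/SPX:IND"

# Build an SSL context that trusts certifi's CA bundle rather than the
# possibly missing or outdated system certificates.
context = ssl.create_default_context(cafile=certifi.where())
raw_html = request.urlopen(url, context=context)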

Ultimately, I really want to use the Python requests library.

import requests
from bs4 import BeautifulSoup

url = "http://www.bloomberg.com/quote/SPX:IND"
raw_html = requests.get(url)

# get in BeautifulSoup format
processed_html = BeautifulSoup(raw_html.content, "html.parser")
# print('processed_html = ', processed_html)
h1 = processed_html.find_all("h1")
print('h1 = ', h1)

The problem is that I would only get the "Bloomberg" h1 tag back, but there are other h1 tags on the web page. When I look at processed_html, some of the tags and classes aren't there.
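
A quick way to check what the server is actually sending back (a minimal diagnostic sketch):

import requests

url = "http://www.bloomberg.com/quote/SPX:IND"
raw_html = requests.get(url)

# Print the status code and the start of the body; if the server is
# returning a bot-check page instead of the real quote page, the
# tell-tale text will show up here.
print(raw_html.status_code)
print(raw_html.text[:500])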

I would really love a solution to the requests library problem, but any help or direction is appreciated.

asked by HunterLiu · edited by Nazim Kerimbekov
  • Use bs4 selectors https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors – knh190 May 21 '19 at 02:52
  • Alternative parser is `lxml` https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup – knh190 May 21 '19 at 02:53
  • What makes you say that there is more than one h1 element? Looking at the source for the link you included, a quick Ctrl-F suggests that's the expected result – jcp May 21 '19 at 03:00
  • Try a GET request with this header (see the sketch after these comments): `{'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}` – Kamal May 21 '19 at 07:08
  • @knh190 The problem is not the BeautifulSoup selector, there just isn't the right h1 tag in the processed_html, and I shouldn't have to overcomplicate things with the lxml parser. – HunterLiu May 21 '19 at 16:03
  • @osonuyi If you inspect the page, you'll see an h1 that tells you the name of the stock – HunterLiu May 21 '19 at 16:04
  • You are mixing up quite a few different questions, and I suggest you ask them separately. – knh190 May 21 '19 at 16:22
  • @knh190 I have one main question: how to scrape a web page. I have included all of the solutions that I have tried, which is what you're supposed to do for a question... – HunterLiu May 21 '19 at 17:46
  • I'd suggest you search for some tutorials on `requests` or `scrapy`. – knh190 May 21 '19 at 17:55
  • @knh190 I've already tried, but no success, that's why I'm asking a question on stackoverflow... – HunterLiu May 21 '19 at 18:08
  • @HunterLiu I was able to reproduce this, and the HTML being returned seems to indicate that the Bloomberg server is detecting "unusual behavior" and thus serving a different page than what you see when you visit through your browser. This is the error I see in the HTML: "We've detected unusual activity from your computer network. To continue, please click the box below to let us know you're not a robot." It might be helpful to look into the Selenium Chrome webdriver to circumvent this, but it's not guaranteed. – jcp May 23 '19 at 03:22
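
Putting Kamal's header suggestion into runnable form (a sketch; the User-Agent string is copied from the comment above, and getting past Bloomberg's bot detection is not guaranteed):

import requests
from bs4 import BeautifulSoup

url = "http://www.bloomberg.com/quote/SPX:IND"

# Browser-like User-Agent string from Kamal's comment; some servers serve
# different content to clients that identify themselves as scripts.
headers = {
    "user-agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
    )
}
raw_html = requests.get(url, headers=headers)
processed_html = BeautifulSoup(raw_html.content, "html.parser")
print(processed_html.find_all("h1"))

If this still returns the robot-check page, Selenium with the Chrome webdriver (per jcp's comment) drives a real browser and is the usual next step.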

1 Answer


In the tutorial, you should have read:

Scraping Rules

You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.

Now either a) this tutorial was written a while back, before the site had updated its Terms of Service, b) the authors had acquired written consent to scrape the site, or c) the tutorial blatantly ignored its own advice. I will give them the benefit of the doubt that the Terms of Service have changed since the article was written two years ago, or that they got written permission, but if you read the site's Terms of Service, you will find:

... You shall not use or attempt to use any “scraper,” “robot,” “bot,” “spider,” “data mining,” “computer code,” or any other automated device, program, tool, algorithm, process or methodology to access, acquire, copy, or monitor any portion of the Service, any data or content found on or accessed through the Service, or any other Service information without the prior express written consent of BLP. You may not forge headers or otherwise manipulate identifiers in order to disguise the origin of any other content.

So I'd suggest finding a different site to practice on; the process is the same.

answered by chitown88
  • Normally, site scraping ethics are laid out in the robots.txt. From what I see, the Bloomberg site is okay for scraping, since the user-agent has a wildcard designation (see the robots.txt sketch after these comments) – jcp May 23 '19 at 03:27
  • @osonuyi Then those ethics only apply given that you have their consent. The Terms of Service **CLEARLY** states it's not okay to scrape without prior express written consent. – chitown88 May 23 '19 at 10:35
  • Yeah, good point actually. Here's a relevant Law Stack Exchange thread on this topic with more detail, for future reference: https://law.stackexchange.com/questions/58817/discrepancy-between-robots-txt-and-tos – jcp Mar 15 '21 at 02:56
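
For reference, a minimal sketch of checking robots.txt programmatically with the standard library's urllib.robotparser (keeping in mind, per the answer above, that robots.txt and a site's Terms of Service are separate things):

from urllib import robotparser

# Parse Bloomberg's robots.txt and ask whether a generic crawler ("*")
# may fetch the quote page. This reflects robots.txt only; it says
# nothing about what the Terms of Service allow.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.bloomberg.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.bloomberg.com/quote/SPX:IND"))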