
I would like to get the data located on this page: https://www.zacks.com/stock/quote/MA

I've tried to do this with Beautiful Soup in Python but I get an error: "[WinError 10054] An existing connection was forcibly closed by the remote host".

Can someone guide me?

from bs4 import BeautifulSoup
import urllib.request

url = 'https://www.zacks.com/stock/quote/MA'

r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, "lxml")
soup

Thanks!

Vasconni
3 Answers


The website is blocking your request; the host probably rejects requests that don't carry a browser-like request header. You can simulate a "real" browser request with the Selenium package.

This is working:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window

url = 'https://www.zacks.com/stock/quote/MA'

browser = webdriver.Firefox(options=options)
browser.get(url)
html_source = browser.page_source

soup = BeautifulSoup(html_source, "lxml")
print(soup)

browser.close()
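Once the page source is parsed, you would normally pull specific elements out of the soup rather than printing the whole document. Here is a minimal sketch on a static HTML snippet; the tag and class names are invented for illustration and will not match Zacks' real markup:

```python
from bs4 import BeautifulSoup

# Static HTML standing in for browser.page_source; the structure is
# made up for illustration and does not match Zacks' real markup.
html_source = '<html><body><p class="last_price">$512.34</p></body></html>'

soup = BeautifulSoup(html_source, "html.parser")
price = soup.find("p", class_="last_price").get_text()
print(price)  # $512.34
```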
madik_atma

The page is blocking the default Python user agent. The User-Agent header basically tells the server who is making the request. Install the Python module fake-useragent and add a header to the request so that it appears to come from a regular browser such as Google Chrome or Mozilla Firefox. If you want a specific user agent, I recommend looking at fake-useragent.

I wasn't sure how to add a header with urllib, so here is a simple example using the requests module:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
header = {
    "User-Agent": ua.random
}
r = requests.get('https://www.zacks.com/stock/quote/MA', headers=header)
r.text #your html code

After this you can use Beautiful Soup with r.text like you did:

soup = BeautifulSoup(r.text, "lxml")
soup
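If you can't (or don't want to) install fake-useragent, you can also write the User-Agent string by hand and check what requests will actually send by preparing the request offline, without hitting the site at all. The string below is just an example value:

```python
import requests

# A hand-written browser-like User-Agent string (example value).
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build and prepare the request without sending it, to inspect the
# headers exactly as they would go over the wire.
req = requests.Request("GET", "https://www.zacks.com/stock/quote/MA", headers=header)
prepared = req.prepare()
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```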

EDIT:

Looking into it a bit more, if you want to do it with urllib you can do this:

import urllib.request
from fake_useragent import UserAgent

ua = UserAgent()
q = urllib.request.Request('https://www.zacks.com/stock/quote/MA')
q.add_header('User-Agent', ua.random)
a = urllib.request.urlopen(q).read()
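You can verify that the header was attached without making any network call. This sketch uses a hard-coded user-agent string in place of ua.random so it runs offline with only the standard library:

```python
import urllib.request

# Hard-coded browser-like string (example value) instead of ua.random,
# so no third-party package or network access is needed.
q = urllib.request.Request('https://www.zacks.com/stock/quote/MA')
q.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')

# urllib normalizes header names to capitalized-first-word form,
# so the stored key is 'User-agent'.
print(q.get_header('User-agent'))  # Mozilla/5.0 (X11; Linux x86_64)
```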

Taken from this answer here:

It's fatal. The remote server has sent you a RST packet, which indicates an immediate dropping of the connection, rather than the usual handshake. This bypasses the normal half-closed state transition. I like this description:

"Connection reset by peer" is the TCP/IP equivalent of slamming the phone back on the hook. It's more polite than merely not replying, leaving one hanging. But it's not the FIN-ACK expected of the truly polite TCP/IP converseur.

This happens because the User-Agent sent when making the Python request is not accepted by the queried site, so the remote web server drops the connection. Hence the connection reset error that you see. I tried a cURL request and it worked fine, so all you have to do is define your User-Agent in the header section. Something like this:

>>> import requests
>>> from bs4 import BeautifulSoup as BS
>>> header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
>>> url = 'https://www.zacks.com/stock/quote/MA'
>>> r = requests.get(url, headers=header, verify=False)
>>> soups = BS(r.text, "lxml")
>>> print(soups.prettify())

Then make the required GET requests and you should be good.
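A side note on verify=False: it disables TLS certificate verification, and requests (via urllib3) will then emit an InsecureRequestWarning on every call. Prefer dropping verify=False unless the site's certificate genuinely fails; if you do keep it, you can silence the warning like this:

```python
import warnings
import urllib3

# Silence the InsecureRequestWarning that requests raises when
# verify=False is used. Prefer removing verify=False where possible.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Confirm an 'ignore' filter for that warning is now installed.
installed = any(
    f[0] == "ignore" and f[2] is urllib3.exceptions.InsecureRequestWarning
    for f in warnings.filters
)
print(installed)  # True
```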

Devanshu Misra