I have been working on a project that scrapes data from collegiate baseball websites in order to build a database that will automatically update itself during the season. Most universities' baseball statistics pages have a table set up like this one: https://dupanthers.com/sports/baseball/stats/2021?path=baseball . I have been able to create a scraper that gets all of the data I need from tables like that. However, some teams have websites that look like this: https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos= . I have not been able to successfully scrape this kind of table, and I cannot figure out why it will not work. My code for this is below:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as soup

    def parse_row(row):
        # Pull the text out of every cell in a table row
        return [str(x.string) for x in row.find_all('td')]

    sv_page = requests.get('https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos=')

    sv_soup = soup(sv_page.text, features='lxml')

    sv_rows = sv_soup.find_all('tr')

    lopr_sv = [parse_row(row) for row in sv_rows]

    sv_df = pd.DataFrame(lopr_sv)

After running this code (using BeautifulSoup and pandas in Python) and printing the 'soup' of the HTML, I get this response, which always tells me to try again later or that the request was blocked:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>ERROR: The request could not be satisfied</title>
</head><body>
<h1>403 ERROR</h1>
<h2>The request could not be satisfied.</h2>
<hr noshade="" size="1px"/>
Request blocked.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<br clear="all"/>
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<br clear="all"/>
<hr noshade="" size="1px"/>
<pre>
Generated by cloudfront (CloudFront)
Request ID: 9UFYpRiwVca0PJQBM-e58qRKGiJ9qJson0T1XTwVqwSsZU7HoumG6g==
</pre>
<address>
</address>
</body></html>

I am not sure how to go about getting around this issue. Thanks.

Jensen_ray

1 Answer


You can set request headers like the following so that you won't get a 403:

In [12]: headers = {'User-Agent': '...'}

In [13]: sv_page = requests.get('https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos= ', headers=headers)

In [14]: sv_page
Out[14]: <Response [200]>

Chrome uses the following value for the User-Agent header:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
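Putting that header together with the code from your question, a minimal sketch might look like this (assuming the lxml parser is installed; the parsing logic itself is unchanged from your question):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup as soup

    # Send a browser-like User-Agent so CloudFront does not respond with a 403
    headers = {
        'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/96.0.4664.110 Safari/537.36')
    }

    def parse_row(row):
        # Pull the text out of every cell in a table row
        return [str(x.string) for x in row.find_all('td')]

    sv_page = requests.get(
        'https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos=',
        headers=headers,
    )

    sv_soup = soup(sv_page.text, features='lxml')
    sv_rows = sv_soup.find_all('tr')
    sv_df = pd.DataFrame([parse_row(row) for row in sv_rows])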

You can refer to Python requests. 403 Forbidden

reddy nishanth