I have been working on a project that scrapes data from collegiate baseball websites in order to build a database that updates itself automatically during the season. Most universities' baseball statistics pages have a table set up like this one: https://dupanthers.com/sports/baseball/stats/2021?path=baseball . I have been able to write a scraper that gets all of the data I need from tables like this. However, some teams' websites look like this instead: https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos= . I have not been able to successfully scrape this kind of table, and I cannot figure out why it does not work. My code for this is below:
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup

def parse_row(row):
    # Return the text of every <td> cell in one table row
    return [str(x.string) for x in row.find_all('td')]

sv_page = requests.get('https://www.chargerathletics.com/sports/bsb/2020-21/teams/dominicanny?view=lineup&r=0&pos=')
sv_soup = soup(sv_page.text, features='lxml')
sv_rows = sv_soup.find_all('tr')
lopr_sv = [parse_row(row) for row in sv_rows]
sv_df = pd.DataFrame(lopr_sv)
When I run this code (Python with BeautifulSoup and pandas) and print the soup, I do not get the stats page. Instead I get this 403 error page, which always tells me to try again later or contact the site owner:
{<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>ERROR: The request could not be satisfied</title>
</head><body>
<h1>403 ERROR</h1>
<h2>The request could not be satisfied.</h2>
<hr noshade="" size="1px"/>
Request blocked.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<br clear="all"/>
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<br clear="all"/>
<hr noshade="" size="1px"/>
<pre>
Generated by cloudfront (CloudFront)
Request ID: 9UFYpRiwVca0PJQBM-e58qRKGiJ9qJson0T1XTwVqwSsZU7HoumG6g==
</pre>
<address>
</address>
</body></html>}
I am not sure how to get around this issue. Thanks.
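One thing that may be worth trying, offered as a sketch rather than a confirmed fix: the 403 page above comes from CloudFront, not from the stats page itself, so the request is being refused before any table HTML is returned. A common cause (an assumption here, not something the response confirms) is that the default client User-Agent is rejected. The sketch below sends browser-like headers and checks the response before parsing. It uses the standard library's `urllib` so it has no extra dependencies; the same headers dict can be passed to `requests.get(url, headers=...)`. The names `BROWSER_HEADERS`, `build_request`, `looks_blocked`, and `fetch_html` are hypothetical, and the User-Agent string is only illustrative.

```python
import urllib.error
import urllib.request

# Assumption: a browser-like User-Agent may get past the block.
# The exact string is illustrative, not a value the site is known to require.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
}


def build_request(url):
    """Return a Request for url (whitespace stripped) with the headers above."""
    return urllib.request.Request(url.strip(), headers=BROWSER_HEADERS)


def looks_blocked(status, body):
    """Heuristic: True when the response is an error page, not the stats page."""
    return status >= 400 or "Request blocked" in body


def fetch_html(url, timeout=30):
    """Fetch url and return its HTML, raising RuntimeError if it was refused."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; keep the status and body for the check below
        status = err.code
        body = err.read().decode("utf-8", errors="replace")
    if looks_blocked(status, body):
        raise RuntimeError(f"Request refused with HTTP {status}")
    return body
```

With this in place, `soup(fetch_html(url), features='lxml')` would either parse the real page or fail loudly, instead of silently building a DataFrame from an error page. If the request is still refused with these headers, the site may be blocking automated access deliberately, in which case checking its terms of use or contacting the site owner is the safer route.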