I have been trying to create a web crawler to scrape data from a website called Baseball Reference. While defining my crawler I realized that each player has a unique ID at the end of their URL, made up of the first 6 letters of their last name, three zeroes, and the first 3 letters of their first name.
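To make the pattern concrete, here is a minimal sketch of the ID as described above. The helper name `make_player_id` is hypothetical, and lowercasing the result is an assumption, since the description does not say how case is handled. Very short last names or players who share a name may need extra handling that this sketch does not cover.

```python
def make_player_id(first, last):
    """Build an ID: first 6 letters of the last name + '000' + first 3 letters of the first name."""
    # Slicing past the end of a short string is safe in Python; it just returns the whole string.
    # Lowercasing is an assumption, not something stated in the description.
    return last[:6].lower() + "000" + first[:3].lower()


print(make_player_id("Jarrett", "Bickel"))
```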
I already have a pandas dataframe with columns 'first' and 'last' holding each player's first and last names, along with a lot of other data that I downloaded from this same website.
My crawler function so far is as follows:
def bbref_crawler(ID):
    url = 'https://www.baseball-reference.com/register/player.fcgi?id=' + str(ID)
    source_code = requests.get(url)
    page_soup = soup(source_code.text, features='lxml')
And the code that I have so far to build the player IDs is as follows:
for x in nwl_offense:
    while len(nwl_offense['last']) > 6:
        id_last = len(nwl_offense['last']) - 1
    while len(nwl_offense['first']) > 3:
        id_first = len(nwl_offense['first']) - 1
    nwl_offense['player_id'] = (str(id_first) + '000' + str(id_last))
When I run the for / while loop, it never stops running. I am not sure how else to automate putting the player ID into another column of that dataframe, so that I can easily use the crawler to obtain more information on the players I need for a project.
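One likely reason the loop never ends: `len(nwl_offense['last'])` is the number of rows in the column, not the length of any one name, and nothing inside the `while` body changes it. Once the dataframe has more than 6 rows, the condition stays true forever. A minimal sketch of a loop-free alternative is below, using pandas string slicing on the whole column. The small dataframe is a hypothetical stand-in for `nwl_offense`, and lowercasing is an assumption about the ID format.

```python
import pandas as pd

# Hypothetical stand-in for nwl_offense, with only the two name columns.
nwl_offense = pd.DataFrame({
    "first": ["Brian", "Drew", "Jarrett"],
    "last": ["Baker", "Beazley", "Bickel"],
})

# .str[:n] slices every string in the column at once, so no explicit loop is needed.
id_last = nwl_offense["last"].str[:6].str.lower()
id_first = nwl_offense["first"].str[:3].str.lower()

# Last-name part, three zeroes, then first-name part, matching the description of the URL.
nwl_offense["player_id"] = id_last + "000" + id_first

print(nwl_offense)
```

Each value in the new column could then be passed to `bbref_crawler`, for example by iterating over `nwl_offense['player_id']`.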
This is what the first 5 rows of the dataframe, nwl_offense, look like:
print(nwl_offense.head())
    Rk            Name   Age     G  ...         WRC+        WRC       WSB     OWins
0  1.0     Brian Baker  20.0  14.0  ...   733.107636   2.007068  0.099775  0.189913
1  2.0    Drew Beazley  21.0  46.0  ...   112.669541  29.920766 -0.456988  2.655892
2  3.0  Jarrett Bickel  21.0  33.0  ...    85.017293  15.245547  1.419822  1.502232
3  4.0      Nate Boyle  23.0  21.0  ...  1127.591556   1.543534  0.000000  0.139136
4  5.0    Seth Brewer*  22.0  12.0  ...   243.655365   1.667671  0.099775  0.159319