
I have been trying to create a web crawler to scrape data from Baseball Reference. While defining my crawler I realized that each player has a unique id at the end of their URL, made up of the first 6 letters of their last name, three zeroes, and the first 3 letters of their first name.

I already have a pandas dataframe with columns 'first' and 'last' containing each player's first and last names, along with a lot of other data that I downloaded from this same website.

My crawler function so far is:

import requests
from bs4 import BeautifulSoup as soup

def bbref_crawler(ID):
    url = 'https://www.baseball-reference.com/register/player.fcgi?id=' + str(ID)
    source_code = requests.get(url)
    page_soup = soup(source_code.text, features='lxml')
    return page_soup

And the code that I have so far trying to obtain the player ids is as follows:

for x in nwl_offense:
    while len(nwl_offense['last']) > 6:
        id_last = len(nwl_offense['last']) - 1
    while len(nwl_offense['first']) > 3:
        id_first = len(nwl_offense['first']) - 1
    nwl_offense['player_id'] = (str(id_first) + '000' + str(id_last))

When I run the for / while loop it never stops running, and I am not sure how else to go about automating the player id into another column of that dataframe, so that I can easily use the crawler to obtain more information on the players that I need for a project.

This is what the first 5 rows of the dataframe, nwl_offense look like:

print(nwl_offense.head())
    Rk            Name   Age     G  ...         WRC+        WRC       WSB     OWins
0  1.0     Brian Baker  20.0  14.0  ...   733.107636   2.007068  0.099775  0.189913
1  2.0    Drew Beazley  21.0  46.0  ...   112.669541  29.920766 -0.456988  2.655892
2  3.0  Jarrett Bickel  21.0  33.0  ...    85.017293  15.245547  1.419822  1.502232
3  4.0      Nate Boyle  23.0  21.0  ...  1127.591556   1.543534  0.000000  0.139136
4  5.0    Seth Brewer*  22.0  12.0  ...   243.655365   1.667671  0.099775  0.159319

 
Jensen_ray
  • one problem is that the conditions of your while loops don't change, so the loops never break. For example, if the condition `len(nwl_offense['last']) > 6` is `True`, then the first while loop will never exit unless you modify `nwl_offense['last']` within the loop – Derek O Jan 18 '22 at 01:43
  • you might want to dig a bit more into how these ids are created. If you look at this page https://www.baseball-reference.com/register/player.fcgi?initial=aa Wil Aaron and Willard Aaron would be the same in your methodology, whereas the site uses 001 and 002 to differentiate. You may have to scrape the player register to get the right name and id associated. – Jonathan Leon Jan 18 '22 at 02:55
  • Does this answer your question? [How to iterate over rows in a DataFrame in Pandas](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) – Nick ODell Jan 18 '22 at 03:10

1 Answer


As stated in the comments, I wouldn't try to create a function to construct the ids, as there will likely be some "quirky" ones in there that don't follow that logic.

I'd just go through each letter page that the site's search is divided into and get the id directly from each player's URL.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/register/player.fcgi'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

player_register_search = {}
searchLinks = soup.find('div', {'id':'div_players'}).find_all('li')
for each in searchLinks:
    links = each.find_all('a', href=True)
    for link in links:
        print(link)
        player_register_search[link.text] = 'https://www.baseball-reference.com/' + link['href']
        

tot = len(player_register_search)
playerIds = {}
for count, (k, link) in enumerate(player_register_search.items(), start=1):
    print(f'{count} of {tot} - {link}')
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    kLower = k.lower()
    playerSection = soup.find('div', {'id':f'all_players_{kLower}'})
    
    h2 = playerSection.find('h2').text
    #print('\t',h2)
    
    player_links = playerSection.find_all('a', href=True)
    for player in player_links:
        playerName = player.text.strip()
        playerId = player['href'].split('id=')[-1].strip()
        
        if playerName not in playerIds:
            playerIds[playerName] = []
            
        #print(f'\t{playerName}: {playerId}')
        playerIds[playerName].append(playerId)



df = pd.DataFrame({'Player' : list(playerIds.keys()),
                   'id': list(playerIds.values())})

Output:

print(df)
                 Player              id
0          Scott A'Hara  [ahara-000sco]
1               A'Heasy  [ahease001---]
2             Al Aaberg  [aaberg001alf]
3          Kirk Aadland  [aadlan001kir]
4            Zach Aaker  [aaker-000zac]
                ...             ...
323628      Mike Zywica  [zywica001mic]
323629  Joseph Zywiciel  [zywici000jos]
323630    Bobby Zywicki  [zywick000bob]
323631  Brandon Zywicki  [zywick000bra]
323632       Nate Zyzda  [zyzda-000nat]

[323633 rows x 2 columns]
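Since the `id` column above holds lists (a few names map to more than one id, like Evan Albrecht further down), one way to flatten it to one row per id is `DataFrame.explode`. A sketch on toy rows shaped like the real output:

```python
import pandas as pd

# Toy frame shaped like the scraped result above (ids are lists)
df = pd.DataFrame({'Player': ['Evan Albrecht', 'Kelby Golladay'],
                   'id': [['albrec001eva', 'albrec000eva'], ['gollad000kel']]})

# One row per (player, id) pair; players with two ids appear twice
flat = df.explode('id', ignore_index=True)
```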

TO GET JUST THE PLAYERS FROM YOUR DATAFRAME:

THIS IS JUST AN EXAMPLE OF YOUR DATAFRAME. DO NOT INCLUDE THIS IN YOUR CODE

# Sample of the dataframe (yours has a 'Name' column, per the question)
nwl_offense = pd.DataFrame({'Name': ['Evan Albrecht', 'Kelby Golladay']})

Use this:

# YOUR DATAFRAME - GET LIST OF NAMES
player_interest_list = list(nwl_offense['Name'])


nwl_players = df.loc[df['Player'].isin(player_interest_list)]

Output:

print(nwl_players)
                Player                            id
3095     Evan Albrecht  [albrec001eva, albrec000eva]
108083  Kelby Golladay                [gollad000kel]
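If you'd rather have the ids attached directly to your own rows instead of in a filtered copy, a left merge on the name columns is another option. A sketch on hypothetical stand-ins shaped like the two frames above:

```python
import pandas as pd

# Stand-ins for the question's dataframe and the scraped id frame
nwl_offense = pd.DataFrame({'Name': ['Evan Albrecht', 'Kelby Golladay'],
                            'Age': [21.0, 22.0]})
df = pd.DataFrame({'Player': ['Evan Albrecht', 'Kelby Golladay'],
                   'id': [['albrec001eva', 'albrec000eva'], ['gollad000kel']]})

# Left merge keeps every row of nwl_offense and pulls in matching ids
merged = nwl_offense.merge(df, left_on='Name', right_on='Player', how='left')
```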
chitown88
  • this seems to be a pretty good solution, however I don't think it will work since the player ids that I am trying to get are not on the player register, since I am trying to scrape data on players that played collegiate summer ball, specifically in the Northwoods League. I am not sure this would work for that. – Jensen_ray Jan 18 '22 at 15:11
  • Well you said you have the player names. You just filter this dataframe to only include rows of the list of player names you have.. – chitown88 Jan 18 '22 at 15:15
  • Give me a second and I can include that in the code. – chitown88 Jan 18 '22 at 15:16
  • give me an example of one of player names you are after. – chitown88 Jan 18 '22 at 15:22
  • @Jensen_ray, updated the solution. Check the bottom – chitown88 Jan 18 '22 at 15:36
  • I can't figure out how to not get an error running this; how does the variable nwl_offense not get reduced down to just those two names I put in? I think I am supposed to put the names of the first player and the last player I want to operate on? But this doesn't work for me and I am not sure how to work around this – Jensen_ray Jan 18 '22 at 17:41
  • # Sample of the dataframe nwl_offense = pd.DataFrame({'first':['Brian', 'Baker'], 'last':['Satchell', 'Wilson']}) # YOU DATAFRAME - GET LIST OF NAMES player_interest_list = list(nwl_offense1['first'] + ' ' + nwl_offense['last']) nwl_players = nwl_offense.loc[nwl_offense['Player'].isin(player_interest_list)] – Jensen_ray Jan 18 '22 at 17:42
  • when I run this I just get key errors – Jensen_ray Jan 18 '22 at 17:42
  • @Jensen_ray it shouldn't get reduced to those 2 names. That's just a sample of what I am assuming your dataset looks like. You never shared what your `nwl_offense` dataframe looks like. – chitown88 Jan 19 '22 at 08:44
  • in your original post, edit it and include what the first few rows of `nwl_offense` look like. – chitown88 Jan 19 '22 at 08:44
  • I added what the dataframe looks like just now. Apologies. – Jensen_ray Jan 20 '22 at 16:30
  • ok. So from the looks of it, you have `"Name"` as your column. I'll fix the code above – chitown88 Jan 20 '22 at 16:39
  • so you'll want to use `player_interest_list = list(nwl_offense['Name'])` to get the list of players to then use for `nwl_players = df.loc[df['Player'].isin(player_interest_list)]` – chitown88 Jan 20 '22 at 16:41