
I have a unique situation while trying to scrape a website. I'm searching hundreds of names through the search bar and then scraping tables. However, some names are spelled differently on my list than on the site. I looked up a couple of such names on the site manually, and the search still takes me directly to the individual page. Other times it goes to a list of names when there are multiple guys with the same or similar name (in that case, I want the person who played in the NBA; I've already accounted for this, but I think it's necessary to mention).

How do I still get into those players' individual pages instead of having to run the script every time, hit the error, and see which player has a slightly different spelling? Again, a name in the array will either take you directly to the individual page even if spelled slightly differently, or to a list of names (where I need the one in the NBA). Some examples: Georgios Papagiannis (listed as George Papagiannis on the website), Ognjen Kuzmic (listed as Ognen Kuzmic), and Nene (listed as Maybyner Nene, but the search takes you to a list of names -- https://basketball.realgm.com/search?q=nene). This seems pretty tough, but I feel like it might be possible.

Also, it seems that rather than writing all the scraped data to the CSV, it gets overwritten each time with the next player. Thanks a ton.

The error I get: AttributeError: 'NoneType' object has no attribute 'text'

import requests
from bs4 import BeautifulSoup
import pandas as pd


playernames=['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']

result = pd.DataFrame()
for name in playernames:

    fname=name.split(" ")[0]
    lname=name.split(" ")[1]
    url="https://basketball.realgm.com/search?q={}+{}".format(fname,lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if soup.find('a',text=name).text==name:
        url="https://basketball.realgm.com"+soup.find('a',text=name)['href']
        print(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')

    try:
        table1 = soup.find('h2',text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2',text='International Regular Season Stats - Advanced Stats').findNext('table')

        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]

        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name
        print(df)
    except:
        print ('No international table for %s.' %name)
        df = pd.DataFrame([name], columns=['Player'])

result = result.append(df, sort=False).reset_index(drop=True)

cols = list(result.columns)
cols = [cols[-1]] + cols[:-1]
result = result[cols]
result.to_csv('international players.csv', index=False)
J. Doe
    The resulting url of your search provides a clue: if the url contains "player" then you can go ahead and scrape the desired table. If it does not - and this isn't foolproof if the search results table lists more than one NBA player (the Nene example doesn't) - search the table for values in the NBA column. If there are values there, then grab the href from the player result in that row. – foszter Jan 24 '20 at 21:33
  • It's not clear what problem you have – Sers Jan 25 '20 at 19:32
  • So if you go on https://basketball.realgm.com/ and type Raul Neto into the search bar, it will go to his page, which is great. However, my code will throw an error ```AttributeError: 'NoneType' object has no attribute 'text'``` because the name on his individual page is Raulzinho Neto. I want to be able to still scrape the tables from his page without having to change the name on my list to Raulzinho Neto. foszter suggests using "player" from the url to deal with this, but I'm not sure how. Is that more clear? – J. Doe Jan 25 '20 at 20:03
  • Other times, my list will have the name Nene, and no last name, which I accounted for, and it will go to a list of players with similar names (https://basketball.realgm.com/search?q=nene). From there, I want to go to the person who played in the NBA, which I also accounted for. In this case, I want to go to the page of Maybyner Nene, and as you can see, he is the only one who played in the NBA. However, I face the same error as in the Raul Neto example – J. Doe Jan 25 '20 at 20:05
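A minimal sketch of foszter's URL heuristic from the comments, assuming (as the thread suggests) that individual pages on the site contain /player/ in their path while search-results pages keep the /search?q= form; the helper name and example URLs are illustrative:

```python
# Sketch of foszter's suggestion: after requests follows the search,
# response.url holds the final URL, so its shape tells you where you landed.
def landed_on_player_page(final_url):
    # assumption from the comment thread: player pages contain "/player/"
    return '/player/' in final_url

print(landed_on_player_page('https://basketball.realgm.com/player/Nene/Summary/448'))  # True
print(landed_on_player_page('https://basketball.realgm.com/search?q=nene'))            # False
```

In the scraping loop you would call this on `response.url` right after the `requests.get(url)` for the search, instead of matching the link text against the name from your list.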

1 Answer


I used a loop for NBA players with similar names. You can use the CSS selector below to get the NBA players from the search-results table:

.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]

CSS selector meaning: find the table by its tablesaw class, take its tr rows that have a child a whose href contains /nba/teams/, then within those rows select the a whose href contains /player/.
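To see what that selector matches, here it is run against a small made-up HTML snippet shaped like the search-results table (the hrefs and the second player are invented for illustration; `:has()` needs BeautifulSoup 4.7+):

```python
from bs4 import BeautifulSoup

html = """
<table class="tablesaw">
  <tr><td><a href="/player/Nene/Summary/448">Maybyner Nene</a></td>
      <td><a href="/nba/teams/Houston-Rockets/10">Houston Rockets</a></td></tr>
  <tr><td><a href="/player/Nene-Silva/Summary/999">Nene Silva</a></td>
      <td>-</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
# only the first row has an NBA-team link, so only its player link matches
players = soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]')
print([a.text for a in players])  # ['Maybyner Nene']
```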

I added Search Player Name and Real Player Name columns so that you can see how each player was found. These columns are placed first and second using insert (see the comment in the code).

import requests
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame

base_url = 'https://basketball.realgm.com'
player_names = ['Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot']

result = pd.DataFrame()


def get_player_stats(search_name=None, real_name=None, player_soup=None):
    table_per_game = player_soup.find('h2', text='International Regular Season Stats - Per Game')
    table_advanced_stats = player_soup.find('h2', text='International Regular Season Stats - Advanced Stats')

    if table_per_game and table_advanced_stats:
        print('International table for %s.' % search_name)

        df1 = pd.read_html(str(table_per_game.findNext('table')))[0]
        df2 = pd.read_html(str(table_advanced_stats.findNext('table')))[0]

        common_cols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=common_cols)

        # insert name columns for the first positions
        df.insert(0, 'Search Player Name', search_name)
        df.insert(1, 'Real Player Name', real_name)
    else:
        print('No international table for %s.' % search_name)
        df = pd.DataFrame([[search_name, real_name]], columns=['Search Player Name', 'Real Player Name'])

    return df


for name in player_names:
    url = f'{base_url}/search?q={name.replace(" ", "+")}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    if url == response.url:
        # Get all NBA players
        for player in soup.select('.tablesaw tr:has(a[href*="/nba/teams/"]) a[href*="/player/"]'):
            response = requests.get(base_url + player['href'])
            player_soup = BeautifulSoup(response.content, 'lxml')
            player_data = get_player_stats(search_name=name, real_name=player.text, player_soup=player_soup)
            result = result.append(player_data, sort=False).reset_index(drop=True)
    else:
        player_data = get_player_stats(search_name=name, real_name=name, player_soup=soup)
        result = result.append(player_data, sort=False).reset_index(drop=True)

result.to_csv('international players.csv', index=False)
# Append to existing file
# result.to_csv('international players.csv', index=False, mode='a')
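One fragile spot in the loop above is the query string: `name.replace(" ", "+")` works for plain ASCII names, but `urllib.parse.quote_plus` from the standard library handles spaces and accented characters (e.g. Timothé) in one step. A small sketch, with the helper name chosen here for illustration:

```python
from urllib.parse import quote_plus

base_url = 'https://basketball.realgm.com'

def search_url(name):
    # quote_plus turns spaces into '+' and percent-encodes anything unsafe
    return f'{base_url}/search?q={quote_plus(name)}'

print(search_url('Yao Ming'))        # https://basketball.realgm.com/search?q=Yao+Ming
print(search_url('Timothé Luwawu'))  # accented characters are percent-encoded
```

The `url == response.url` redirect check still works unchanged, since both sides use the same encoded URL.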
Sers
  • Hey, sorry I was away for a while. I'm getting a syntax error ```SyntaxError: invalid syntax``` for get_player_stats. It looks OK to me, but it seems like something is wrong with the way you wrote the arguments. Did it run for you? – J. Doe Jan 26 '20 at 01:54
  • Replace with - def get_player_stats(search_name = None, real_name = None, player_soup = None): – Sers Jan 26 '20 at 08:05
  • I was curious: if I wanted to use a list of names that exists in an Excel file and grab them to do what this script is doing, is it possible? I can't really find anything that will allow me to continuously add the same name for each row. So if 'Carlos Delfino', 'Nene', 'Yao Ming', 'Marcus Vinicius', 'Raul Neto', 'Timothe Luwawu-Cabarrot' are in the first column, is it possible to add more rows to each of those names for every season they have? Pretty much what the script is already doing, but getting names from an existing file and adding the scraped info to the file – J. Doe Jan 30 '20 at 18:59
  • You can read data from Excel, but it's not so simple. Also, you can get all player names and data by season without entering them manually – Sers Jan 30 '20 at 20:55
  • Yes, this was amazing, thank you for that. I just want to see if I can make the task even easier. My other question was regarding pandas, which I am not too familiar with. How do list(), set(), and merge() work exactly in this situation? I have some idea, but it's not completely clear – J. Doe Jan 30 '20 at 21:04
  • I cannot help in that, better to read documentation with examples – Sers Jan 30 '20 at 21:06
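On that last question (how list(), set(), and merge() interact in the script), a tiny self-contained example with two made-up stat tables that share the Season and Team columns, like the per-game and advanced-stats tables on a player page:

```python
import pandas as pd

# toy stand-ins for the two scraped tables
df1 = pd.DataFrame({'Season': ['2010-11'], 'Team': ['FCB'], 'PPG': [12.3]})
df2 = pd.DataFrame({'Season': ['2010-11'], 'Team': ['FCB'], 'PER': [18.1]})

# set(...) & set(...) is set intersection: the column names both tables share.
# list(...) turns that set back into a list, which merge() accepts for `on`.
common_cols = list(set(df1.columns) & set(df2.columns))

# the left-merge matches rows on the shared key columns and joins
# df2's remaining columns (PER) onto df1
df = df1.merge(df2, how='left', on=common_cols)
print(sorted(common_cols))  # ['Season', 'Team']
print(list(df.columns))     # ['Season', 'Team', 'PPG', 'PER']
```

Sets are unordered, which is why the script's merged key list can come out in any order; merge() doesn't care about that order, only the names.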