
I am writing a Python script that uses Selenium to parse each page of basketball stats on ESPN over the last 18 years (each year's stats are on their own web page). I am able to connect to the site and parse without a problem; however, my results are not being printed in the terminal while the parsing is occurring. I used a regex checker to make sure the elements I am trying to grab (for now, just the value after "data-idx=" in the HTML) are correct, and they seem to be, so I am not too sure what I am doing wrong. Please see the code below:

import requests
import pandas as pd
import re
import time
from selenium import webdriver

# Initializing parameters and tools
driver = webdriver.Chrome()
url = "https://www.espn.com/nba/stats/player/_/season/$NUM$/seasontype/2/table/offensive/sort/avgPoints/dir/desc"

# Parsing the starting page to calculate total number of pages
starting_URL = url.replace("$NUM$", str(2002))
print("Starting with:" + starting_URL)
driver.get(starting_URL)
starting_page_content = driver.page_source

# Collecting stats from all pages
for i in range(2001,2020):
    page_URL = url.replace("$NUM$", str(i+1))
    print("Collecting stats from: " + page_URL)
    driver.get(page_URL)
    time.sleep(1) # a good practice is to wait a little time between each HTTP request
    page_content = driver.page_source   # getting HTML source of page i

    all_chunks = re.compile(r'Table__TR--sm(.*?)data-idx=\"([^\"]+)\"').findall(page_content)  # @UndefinedVariable
    if len(all_chunks) > 0:  # if found any
        for chunk in all_chunks:
            #initialization
            player_index=""
        
            #parsing index
            indexes = re.compile(r'data-idx=\"([^\"]+)\"stack ',re.S|re.I).findall(str(chunk))  # @UndefinedVariable
            if(len(indexes) > 0):
                player_index = indexes.group(1)[0]
            print(player_index) # printing collected data to screen

driver.close()
alex_a

1 Answer


You could use pandas.read_html to get the desired output.

dfs = pd.read_html(page_content)
dfs[0].merge(dfs[1], left_index=True, right_index=True)


RK  Name    POS GP  MIN PTS FGM FGA FG% 3PM ... FTA FT% REB AST STL BLK TO  DD2 TD3 PER
0   1   Allen IversonPHI    SG  60  43.7    31.4    11.1    27.8    39.8    1.3 ... 9.8 81.2    4.5 5.5 2.8 0.2 4.0 4   1   0.0
1   2   Shaquille O'NealLAL C   67  36.1    27.2    10.6    18.3    57.9    0.0 ... 10.7    55.5    10.7    3.0 0.6 2.0 2.6 40  0   0.0
2   3   Paul PierceBOS  SF  82  40.3    26.1    8.6 19.5    44.2    2.6 ... 7.8 80.9    6.9 3.2 1.9 1.0 2.9 17  0   0.0
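Since the live ESPN page can change, here is a minimal, self-contained sketch of how read_html pairs with merge, using an inline two-table HTML snippet (the table contents are made up for illustration; read_html needs an HTML parser such as lxml installed):

```python
import pandas as pd
from io import StringIO

# Two side-by-side tables, mimicking ESPN's rank/name table and stats table
html = """
<table><tr><th>RK</th><th>Name</th></tr>
<tr><td>1</td><td>Allen Iverson</td></tr>
<tr><td>2</td><td>Shaquille O'Neal</td></tr></table>
<table><tr><th>PTS</th><th>AST</th></tr>
<tr><td>31.4</td><td>5.5</td></tr>
<tr><td>27.2</td><td>3.0</td></tr></table>
"""

dfs = pd.read_html(StringIO(html))  # one DataFrame per <table> element
merged = dfs[0].merge(dfs[1], left_index=True, right_index=True)
print(merged)
```

read_html returns a list of DataFrames in document order, so merging on the index lines up row 0 of the rank/name table with row 0 of the stats table, which is exactly how ESPN splits its player table in two.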

A few other suggestions

  • Don't use regular expressions to parse HTML; see this famous answer.
  • Use Selenium's functionality, such as XPath, to locate your elements.
  • You can replace placeholders in strings with Python's built-in functionality, e.g. url.format(i).
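Putting the last two suggestions together, the loop could look something like the sketch below. The Selenium calls are commented out since they need a live browser, and the XPath expression is an assumption based on the Table__TR--sm class name in the question's regex:

```python
# URL template using str.format placeholders instead of manual replace()
url = ("https://www.espn.com/nba/stats/player/_/season/{}/"
       "seasontype/2/table/offensive/sort/avgPoints/dir/desc")

for year in range(2002, 2021):  # seasons 2002 through 2020
    page_URL = url.format(year)
    print("Collecting stats from: " + page_URL)
    # With a live driver, XPath replaces the regex parsing:
    # driver.get(page_URL)
    # rows = driver.find_elements("xpath", '//tr[contains(@class, "Table__TR--sm")]')
    # for row in rows:
    #     print(row.get_attribute("data-idx"))
```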
Maximilian Peters
    You sir are a lifesaver, thank you! This crawler is part of a school project, and it's actually my teacher who's been saying regex is the best way to parse HTML. The post you linked explains otherwise (with a little extra) and makes a lot of sense. As for XPath, I did try using it, but it was a little harder to get a grasp on, and I understood the regex parsing from having done it before. – alex_a Nov 23 '20 at 21:45