
I am writing a Python script that uses Selenium to parse each page of basketball stats on ESPN over the last 18 years (each year's stats are on their own web page). I am able to connect to the site and parse without a problem; however, my results are not being printed in the terminal while the parsing is occurring. I used a regex checker to make sure the elements I am trying to grab (for now, just the value after "data-idx=" in the HTML) are correct, and they seem to be, so I am not too sure what I am doing wrong. Please see the code below:

import requests
import pandas as pd
import re
import time
from selenium import webdriver

# Initializing parameters and tools
driver = webdriver.Chrome()
url = "https://www.espn.com/nba/stats/player/_/season/$NUM$/seasontype/2/table/offensive/sort/avgPoints/dir/desc"

# Parsing the starting page to calculate total number of pages
starting_URL = url.replace("$NUM$", str(2002))
print("Starting with:" + starting_URL)
driver.get(starting_URL)
starting_page_content = driver.page_source

# Collecting stats from all pages
for i in range(2001,2020):
    page_URL = url.replace("$NUM$", str(i+1))
    print("Collecting stats from: " + page_URL)
    driver.get(page_URL)
    time.sleep(1) # a good practice is to wait a little time between each HTTP request
    page_content = driver.page_source   # getting HTML source of page i

    all_chunks = re.compile(r'Table__TR--sm(.*?)data-idx=\"([^\"]+)\"').findall(page_content)  # @UndefinedVariable
    if len(all_chunks) > 0:  # if found any
        for chunk in all_chunks:
            #initialization
            player_index=""
        
            #parsing index
            indexes = re.compile(r'data-idx=\"([^\"]+)\"stack ',re.S|re.I).findall(str(chunk))  # @UndefinedVariable
            if(len(indexes) > 0):
                player_index = indexes.group(1)[0]
            print(player_index) # printing collected data to screen

driver.close()
alex_a

1 Answer


You could use pandas.read_html to get the desired output.

dfs = pd.read_html(page_content)
dfs[0].merge(dfs[1], left_index=True, right_index=True)


RK  Name    POS GP  MIN PTS FGM FGA FG% 3PM ... FTA FT% REB AST STL BLK TO  DD2 TD3 PER
0   1   Allen IversonPHI    SG  60  43.7    31.4    11.1    27.8    39.8    1.3 ... 9.8 81.2    4.5 5.5 2.8 0.2 4.0 4   1   0.0
1   2   Shaquille O'NealLAL C   67  36.1    27.2    10.6    18.3    57.9    0.0 ... 10.7    55.5    10.7    3.0 0.6 2.0 2.6 40  0   0.0
2   3   Paul PierceBOS  SF  82  40.3    26.1    8.6 19.5    44.2    2.6 ... 7.8 80.9    6.9 3.2 1.9 1.0 2.9 17  0   0.0
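Since the live ESPN page can change, here is a minimal, self-contained sketch of how read_html pairs with merge, using an inline two-table HTML snippet (the table contents are made up for illustration; read_html needs an HTML parser such as lxml installed):

```python
import pandas as pd
from io import StringIO

# Two side-by-side tables, mimicking ESPN's rank/name table and stats table
html = """
<table><tr><th>RK</th><th>Name</th></tr>
<tr><td>1</td><td>Allen Iverson</td></tr>
<tr><td>2</td><td>Shaquille O'Neal</td></tr></table>
<table><tr><th>PTS</th><th>AST</th></tr>
<tr><td>31.4</td><td>5.5</td></tr>
<tr><td>27.2</td><td>3.0</td></tr></table>
"""

dfs = pd.read_html(StringIO(html))  # one DataFrame per <table> element
merged = dfs[0].merge(dfs[1], left_index=True, right_index=True)
print(merged)
```

read_html returns a list of DataFrames in document order, so merging on the index lines up row 0 of the rank/name table with row 0 of the stats table, which is exactly how ESPN splits its player table in two.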

A few other suggestions

  • Don't use regular expressions to parse HTML; see this famous answer.
  • Use Selenium's functionality, such as XPath, to locate your elements.
  • You can replace placeholders in strings with Python's built-in functionality, e.g. url.format(i).
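Putting the last two suggestions together, the loop could look something like the sketch below. The Selenium calls are commented out since they need a live browser, and the XPath expression is an assumption based on the Table__TR--sm class name in the question's regex:

```python
# URL template using str.format placeholders instead of manual replace()
url = ("https://www.espn.com/nba/stats/player/_/season/{}/"
       "seasontype/2/table/offensive/sort/avgPoints/dir/desc")

for year in range(2002, 2021):  # seasons 2002 through 2020
    page_URL = url.format(year)
    print("Collecting stats from: " + page_URL)
    # With a live driver, XPath replaces the regex parsing:
    # driver.get(page_URL)
    # rows = driver.find_elements("xpath", '//tr[contains(@class, "Table__TR--sm")]')
    # for row in rows:
    #     print(row.get_attribute("data-idx"))
```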
Maximilian Peters
    You sir are a lifesaver, thank you! This crawler is part of a school project, and it's actually my teacher who's been saying regex is the best way to parse HTML. The post you linked explains otherwise (with a little extra) and makes a lot of sense. As for XPath, I did try using it, but it was a little harder to get a grasp on, and I understood the regex parsing from having done it before. – alex_a Nov 23 '20 at 21:45