0

I have attempted several methods to pull links from the following webpage, but can't seem to find the desired links. From this webpage (https://www.espn.com/collegefootball/scoreboard//year/2019/seasontype/2/week/1) I am attempting to extract all of the links for the "gamecast" button. The example of the first one I would be attempting to get is this: https://www.espn.com/college-football/game//gameId/401110723

When I try to just pull all links on the page I do not even seem to get the desired ones at all, so I'm confused where I'm going wrong here. A few attempts I have made below that don't seem to be pulling in what I want. First method I tried below.

import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')

Here is a second method I tried. Any help is greatly appreciated.

for a in soup.find_all('a'):
if 'college-football' in a['href']:
print(link['href'])

Edit: as a clarification I am attempting to pull all links that contain a gameID as in the example link.

user2355903
  • 593
  • 2
  • 8
  • 29
  • possible duplicate to: https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup – Ammar Sep 25 '19 at 20:48
  • I have referenced that page, and have one of the solutions from in my question as one that doesn't work. As far as I can tell it does not work for my issue. – user2355903 Sep 25 '19 at 20:50

1 Answers1

0

The button with the link you are trying to have is loaded with javascript. The requests module does not load the javascript in the html it is searching through. Therefore, you cannot scrape the button directly to find the links you desire (without a web page simulator like Selenium). However, I found json data in the html that contains the scoreboard data in which the link is located in. If you are also looking to scrape more information (times, etc.) from this page, I highly recommend looking through the json data in the variable json_scoreboard in the code.

Code

import requests, re, json
from bs4 import BeautifulSoup

r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')

scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    gamecast = link['href']
            all_links[name] = gamecast

print(all_links)

Output

{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723', 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653', 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731', 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223', 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853', 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212', 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856', 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720', 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732', 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251', 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725', 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222', 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135', 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722', 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233', 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257', 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191', 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085', 'Syracuse Orange at Liberty Flames': 'http://www.espn.com/college-football/game/_/gameId/401112434', 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228', 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238', 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245', 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114', 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
LuckyZakary
  • 1,171
  • 1
  • 7
  • 19
  • This is awesome, thank you. The one thing I'm having a little trouble with is turning this dictionary into a more traditional dataframe where the key (Miami Hurricanes at Florida Gators) is in the first column and the value (actual link) is in the second. Is this something that should be asked in a new question or addressed here? – user2355903 Sep 25 '19 at 22:16
  • Actually I just figured it out, for any interested. data2 = pd.DataFrame.from_dict(all_links,orient='index') – user2355903 Sep 25 '19 at 22:19