I am trying to learn how to scrape data from a webpage in Python and am running into trouble with how to structure my nested loops. I received some assistance with the scraping itself in an earlier question (How to pull links from within an 'a' tag). I am now trying to have that code iterate through different weeks (and eventually years) of webpages. What I have currently is below, but it is not iterating through the two weeks I would like it to and saving the results off.

import requests, re, json
import pandas as pd
from bs4 import BeautifulSoup
weeks=['1','2']
data = pd.DataFrame(columns=['Teams','Link'])

scripts_head = soup.find('head').find_all('script')
all_links = {}
for i in weeks:
    r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/2018/seasontype/2/week/'+i)
    soup = BeautifulSoup(r.text, 'html.parser')
    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
        #Append new data to existing data        
        data=data.append(data2,ignore_index = True)


#Save dataframe with all links to csv for future use
data.to_csv(r'game_id_data.csv')

Edit: To add some clarification: the code is creating duplicates of the data from one week and repeatedly appending them to the end. I also edited the code to include the proper libraries, so it can be copied, pasted, and run in Python.

user2355903
    Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [Minimal, complete, verifiable example](https://stackoverflow.com/help/minimal-reproducible-example) applies here. We cannot effectively help you until you post your MCVE code and accurately specify the problem. We should be able to paste your posted code into a text file and reproduce the problem you specified. "It is not working" is not a problem specification. – Prune Sep 26 '19 at 17:46
  • Please see the edited question and let me know if there is still an issue. – user2355903 Sep 26 '19 at 18:32

2 Answers

The problem is in your loop logic:

    if 'window.espn.scoreboardData' in script.text:
        ...
            data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
    #Append new data to existing data        
    data=data.append(data2,ignore_index = True)

The indentation of the last line is wrong. As written, you append `data2` on every pass through the script loop, regardless of whether that script contained new scoreboard data; when it doesn't, you skip the `if` body and simply append the previous value of `data2` again.
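For illustration, a minimal sketch of that fix, assuming the rest of the question's loop stays unchanged (only the last two lines move under the `if`):

    for script in scripts_head:
        if 'window.espn.scoreboardData' in script.text:
            json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
            for event in json_scoreboard['events']:
                name = event['name']
                for link in event['links']:
                    if link['text'] == 'Gamecast':
                        gamecast = link['href']
                all_links[name] = gamecast
            # Build the frame only when this script actually held scoreboard data
            data2 = pd.DataFrame(list(all_links.items()), columns=['Teams','Link'])
            # Append inside the if, so a stale data2 is never re-appended
            data = data.append(data2, ignore_index=True)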

Prune
  • So I've tried that last line at every possible indentation, and none of them returns correctly. Are you saying I need to add logic that deletes the previous run before the next iteration? – user2355903 Sep 26 '19 at 18:48
  • No, just that the immediate problem is to put that `append` command *under* the `if` rather than afterward. If you have further problems, please post a new question or update the existing one as suggested in the MCVE description. – Prune Sep 26 '19 at 21:23

So the workaround I came up with is below. I am still getting duplicate game IDs in my final dataset, but at least I am looping through the entire desired set and getting all of them; then at the end I dedupe.

import requests, re, json
from bs4 import BeautifulSoup
import csv
import pandas as pd

years=['2015','2016','2017','2018']
weeks=['1','2','3','4','5','6','7','8','9','10','11','12','13','14']
data = pd.DataFrame(columns=['Teams','Link'])

all_links = {}  # persists across every year/week, so each data2 re-includes earlier games (hence the dedupe at the end)
for year in years:
    for i in weeks:
        r = requests.get(r'https://www.espn.com/college-football/scoreboard/_/year/'+ year + '/seasontype/2/week/'+i)
        soup = BeautifulSoup(r.text, 'html.parser')
        scripts_head = soup.find('head').find_all('script')
        for script in scripts_head:
            if 'window.espn.scoreboardData' in script.text:
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                for event in json_scoreboard['events']:
                    name = event['name']
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            gamecast = link['href']
                    all_links[name] = gamecast
                #Save data to dataframe
                data2=pd.DataFrame(list(all_links.items()),columns=['Teams','Link'])
                #Append new data to existing data        
                data=data.append(data2,ignore_index = True)


#Save dataframe with all links to csv for future use
data_test=data.drop_duplicates(keep='first')
data_test.to_csv(r'all_years_deduped.csv')
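An alternative that avoids the duplicates at the source would be to reset the links dict for each page and collect one frame per request, so no dedupe pass is strictly needed. A minimal sketch, untested against ESPN's current markup and assuming the page still embeds `window.espn.scoreboardData` as above:

    import requests, re, json
    import pandas as pd
    from bs4 import BeautifulSoup

    years = ['2015', '2016', '2017', '2018']
    weeks = [str(w) for w in range(1, 15)]
    frames = []

    for year in years:
        for week in weeks:
            url = ('https://www.espn.com/college-football/scoreboard/_/year/'
                   + year + '/seasontype/2/week/' + week)
            r = requests.get(url)
            soup = BeautifulSoup(r.text, 'html.parser')
            for script in soup.find('head').find_all('script'):
                if 'window.espn.scoreboardData' not in script.text:
                    continue
                json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
                week_links = {}  # fresh dict per page, so nothing carries over between weeks
                for event in json_scoreboard['events']:
                    for link in event['links']:
                        if link['text'] == 'Gamecast':
                            week_links[event['name']] = link['href']
                frames.append(pd.DataFrame(list(week_links.items()), columns=['Teams', 'Link']))

    # One concat at the end; drop_duplicates kept only as a safety net
    data = pd.concat(frames, ignore_index=True).drop_duplicates(keep='first')
    data.to_csv(r'all_years_deduped.csv')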
user2355903