1

I am a novice at Python and helping out on a school project. Any help is much appreciated. THANKS. I get an error when it gets to the year 2004 and 2003. And it is caused by the result_list list. The error is "ValueError: arrays must all be same length". How can I introduce code that fixes this. The scores are important....

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup
#from openpyxl.writer.excel import ExcelWriter
import openpyxl
#from openpyxl import load_workbook
import csv

year_id = ['2019','2018','2017','2016','2015','2014','2013','2012','2011','2010','2009','2008','2007','2006','2005','2004','2003']

i=0
while i <= len(year_id)-1: 
    url = 'https://lehighsports.com/sports/mens-soccer/schedule/' + str(year_id[i])
    lehigh = requests.get(url).text
    soup = BeautifulSoup(lehigh,'lxml')

    date_list = []
    for date in soup.find_all('div',class_="sidearm-schedule-game-opponent-date"):
        date_list.append(date.get_text(strip=True, separator=' '))

    name_list = []
    for name in soup.find_all('div',class_="sidearm-schedule-game-opponent-name"):
        name_list.append(name.get_text(strip=True, separator=' '))

    result_list = []
    for result in soup.find_all('div',class_="sidearm-schedule-game-result"):
        result_list.append(result.get_text(strip=True, separator=' '))

    opp_list = []
    for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
        opp_list.append(opp.get_text(strip=True, separator=' '))

    conf_list = []
    for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
        conf_list.append(conf.get_text(strip=True))

    dict = {'date':date_list,'opponent':name_list,'result':result_list,'list':opp_list,'conference':conf_list}
    df = pd.DataFrame(dict)

    workbook1 = openpyxl.load_workbook('lehigh.xlsx')
    writer = pd.ExcelWriter('lehigh.xlsx', engine='openpyxl') 
    writer.book = workbook1
    df.to_excel(writer, sheet_name=str(year_id[i]),index=False,startrow=0,startcol=0)
    writer.save()
    writer.close()

    i = i+1
Herman L
  • 165
  • 1
  • 7

2 Answers2

2

Code is updated:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import zip_longest

d = []
n = []
res = []
op = []
yr = []
with requests.Session() as req:
    for year in range(2003, 2020):
        print(f"Extracting Year# {year}")
        r = req.get(
            f"https://lehighsports.com/sports/mens-soccer/schedule/{year}")
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            for date in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-date flex-item-1'}):
                d.append(date.get_text(strip=True, separator=" "))
            for name in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-name'}):
                n.append(name.get_text(strip=True))
            for result in soup.findAll("div", {'class': 'sidearm-schedule-game-result'}):
                result = result.get_text(strip=True)
                res.append(result)
            if len(d) != len(res):
                res.append("None")
            for opp in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-text'}):
                op.append(opp.get_text(strip=True, separator=' '))
                yr.append(year)


data = []
for items in zip_longest(yr, d, n, res, op):
    data.append(items)

df = pd.DataFrame(data, columns=['Year', 'Date', 'Name', 'Result', 'Opponent']).to_excel(
    'lehigh.xlsx', index=False)

Output: check-online

  • this is amazing and even faster. you are brilliant. thanks so much. this awesome. – Herman L Jan 07 '20 at 19:26
  • of course happy to. Also, one more question how can I get the year into column 1 so I know from which year the score line is coming from. Your code is awesome - so clean and fast. thanks – Herman L Jan 07 '20 at 19:38
  • How do I upvote? Sorry never done that before (no-one's asled). Also, one more favor - is there a way to split the game result (sidearm-schedule-game-result text-italic) e.g. W, 2-0 into "W" and "2-0". Can't figure how to get each of those separately when "inspecting" the HTML. Thanks. Please let me know how to "up-vote". Been really helpful. You should start charging $$!! – Herman L Jan 08 '20 at 03:56
  • upvoted.. finally figured out it was the check mark! done. Also, one more favor - is there a way to split the game result (sidearm-schedule-game-result text-italic) e.g. W, 2-0 into "W" and "2-0". Can't figure how to get each of those separately when "inspecting" the HTML – Herman L Jan 08 '20 at 15:25
  • hey αԋɱҽԃ αмєяιcαη - hate to say this butI spot chekced the Georgetown score where Lehigh lost 2-1. In your code it is pullingin a win of 1-0. Something is wrong. chitown88 code works but that person puts it into separate tabs - I prefer yours in one tab. I can not figure out whats wrong with your code as the scores are wrong. sorry to inform you. tanks – Herman L Jan 08 '20 at 22:21
  • I see that the last few years the scores are shifted by 1 row hence loss to Georgetown 2-1 is in the row above it. Also, when I add your code res.append(result.get_text(strip=True, separator=' ').split(',')) it simply puts the whole thing between []. Sorry to be a pain. Thanks – Herman L Jan 08 '20 at 22:30
  • Hey Ahmed, sorry to bother you again but I can not figure out whats wrong in your code. It runs fine but I see that the last few years the scores are shifted by 1 row hence loss to Georgetown 2-1 is in the row above it - it shows up as a win 1-0!! . Also, when I add your code res.append(result.get_text(strip=True, separator=' ').split(',')) it simply puts the whole thing between []. Tryingot split out the W from the W, 1-0. Sorry to be a pain. Thanks – Herman L Jan 09 '20 at 15:38
2

A few things:

  1. You don't need to iterate through index. Just simply iterate through the list
  2. The reason you get error is the result list is a length of 23, while your other lists are length of 24. So you'll need to figure out how to deal with nulls, and deal with were they fall (they may not always be the last entry)

How I would do it, is I would grab each row, and then pull the data for that, as opposed to pulling each entity into a list. I then take all those rows on the site and create a table, and make a list of tables (1 table for each year). The way you can deal with the missing data is use try/except. I also added a little function (found here) that will take that list of tables and write them into separate excel sheets.

import requests
import pandas as pd
from pandas import ExcelWriter
from bs4 import BeautifulSoup
#from openpyxl.writer.excel import ExcelWriter
import openpyxl
#from openpyxl import load_workbook
import csv

year_id = ['2019','2018','2017','2016','2015','2014','2013','2012','2011','2010','2009','2008','2007','2006','2005','2004','2003']


results = []
for year in year_id: 
    url = 'https://lehighsports.com/sports/mens-soccer/schedule/' + year
    print (url)
    lehigh = requests.get(url).text
    soup = BeautifulSoup(lehigh,'lxml')

    rows = soup.find_all('div',class_="sidearm-schedule-game-row flex flex-wrap flex-align-center row")

    sheet = pd.DataFrame()
    for row in rows:
        date = row.find('div',class_="sidearm-schedule-game-opponent-date").text.strip()
        name = row.find('div',class_="sidearm-schedule-game-opponent-name").text.strip()
        opp = row.find('div',class_="sidearm-schedule-game-opponent-text").text.strip()
        conf = row.find('div',class_="sidearm-schedule-game-conference-conference").text.strip()

        try:
            result = row.find('div',class_="sidearm-schedule-game-result").text.strip()
        except:
            result = ''

        df = pd.DataFrame([[year,date,name,opp,conf,result]], columns=['year','date','opponent','list','conference','result'])
        sheet = sheet.append(df,sort=True).reset_index(drop=True)

    results.append(sheet)



def save_xls(list_dfs, xls_path):
    with ExcelWriter(xls_path) as writer:
        for n, df in enumerate(list_dfs):
            df.to_excel(writer,'%s' %year_id[n],index=False,)
        writer.save()

save_xls(results,'lehigh.xlsx')
chitown88
  • 27,527
  • 4
  • 30
  • 59
  • how can I get the year into column 1 in excel? – Herman L Jan 07 '20 at 19:36
  • Just include it in the dataframe. Solution edited above. Be sure to accept this solution. – chitown88 Jan 07 '20 at 21:54
  • accepted the solution. Thanks. Sorry to be a pain. How can I put them all into one sheet instead of their own tabs? Also, how can I get the score (e.g. T, 2-0) split out into T and 2-0. They are in between separate T, . Thanks chitown88. Really appreciate the help. – Herman L Jan 08 '20 at 22:39
  • No pain. Super easy modification to make. I’ll get them done within the hour and let you know when it’s done. – chitown88 Jan 09 '20 at 07:31
  • hi there. so this code works to split the score into 3 columns. so instead of: try: result = row.find('div',class_="sidearm-schedule-game-result").text.strip() except: result = '' I would like to incorporate this instead: result = re.findall(r'([A-Z]),\s+([\d-]+)\s*(.*)', row.select_one('.sidearm-schedule-game-result').get_text(strip=True, separator=' '))[0] Any idea how to do that?? Thanks very much – Herman L Jan 10 '20 at 01:44
  • @chitown88 Where would I throw in code to have the columns summed on each sheet? I'm assuming outside of the second ```for``` loop and before ```results.append(sheet)```? Here's what I'd like appended, more or less: ```last_row = ['Total'] + list(sheet.sum())[1:]``` ```df2 = pd.DataFrame(data=[last_row], columns=sheet.columns)``` – Pfalbaum Feb 05 '21 at 20:11
  • I’ll have to look at it a little later. Not near my laptop at the moment. – chitown88 Feb 05 '21 at 21:01