Jamie Bull and PM 2Ring advised me to use the csv module for the output of my web scraper. I'm nearly done, but some parsed items contain two values separated by a colon or hyphen. I want each of those items split into two separate items in the current list.

Current output:

GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1,255,57,4.5,80,21,3.8,175,23-33,4.9,3,14,1,4,38.3,8,65,1,0
Sea,36,25,398,1,33:20,25,8,13,4,4-11,1-1,398,66,6.0,207,37,5.6,191,19-28,6.6,1,0,0,2,33.0,4,69,2,1

Desired output (each item that contained a colon or hyphen is now two items):

GB,16,19,255,1,26,40,19,13,4,2,6,12,0,1,255,57,4.5,80,21,3.8,175,23,33,4.9,3,14,1,4,38.3,8,65,1,0
Sea,36,25,398,1,33,20,25,8,13,4,4,11,1,1,398,66,6,207,37,5.6,191,19,28,6.6,1,0,0,2,33,4,69,2,1

I am unsure where or how to make these changes, and I don't know whether regex is needed. Obviously I could fix this by hand in Notepad or Excel, but my goal is to handle all of it in Python.

The results above come from running the program for the 2014 season, week 1.

import requests
import re
from bs4 import BeautifulSoup
import csv

year_entry = raw_input("Enter year: ")

week_entry = raw_input("Enter week number: ")

week_link = requests.get("http://sports.yahoo.com/nfl/scoreboard/?week=" + week_entry + "&phase=2&season=" + year_entry)

page_content = BeautifulSoup(week_link.content)

a_links = page_content.find_all('tr', {'class': 'game link'})

csvfile = open('NFL_2014.csv', 'a')

writer = csv.writer(csvfile)

for link in a_links:
    r = 'http://www.sports.yahoo.com' + str(link.attrs['data-url'])
    r_get = requests.get(r)
    soup = BeautifulSoup(r_get.content)
    stats = soup.find_all("td", {'class': 'stat-value'})
    teams = soup.find_all("th", {'class': 'stat-value'})
    scores = soup.find_all('dd', {"class": 'score'})

    try:
        away_game_stats = []
        home_game_stats = []
        # Use the last score element on the page
        game_score = scores[-1].text
        x = game_score.split(" ")
        away_score = x[1]
        home_score = x[4]
        away_team = teams[0]
        home_team = teams[1]
        # Even-indexed cells go to the away team, odd-indexed to the home team
        away_team_stats = stats[0::2]
        home_team_stats = stats[1::2]
        away_game_stats.append(away_team.text)
        away_game_stats.append(away_score)
        home_game_stats.append(home_team.text)
        home_game_stats.append(home_score)

        for stat in away_team_stats:
            # strip() removes surrounding whitespace; strip("") removed nothing
            text = stat.text.strip().encode('utf-8')
            away_game_stats.append(text)

        writer.writerow(away_game_stats)

        for stat in home_team_stats:
            text = stat.text.strip().encode('utf-8')
            home_game_stats.append(text)

        writer.writerow(home_game_stats)

    except:
        pass

csvfile.close()

Any help is greatly appreciated. This is my first program and searching this board has been a great resource.

Thanks,

JT

J.T.
  • As a side note: that except/pass is dangerous because it hides any type of error. See http://stackoverflow.com/questions/21553327/why-is-except-pass-a-bad-programming-practice – user2314737 Dec 11 '14 at 11:24
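One way to act on that comment is to catch only the error you expect, so anything else still shows up. The sketch below is an illustration, not the asker's code: it assumes the only failure worth tolerating is a score string with too few parts (an IndexError), and both the helper name `parse_scores` and the sample strings are made up to match the indices x[1] and x[4] used in the question.

```python
def parse_scores(game_score):
    """Return (away_score, home_score), or None if the text is too short."""
    parts = game_score.split(" ")
    try:
        return parts[1], parts[4]
    except IndexError:
        # Only the expected failure is handled here; any other error
        # (a typo, a network problem elsewhere) still raises normally.
        return None


print(parse_scores("GB 16 - Sea 36"))  # ('16', '36')
print(parse_scores("postponed"))       # None
```

With this shape, a game whose score cannot be parsed can be skipped explicitly instead of being swallowed together with every other kind of error.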

2 Answers

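import re
# test_string stands for one finished output line (shortened here)
test_string = "GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1"
print(re.sub(r"-|:", ",", test_string))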

See demo.

https://regex101.com/r/aQ3zJ3/2

vks
  • I used writer.writerow([re.sub(r"-|:",',',s)for s in home_game_stats]) which eliminated the colon and hyphen but now the item is grouped by quotation marks, making it still one item in the csv file instead of two separate items. – J.T. Dec 07 '14 at 16:24
  • @J.T. apply it on the whole line like `GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1,255,57,4.5,80,21,3.8,175,23-33,4.9,3,14,1,4,38.3,8,65,1,0 ` not on individual items. – vks Dec 07 '14 at 16:28
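The quoting J.T. describes comes from the csv module itself: a field that contains the delimiter is wrapped in quotes so that it still reads back as one field. Here is a minimal sketch of the difference between the two approaches, assuming Python 3 and an in-memory buffer in place of NFL_2014.csv. The shortened sample row and the helper name `to_csv_line` are made up for illustration.

```python
import csv
import io
import re


def to_csv_line(row):
    """Write one row with csv.writer and return the resulting text line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().rstrip("\n")


row = ["GB", "16", "26:40", "6-12"]

# Substituting inside each item leaves a comma inside a single field,
# so csv.writer quotes that field to keep it as one item.
per_item = to_csv_line([re.sub(r"-|:", ",", s) for s in row])

# Substituting on the finished line turns each separator into a real delimiter.
whole_line = re.sub(r"-|:", ",", to_csv_line(row))

print(per_item)    # GB,16,"26,40","6,12"
print(whole_line)  # GB,16,26,40,6,12
```

Note that substituting on the whole line also rewrites any hyphen or colon that is not a separator, for example in a hyphenated team name or a negative number, so it only fits data where those never occur.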

You can use a regular expression to split each string and then "flatten" the resulting list of lists. Because every value then becomes its own item, the csv writer no longer groups them in quotation marks:

Substitute

writer.writerow(away_game_stats)

with

away_game_stats = [re.split(r"-|:",x) for x in away_game_stats]
writer.writerow([x for y in away_game_stats for x in y])

(and same for writer.writerow(home_game_stats))
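The split-and-flatten step can also be tried on its own, away from the scraper. The sketch below assumes Python 3 and plain text strings (the question's Python 2 code holds encoded byte strings instead). The helper name `split_stats` and the shortened sample row are made up for illustration, and `itertools.chain.from_iterable` stands in for the nested list comprehension above.

```python
import re
from itertools import chain


def split_stats(row):
    """Split every item on '-' or ':' and return one flat list."""
    return list(chain.from_iterable(re.split(r"-|:", item) for item in row))


away_game_stats = ["GB", "16", "26:40", "6-12", "4.5"]
flat = split_stats(away_game_stats)
print(flat)  # ['GB', '16', '26', '40', '6', '12', '4.5']
```

Items without a separator pass through unchanged, since re.split returns a one-element list for them. A leading hyphen is treated as a separator too, so a value such as "-5" would come back as an empty string followed by "5".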

user2314737