DISCLAIMER: Total noob Python coder; just started Chapter 44 of "Learn Python the Hard Way", and I'm trying to do some side-projects on my own to supplement my learning.
I'm trying to write a script that serves as a back-end "admin interface" of sorts for me, allowing me to enter a URL that holds a team's football schedule and from there automagically extract that schedule and save it to a file to be accessed later.
So far I can enter a URL in the Terminal, open it, iterate over each line of its HTML, and filter out enough markup to end up with the two things I want (at least in terms of the strings contained): the list of games and the list of dates for those games. They're stored in two separate lists, which I save as HTML files so I can view them in a browser and confirm the data I've gotten.
NOTE: These files get their file names by parsing down the URL.
Here's an example URL I'm working with: www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php
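For reference, here is a minimal sketch of the "parse down the URL" idea that takes the last path segment instead of a fixed index, so it does not depend on how many directories the URL has. The helper name base_name_from_url is made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
import posixpath

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2, as used in this script


def base_name_from_url(url):
    """Return the last path segment of a URL without its extension.

    Hypothetical helper, not part of the original script.
    """
    path = urlsplit(url).path
    last_segment = posixpath.basename(path)
    stem, _extension = posixpath.splitext(last_segment)
    return stem


url = "http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php"
name = base_name_from_url(url)
games_file = name + ".html"       # file for the list of games
dates_file = name + "sched.html"  # file for the list of dates
print(games_file)
print(dates_file)
```

This also works when the URL is typed without the http:// prefix, because the last path segment is the same either way.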
The problem I face now is two-fold:
1) Removing all HTML from the two lists, so that the only thing that remains are the strings in their respective indexes. I've tried BeautifulSoup, but I've been banging my head against a wall with it for the past day, combing through StackOverflow and trying different methods.
No dice (user error, I'm positive).
2) Then, in the list that contains the dates, combining each set of two indexes (i.e. combine 0 & 1, 2 & 3, 4 & 5, etc.) into a single string in a single list index.
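To make the second point concrete, this is roughly what I mean by combining each set of two indexes. It is only a sketch on made-up sample strings (the real list still has HTML in it), and join_pairs is a hypothetical helper name.

```python
def join_pairs(items, separator=" "):
    """Join items 0 & 1, 2 & 3, 4 & 5, ... into single strings.

    If the list has an odd length, the last item is kept on its own.
    """
    return [separator.join(items[i:i + 2]) for i in range(0, len(items), 2)]


# Made-up sample data standing in for the cleaned-up date strings
dates = ["Saturday", "Sep. 5", "Saturday", "Sep. 12"]
paired = join_pairs(dates)
print(paired)
```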
From there, I believe I've found a method to combine the two lists into a single list (there's a lesson in Learn Python the Hard Way that covers this, as well as a lot here on StackOverflow), but these two are real blockers for me at the moment.
Here's the code I've written so far, with comments for each step, including the remaining steps that I have no working code for yet:
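For the final combining step, the approach I have in mind is the built-in zip, sketched here on placeholder values (the team names and dates below are invented, not scraped data). It assumes both lists are already cleaned up and have the same length.

```python
# Placeholder values standing in for the two cleaned-up lists
opponents = ["Team A", "Team B"]
dates = ["Saturday Sep. 5", "Saturday Sep. 12"]

# zip pairs the items by position; list() makes the result a real list on Python 3
schedule = list(zip(dates, opponents))

# One line of text per game, ready to be written to a file
lines = ["%s: %s" % (date, opponent) for date, opponent in schedule]
for line in lines:
    print(line)
```

Note that zip stops at the shorter list, so a mismatch in lengths would silently drop games.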
# Import necessary modules
from urllib import urlopen
import sys
import urlparse
# Take user input to get the URL where schedule lives
team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")
# Parse the URL to grab the 'path' segment to whittle down and use as the file name
file_name = urlparse.urlsplit(team_url)
# Parse the URL to make the file name:
name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".html"
name_final_s = name_after[0] + "sched" + ".html"
# Create an empty list to hold our HTML data:
team_data = []
schedule_data = []
# Grab the HTML file to then be written & parsed down to just team names:
for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())
# Grab the HTML file to then be written & parsed down to just schedules:
for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
# schedule_data[0::1] = [','.join(schedule_data[0::1])]
# Save team's game list file with contents of HTML:
with open(name_final, 'w') as fout:
    fout.write(str(team_data))
# Save team's schedule file with contents of HTML:
with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))
# Remove all HTML tags from the game list file:
# Remove all HTML tags from the schedule list file:
# Combine necessary strings from the schedule list:
# Combine the two lists into a single list:
Any help would be greatly appreciated!
UPDATE: 5/27/2015, 9:42AM PST
So I toyed around a bit with the HTMLParser, and I think I'm getting there. Here's the new code (still working with this URL: http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php):
# Import necessary modules
from HTMLParser import HTMLParser
from urllib import urlopen
import sys
import urlparse
import os
# Take user input to get the URL where schedule lives
team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")
# Parse the URL to grab the 'path' segment to whittle down and use as the file name
file_name = urlparse.urlsplit(team_url)
# Parse the URL to make the file name:
name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".txt"
name_final_s = name_after[0] + "-dates" + ".txt"
# Create an empty list to hold our HTML data:
team_data = []
schedule_data = []
# Grab the HTML file to then be written & parsed down to just team names:
for line in urlopen(team_url).readlines():
    if "tr" in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())
# Grab the HTML file to then be written & parsed down to just schedules:
for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
# schedule_data[0::1] = [','.join(schedule_data[0::1])]
# Save team's game list file with contents of HTML:
with open(name_final, 'w') as fout:
    fout.write(str(team_data))
# Save team's schedule file with contents of HTML:
with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))
# Create file name path from pre-determined directory and added string:
game_file = open(os.path.join('/Users/jmatthicks/Documents/', name_final))
schedule_file = open(os.path.join('/Users/jmatthicks/Documents/', name_final_s))
# Subclass HTMLParser so it prints every start tag, end tag and piece of data it encounters:
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data :", data
# Create a game instance of HTMLParser:
game_parser = MyHTMLParser()
# Create a schedule instance of HTMLParser:
sched_parser = MyHTMLParser()
# Create function that opens and reads each line in a file:
def open_game():
    with open(os.path.join('/Users/jmatthicks/Documents/', name_final)) as run:
        for x in run:
            game_parser.feed(x)
def open_sched():
    with open(os.path.join('/Users/jmatthicks/Documents/', name_final_s)) as run:
        for x in run:
            sched_parser.feed(x)
open_game()
open_sched()
# Combine necessary strings from the schedule list:
# Combine the two lists into a single list:
# Save again as .txt files
# with open(name_final, 'w') as fout:
# fout.write(str(team_data))
#
# with open(name_final_s, 'w') as fout:
# fout.write(str(schedule_data))
So now I'm parsing through it; I just need to completely remove all HTML tags from the strings, so that only the opponents remain in one file and only the dates remain in the other.
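What I think I need is a parser that keeps the data and ignores the tags instead of printing everything. A minimal sketch of that idea, assuming the lists hold raw HTML fragments like the sample string below (the sample markup and the names TextExtractor and strip_tags are invented for illustration, and the import fallback is only there so it runs on both Python 2 and Python 3):

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in this script


class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Skip chunks that are only whitespace
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text content of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample fragment, shaped like one line of the saved game list
sample = '<td class="cfb1"><strong><a href="/team.php">Team A</a></strong></td>'
print(strip_tags(sample))
```

One caveat: on Python 2, entities such as &amp;amp; go to a separate handler (handle_entityref) rather than handle_data, so a name like "Texas A&amp;amp;M" would need extra handling there.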
I'll keep working on it and will post back here with results, if there isn't a solution provided in the meantime.
Thanks for all the help and insight so far; this rookie appreciates it a lot.