Iterate and remove HTML from list elements in Python

Question

DISCLAIMER: Total noob Python coder; just started Chapter 44 of "Learn Python the Hard Way", and I'm trying to do some side-projects on my own to supplement my learning.

I'm trying to write a script that serves as a back-end "admin interface" of sorts for me, allowing me to enter a URL that holds a team's football schedule and, from there, automagically extract that schedule and save it to a file to be accessed later.

I've been able to get as far as entering a URL in the Terminal, opening that URL, iterating over each line of its HTML, and removing enough of the HTML tags to end up with two separate lists holding what I want (at least in terms of the strings they contain): the list of games and the list of dates for those games. I save the two lists as HTML files so I can view them in a browser and confirm the data I've gotten.

NOTE: These files get their file names by parsing down the URL.

Here's an example URL I'm working with: www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php

The problem I face now is two-fold:

1) Removing all HTML from the two lists, so that the only thing that remains are the strings in their respective indexes. I've tried BeautifulSoup, but I've been banging my head against a wall with it for the past day, combing through StackOverflow and trying different methods.

No dice (user error, I'm positive).

2) Then, in the list that contains the dates, combining each set of two indexes (i.e. combine 0 & 1, 2 & 3, 4 & 5, etc.) into a single string in a single list index.

From there, I believe I've found a method to combine the two lists into a single list (there's a lesson in Learn Python the Hard Way that covers this I believe, as well as a lot here on StackOverflow), but these two are real blockers for me at the moment.

Here's the code I've written, including notes for each step and for the remaining steps, for which I have no working code yet:

# Import necessary modules

from urllib import urlopen
import sys
import urlparse

# Take user input to get the URL where schedule lives

team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name

file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:

name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".html"
name_final_s = name_after[0] + "sched" + ".html"

# Create an empty list to hold our HTML data:

team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:

for line in urlopen(team_url).readlines():
    if "tr"  in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:

for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:

with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:

with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Remove all HTML tags from the game list file:



# Remove all HTML tags from the schedule list file:


# Combine necessary strings from the schedule list:


# Combine the two lists into a single list:

Any help would be greatly appreciated!

UPDATE: 5/27/2015, 9:42AM PST

So I toyed around a bit with the HTMLParser, and I think I'm getting there. Here's the new code (still working with this URL: http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php):

# Import necessary modules

from HTMLParser import HTMLParser
from urllib import urlopen
import sys
import urlparse
import os

# Take user input to get the URL where schedule lives

team_url = raw_input("Insert the full URL of the team's schedule you'd like to parse: ")

# Parse the URL to grab the 'path' segment to whittle down and use as the file name

file_name = urlparse.urlsplit(team_url)

# Parse the URL to make the file name:

name_base = file_name.path
name_before = name_base.split("/")
name_almost = name_before[3]
name_after = name_almost.split(".")
name_final = name_after[0] + ".txt"
name_final_s = name_after[0] + "-dates" + ".txt"

# Create an empty list to hold our HTML data:

team_data = []
schedule_data = []

# Grab the HTML file to then be written & parsed down to just team names:

for line in urlopen(team_url).readlines():
    if "tr"  in line:
        if "a href=" in line:
            if "strong" in line:
                team_data.append(line.rstrip())

# Grab the HTML file to then be written & parsed down to just schedules:

for line in urlopen(team_url).readlines():
    if 'td class="cfb1"' in line:
        if "Buy" not in line:
            schedule_data.append(line.rstrip())
            # schedule_data[0::1] = [','.join(schedule_data[0::1])]

# Save team's game list file with contents of HTML:

with open(name_final, 'w') as fout:
    fout.write(str(team_data))

# Save team's schedule file with contents of HTML:

with open(name_final_s, 'w') as fout:
    fout.write(str(schedule_data))

# Create file name path from pre-determined directory and added string:

game_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final))
schedule_file = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s))

# Use Python's HTMLParser module via a MyHTMLParser subclass

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data :", data

# Create a game instance of HTMLParser:

game_parser = MyHTMLParser()


# Create a schedule instance of HTMLParser:

sched_parser = MyHTMLParser()


# Create function that opens and reads each line in a file:

def open_game():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final)).readlines()
    for x in run:
        game_parser.feed(x)

def open_sched():
    run = open(os.path.join('/Users/jmatthicks/Documents/' + name_final_s)).readlines()
    for x in run:
        sched_parser.feed(x)


open_game()
open_sched()


# Combine necessary strings from the schedule list:



# Combine the two lists into a single list:


# Save again as .txt files

# with open(name_final, 'w') as fout:
#   fout.write(str(team_data))
#   
# with open(name_final_s, 'w') as fout:
#   fout.write(str(schedule_data))

So now that I'm parsing through it, I just need to remove all HTML tags from the strings, so that only the opponents remain in one file and only the dates remain in the other.
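
One way to get from the printed parser events to plain strings is to collect what `handle_data` receives instead of printing it. The following is a minimal sketch, assuming the saved lines look roughly like the made-up samples below; `TextCollector` and `strip_tags` are hypothetical names, not part of `HTMLParser`:

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as used in this thread
except ImportError:
    from html.parser import HTMLParser     # Python 3


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps the text between tags, drops the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found outside of a tag.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of one HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    # Join the pieces, then collapse leftover whitespace to single spaces.
    return " ".join(" ".join(parser.chunks).split())


# Made-up lines shaped like the ones collected in team_data (an assumption).
sample = [
    '<td class="cfb2"><a href="/x.php"><strong>Ball State Cardinals</strong></a></td>',
    '<td class="cfb2"><strong>Open Date</strong></td>',
]
cleaned = [strip_tags(line) for line in sample]
print(cleaned)
```

In Python 2, entities such as `&amp;` are reported through `handle_entityref` rather than `handle_data`, so a name like Texas A&M would probably need that handler as well.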

I'll keep working on it and will post back here with results, if there isn't a solution provided in the meantime.

Thanks for all help and insight so far, this rookie appreciates it a lot.

An older question here deals with removing html tags from text. http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python — Christopher Ian Stern, May 26 '15 at 05:16
Can you provide the URL of the page or otherwise the two lists that you are able to get, plus the ideal outcome of the script ? — gl051, May 26 '15 at 20:58
Crap, my bad on that. Here's the URL, and I'll edit the OP to include this as well: www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php — alamo1836, May 27 '15 at 00:28
And thanks a ton Christopher, I'm going to try that, and will report back. — alamo1836, May 27 '15 at 03:41

score 0 · Answer 1 · answered May 29 '15 at 05:52

If you're curious how to use BeautifulSoup for this, here is a stab at part (1):

First make sure you have the right version installed:

$ pip install beautifulsoup4

In your Python shell:

from bs4 import BeautifulSoup
from urllib import urlopen
team_url = "http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php"
text = urlopen(team_url).read()
soup = BeautifulSoup(text)
table = soup.find('table', attrs={"class": "cfb-sch"})
data = []

for row in table.find_all('tr'):
    data.append([cell.text.strip() for cell in row.find_all('td')])

print data

# should print out something like:
#[[u'2015 Texas A&M Aggies Football Schedule'],
# [u'Date', u'', u'Opponent', u'Time/TV', u'Tickets'],
# [u'SaturdaySep. 5',
#  u'',
#  u'Arizona State Sun Devils \r\n      NRG Stadium, Houston, TX',
#  u'7:00 p.m. CT\r\nESPN network',
#  u'Buy\r\nTickets'],
# [u'SaturdaySep. 12',
#  u'',
#  u'Ball State Cardinals \r\n      Kyle Field, College Station, TX',
#  u'TBA',
#  u'Buy\r\nTickets'],
# ...

Thanks a TON, Eugene! I'm checking this out now, and will follow-up here; I really appreciate it! — alamo1836, May 29 '15 at 15:21
I just gave this one a shot, and I got a much better output than I was getting; thanks a ton man! [[u'2015 Texas A&M Aggies Football Schedule'], [u'Date', u'', u'Opponent', u'Time/TV', u'Tickets'], [u'SaturdaySep. 5', u'', u'Arizona State Sun Devils \r\n NRG Stadium, Houston, TX', u'7:00 p.m. CT\r\nESPN network', u'Buy\r\nTickets'], [u'SaturdaySep. 12', u'', u'Ball State Cardinals \r\n... I'm going to give the suggestion below a shot as well, as it looks like it cleans it up a bit more. But again, thanks a ton for your time and help; I appreciate it more than you know! — alamo1836, May 30 '15 at 23:29

score 0 · Accepted Answer · answered May 30 '15 at 06:57

With BeautifulSoup, this is pretty straightforward once you have looked at the page's HTML and identified the tags you need. Here is the code:

import urllib2
from bs4 import BeautifulSoup


def main():
    url = 'http://www.fbschedules.com/ncaa-15/sec/2015-texas-am-aggies-football-schedule.php'
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    table = soup.find("table",{"class" : "cfb-sch"})
    # Working on the teams
    teams_td = table.findAll("td",{"class" : "cfb2"})
    teams = []
    for t in teams_td:
        teams.append(t.text.split('\r\n')[0].strip())
    # Working on the dates
    dates_td = table.findAll("td",{"class" : "cfb1"})
    dates = []
    # In the HTML table, only one in every three cfb1 cells is the date
    for i in range(0,len(dates_td),3):
        dates.append(dates_td[i].text)

    # Print everything
    for s in zip(dates, teams):
        print s

if __name__ == '__main__':
    main()

When you run it, you should get this:

(u'SaturdaySep. 5', u'Arizona State Sun Devils')
(u'SaturdaySep. 12', u'Ball State Cardinals')
(u'SaturdaySep. 19', u'Nevada Wolf Pack')
(u'SaturdaySep. 26', u'at Arkansas Razorbacks')
(u'SaturdayOct. 3', u'Mississippi State Bulldogs')
(u'SaturdayOct. 10', u'Open Date')
(u'SaturdayOct. 17', u'Alabama Crimson Tide')
(u'SaturdayOct. 24', u'at Ole Miss Rebels')
(u'SaturdayOct. 31', u'South Carolina Gamecocks')
(u'SaturdayNov. 7', u'Auburn Tigers')
(u'SaturdayNov. 14', u'Western Carolina Catamounts')
(u'SaturdayNov. 21', u'at Vanderbilt Commodores')
(u'Saturday\r\n    Nov. 28', u'at LSU Tigers')
(u'SaturdayDec. 5', u'SEC Championship Game')

I hope this will help you.
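
The date strings in that output run the weekday straight into the month, and one of them contains a line break. A minimal sketch of one way to tidy them, assuming every date starts with a weekday name ending in "day" (`tidy_date` is a hypothetical helper, not part of BeautifulSoup):

```python
import re


def tidy_date(raw):
    """Hypothetical helper: put one space between weekday and month."""
    # Collapse line breaks and runs of spaces into single spaces.
    text = " ".join(raw.split())
    # Insert a space where "...day" is followed directly by a capital letter.
    return re.sub(r"(day)(?=[A-Z])", r"\1 ", text)


# Two of the strings from the output shown above.
samples = [u"SaturdaySep. 5", u"Saturday\r\n    Nov. 28"]
tidied = [tidy_date(s) for s in samples]
print(tidied)
```

For the two samples this gives "Saturday Sep. 5" and "Saturday Nov. 28".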

I forgot to say that you will need to do some cleaning on the date string to look a little bit prettier, but I will leave that to you :-) — gl051, May 30 '15 at 06:59
That worked beautifully, thank you very much! And it'll be good practice for me to try and figure out just how to do that, so that's what I'll get to work on. Thanks again, I really appreciate your help! — alamo1836, May 30 '15 at 23:35
