So I have a few strings that I am pulling from IMDb's award pages:
<table><tr><td><big>Academy Awards, USA</big> </td> </tr> <tr> <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th> </tr> <tr> <td rowspan="11" align="center" valign="middle"> 1978 </td> <td rowspan="7" align="center" valign="middle"><b>Won</b></td> <td rowspan="6" align="center" valign="middle">Oscar</td> <td valign="top"> Best Art Direction-Set Decoration John Barry Norman Reynolds Leslie Dilley Roger Christian <small> </small> </td> </tr> <tr> <td valign="top"> Best Costume Design John Mollo <small> </small> </td> </tr> <tr> <td valign="top"> Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack <small> </small> </td> </tr> <tr> <td valign="top"> Best Film Editing Paul Hirsch Marcia Lucas Richard Chew <small> </small> </td> </tr> <tr> <td valign="top"> Best Music, Original Score John Williams <small> </small> </td> </tr> <tr> <td valign="top"> Best Sound Don MacDougall Ray West Bob Minkler Derek Ball <small> Derek Ball was not present at the awards ceremony. </small> </td> </tr> <tr> <td rowspan="1" align="center" valign="middle">Special Achievement Award</td> <td valign="top"> Ben Burtt (as Benjamin Burtt Jr.) <small> For sound effects. (For the creation of the alien, creature and robot voices.) </small> </td> </tr> <tr> <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td> <td rowspan="4" align="center" valign="middle">Oscar</td> <td valign="top"> Best Actor in a Supporting Role Alec Guinness <small> </small> </td> </tr> <tr> <td valign="top"> Best Director George Lucas <small> </small> </td> </tr> <tr> <td valign="top"> Best Picture Gary Kurtz <small> </small> </td> </tr> <tr> <td valign="top"> Best Writing, Screenplay Written Directly for the Screen George Lucas <small> </small> </td> </tr> <tr> </tr></table>
I want to pull the headers (Year, Result, Award, and Category/Recipient) to a list and then each of the columns to their own list, respectively. For example (using the Academy Award table)(Website for reference: http://www.imdb.com/title/tt0076759/awards):
Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}
As you can see, I removed the unnecessary spacing from the table and put all the names in parenthesis. There are tags around all of the names, but I removed them (they can be kept in if it helps putting them in parenthesis easier). I also have the same number of items in each of the lists except for the Columns list.
Here is my current script so you know how I manipulate it already:
import shutil
import urllib2
import re
from lxml import etree
award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
for a_show in re.finditer('<big>',award_html):
award_show_full_end = award_html.find('<td colspan="4"> </td>',a_show.end())
award_show_full = award_html[a_show.start():award_show_full_end]
award_show_full = award_show_full.replace('\n','')
# award_show_full = award_show_full.replace(' ','')
award_show_full = award_show_full.replace('</a>','')
award_show_full = award_show_full.replace('<br />','')
award_show_full = re.sub('<a href="/name/[^>]*>', '', award_show_full)
award_show_full = re.sub('<a href="/title/[^>]*>', '', award_show_full)
for a_s_title in re.finditer('<a href="',award_show_full):
award_title_loc = award_show_full.find('<a href="')
award_title_end = award_show_full.find('">',award_title_loc+10)
award_title_del = award_show_full[award_title_loc:award_title_end+2]
award_show_full = award_show_full.replace(award_title_del,'')
award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
award_show_loc = award_html.find('>',a_show.end())
award_show_end = award_html.find('</a></big>',a_show.end())
award_show = award_html[award_show_loc+1:award_show_end]
award_show_table = etree.XML(award_show_full)
award_show_rows = iter(award_show_table)
award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
for award_show_row in award_show_rows:
award_show_values = [award_show_col.text for award_show_col in award_show_row]
print dict(zip(award_show_headers,award_show_values))
But this produces a result:
{None: 'Year'}
{None: ' 1978 '}
{None: ' Best Costume Design John Mollo '}
{None: ' Best Effects, Visual Effects John Stears John Dykstra Richard Edlund Grant McCune Robert Blalack '}
{None: ' Best Film Editing Paul Hirsch Marcia Lucas Richard Chew '}
{None: ' Best Music, Original Score John Williams '}
{None: ' Best Sound Don MacDougall Ray West Bob Minkler Derek Ball '}
{None: 'Special Achievement Award'}
{None: None}
{None: ' Best Director George Lucas '}
{None: ' Best Picture Gary Kurtz '}
{None: ' Best Writing, Screenplay Written Directly for the Screen George Lucas '}
{}