So I have a few strings that I am pulling from IMDb's award pages:

<table><tr><td><big>Academy Awards, USA</big>        </td>      </tr>      <tr>        <th>Year</th><th>Result</th><th>Award</th><th>Category/Recipient(s)</th>      </tr>            <tr>        <td rowspan="11" align="center" valign="middle">          1978         </td>                          <td rowspan="7" align="center" valign="middle"><b>Won</b></td>                                                      <td rowspan="6" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Art Direction-Set Decoration                                John Barry                                              Norman Reynolds                                              Leslie Dilley                                              Roger Christian                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Costume Design                                John Mollo                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Film Editing                                
Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Music, Original Score                                John Williams                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      <small>                                                                      Derek Ball was not present at the awards ceremony.          </small>        </td>      </tr>                                                      <tr>                                            <td rowspan="1" align="center" valign="middle">Special Achievement Award</td>                                      <td valign="top">                                          Ben Burtt             (as Benjamin Burtt Jr.)                                          <small>                                                                      For sound effects. (For the creation of the alien, creature and robot voices.)          
</small>        </td>      </tr>                                                            <tr>                  <td rowspan="4" align="center" valign="middle"><b>Nominated</b></td>                                                      <td rowspan="4" align="center" valign="middle">Oscar</td>                                      <td valign="top">          Best Actor in a Supporting Role                                Alec Guinness                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Director                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Picture                                Gary Kurtz                                                      <small>                                                                                </small>        </td>      </tr>                                    <tr>                        <td valign="top">          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      <small>                                                                                </small>        </td>      </tr>                                                  <tr>        </tr></table>

I want to pull the headers (Year, Result, Award, and Category/Recipient) into a list, and then each of the columns into its own list. For example, using the Academy Awards table (page for reference: http://www.imdb.com/title/tt0076759/awards):

Columns = {"Year", "Result", "Award", "Category/Recipient"}
Years = {"1978", "1978", "1978", "1978", "1978", "1978", "1978"}
Results = {"Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Oscar", "Special Achievement Award"}
Categories/Recipients = {"Best Art Direction-Set Decoration (John Barry, Norman Reynolds, Leslie Dilley, Roger Christian)", "Best Costume Design (John Mollo)", "Best Effects, Visual Effects (John Stears, John Dykstra, Richard Edlund, Grant McCune, Robert Blalack)", "Best Film Editing (Paul Hirsch, Marcia Lucas, Richard Chew)", "Best Music, Original Score (John Williams)", "Best Sound (Don MacDougall, Ray West, Bob Minkler, Derek Ball)", "(Ben Burtt (as Benjamin Burtt Jr.))"}

As you can see, I removed the unnecessary spacing from the table and put all the names in parentheses. There are tags around all of the names, but I removed them (they can be kept in if that makes it easier to put the names in parentheses). Every list has the same number of items except for the Columns list.

Here is my current script so you know how I manipulate it already:

import shutil
import urllib2
import re
from lxml import etree

award_usock = urllib2.urlopen('http://www.imdb.com/title/tt0076759' + '/awards')
award_html = award_usock.read()
award_usock.close()
if "<big>" in award_html:
    for a_show in re.finditer('<big>',award_html):
        award_show_full_end = award_html.find('<td colspan="4">&nbsp;</td>',a_show.end())
        award_show_full = award_html[a_show.start():award_show_full_end]
        award_show_full = award_show_full.replace('\n','')
        # award_show_full = award_show_full.replace('  ','')
        award_show_full = award_show_full.replace('</a>','')
        award_show_full = award_show_full.replace('<br />','')
        award_show_full = re.sub('<a href="/name/[^>]*>',  '', award_show_full)
        award_show_full = re.sub('<a href="/title/[^>]*>',  '', award_show_full)
        for a_s_title in re.finditer('<a href="',award_show_full):
            award_title_loc = award_show_full.find('<a href="')
            award_title_end = award_show_full.find('">',award_title_loc+10)
            award_title_del = award_show_full[award_title_loc:award_title_end+2]
            award_show_full = award_show_full.replace(award_title_del,'')
        award_show_full = '<table><tr><td>' + award_show_full.replace('<br>','') + '</tr></table>'
        award_show_loc = award_html.find('>',a_show.end())
        award_show_end = award_html.find('</a></big>',a_show.end())
        award_show = award_html[award_show_loc+1:award_show_end]
        award_show_table = etree.XML(award_show_full)
        award_show_rows = iter(award_show_table)
        award_show_headers = [award_show_col.text for award_show_col in next(award_show_rows)]
        for award_show_row in award_show_rows:
            award_show_values = [award_show_col.text for award_show_col in award_show_row]
            print dict(zip(award_show_headers,award_show_values))

But this produces the following result:

{None: 'Year'}
{None: '          1978         '}
{None: '          Best Costume Design                                John Mollo                                                      '}
{None: '          Best Effects, Visual Effects                                John Stears                                              John Dykstra                                              Richard Edlund                                              Grant McCune                                              Robert Blalack                                                      '}
{None: '          Best Film Editing                                Paul Hirsch                                              Marcia Lucas                                              Richard Chew                                                      '}
{None: '          Best Music, Original Score                                John Williams                                                      '}
{None: '          Best Sound                                Don MacDougall                                              Ray West                                              Bob Minkler                                              Derek Ball                                                      '}
{None: 'Special Achievement Award'}
{None: None}
{None: '          Best Director                                George Lucas                                                      '}
{None: '          Best Picture                                Gary Kurtz                                                      '}
{None: '          Best Writing, Screenplay Written Directly for the Screen                                George Lucas                                                      '}
{}
Brian Tompsett - 汤莱恩
rjbogz
  • please post your regular expression – hago Aug 02 '13 at 01:47
  • FWIW, screen scraping IMDB violates their terms of service: "Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below." – tehsockz Aug 02 '13 at 01:49
  • Note that your hoped-for output isn't a bunch of lists, but a bunch of sets. And the set `{"1978", "1978", "1978", "1978", "1978", "1978", "1978"}` is exactly the same set as `{"1978"}`, while `{"Year", "Result", "Award", "Category/Recipient"}` is exactly the same set as `{"Award", "Category/Recipient", "Year", "Result"}`. (And yes, this means zipping two sets is generally not as useful as zipping two lists.) – abarnert Aug 02 '13 at 01:54
  • @abarnert - Thanks for the clarification! I am aware about the {"1978"} set, but there are some with multiple years, so keeping it at the same length would just simplify it when I need to zip later. – rjbogz Aug 02 '13 at 01:59
  • @user1911612: That's the whole point: You're _not_ keeping it at the same length. `len({"1978", "1978", "1978"})` is not 3, it's 1. (Plus, again, the order of sets is completely arbitrary, so if you're zipping together two sets, it will match an arbitrary member of the first to an arbitrary member of the second, and so on.) – abarnert Aug 02 '13 at 02:58
  • You may want to use an API to access IMDb, or perhaps OMDb/Freebase, instead of trying to scrape it. Besides not being illegal (as tehsockz pointed out), it's also going to be a whole lot easier. Just call `json.loads` on the results and you'll get objects in formats similar to what you're trying unsuccessfully to build… Or, for IMDb, you can download the plain-text database over FTP, which is also pretty easy to parse locally. – abarnert Aug 02 '13 at 03:02
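
The set-versus-list point raised in the comments can be checked directly. The following is a minimal sketch with made-up sample values (the variable names are illustrative, not from the question's script):

```python
# Curly braces build a set: duplicates collapse and order is not preserved.
years_set = {"1978", "1978", "1978"}

# Square brackets build a list: duplicates and order are kept.
years_list = ["1978", "1978", "1978"]
awards_list = ["Oscar", "Oscar", "Special Achievement Award"]

print(len(years_set))   # number of distinct values only
print(len(years_list))  # one entry per table row

# Zipping two lists pairs items by position, which is what row data needs.
pairs = list(zip(years_list, awards_list))
print(pairs)
```

Because lists keep one entry per row, zipping them reassembles the rows reliably, whereas zipping sets pairs arbitrary members.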

2 Answers

Parsing HTML with regular expressions is not a good idea; use a parser such as Beautiful Soup instead.

Óscar López
  • Just a quick question. I keep seeing people recommend Beautiful Soup. How does it compare to using lxml for xml parsing? – Vorticity Aug 02 '13 at 01:56
  • I have used beautiful soup before. This is but part of a quite massive script and I don't really want to change all of it. – rjbogz Aug 02 '13 at 01:56
  • @Vorticity that has been asked [before](http://stackoverflow.com/questions/4967103/beautifulsoup-and-lxml-html-what-to-prefer) :) – Óscar López Aug 02 '13 at 02:00
  • @user1911612: So you want to fix your code, but you don't want to change it? – Blender Aug 02 '13 at 02:06
  • @Blender - Good point, no I am all for changing it. Any ideas how to change it and make it work with beautiful soup though? – rjbogz Aug 02 '13 at 02:16
  • @Vorticity: `lxml` is a collection of multiple different APIs (ElementTree, ElementTree.iterparse, SAX, XPath, etc.) on top of the `libxml2` parser. BeautifulSoup (4.0 and later) can (and prefers to) use `lxml` as its parser, so it's just yet another API on top of the same functionality. So asking how they compare is a silly question. It depends on what you need to do, what style you want to write in, and what background knowledge you have. (The question Óscar linked to is about HTML, and about BS 3.x, which is actually a very different case…) – abarnert Aug 02 '13 at 03:10

You don't have to use an external library to parse an HTML table if you are using Python 3. You can use the built-in HTMLParser class from html.parser (it replaces SGMLParser from Python 2). I've written a simple class derived from HTMLParser; it is available in a GitHub repo. It simply keeps track of the current scope of a <td>, <tr> or <table> tag. The advantages over etree or BeautifulSoup are that it runs correctly on HTML that is not XML-compliant and that it needs no external libraries.

You can use that class (here named HTMLTableParser) in the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output is a list of 2D lists, one per table. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
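
The scope-tracking idea can also be sketched with the standard library alone. What follows is not the code from the repo mentioned above. It is a minimal, hypothetical `SimpleTableParser` that assumes a single, non-nested table with explicitly closed cells, and it additionally repeats `rowspan` cells so that every row ends up with the same number of columns. The sample HTML is a shortened, made-up stand-in for the awards table.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical sketch: collect table rows, repeating rowspan cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read
        self._span = 1      # rowspan of the cell currently being read
        self._pending = {}  # column index -> [text, rows still to fill]

    def _fill_spans(self):
        # Copy cells carried over from earlier rows into the current row.
        col = len(self._row)
        while col in self._pending:
            text, left = self._pending[col]
            self._row.append(text)
            if left == 1:
                del self._pending[col]
            else:
                self._pending[col] = [text, left - 1]
            col += 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._fill_spans()
            self._cell = []
            self._span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._cell).split())
            if self._span > 1:
                self._pending[len(self._row)] = [text, self._span - 1]
            self._row.append(text)
            self._cell = None
            self._fill_spans()
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip empty rows such as <tr></tr>
                self.rows.append(self._row)
            self._row = None


# Made-up sample shaped like the awards table (assumption, not real page data).
sample = """
<table>
  <tr><th>Year</th><th>Result</th><th>Award</th><th>Category</th></tr>
  <tr><td rowspan="3">1978</td><td rowspan="2"><b>Won</b></td>
      <td rowspan="2">Oscar</td><td>Best   Costume Design</td></tr>
  <tr><td>Best Sound</td></tr>
  <tr><td>Nominated</td><td>Oscar</td><td>Best Picture</td></tr>
</table>
"""

parser = SimpleTableParser()
parser.feed(sample)
headers, *body = parser.rows
# Transpose the rows to get one list per column.
years, results, awards, categories = (list(col) for col in zip(*body))
print(headers)
print(years, results, awards, categories)
```

Transposing with `zip(*body)` only works because the repeated `rowspan` cells make all rows the same length. Text inside nested tags such as `<small>` is kept as part of the cell, so notes would need to be stripped separately if they are unwanted.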
schmijos