38

My local airport's website disgracefully blocks users without IE, and it looks awful. I want to write a Python script that fetches the contents of the Arrivals and Departures pages every few minutes and shows them in a more readable manner.

My tools of choice are mechanize, for tricking the site into believing I use IE, and BeautifulSoup, for parsing the page to get the flights data table.

Quite honestly, I got lost in the BeautifulSoup documentation, and can't understand how to get the table (whose title I know) from the entire document, and how to get a list of rows from that table.

Any ideas?

Brian Tompsett - 汤莱恩
Adam Matan

3 Answers

56

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

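from urllib.request import urlopen  # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup

html = urlopen(url).read()  # url is the address of the page to scrape
bs = BeautifulSoup(html, 'html.parser')
table = bs.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']=="Table1")
rows = table.find_all(lambda tag: tag.name=='tr')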
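For comparison, here is a minimal, self-contained sketch of the same lookup run on an inline HTML string. The markup, flight number, and time are made up for illustration; only the id "Table1" comes from the demo above. It also shows that the keyword form `find("table", id=...)` should match the same element as the lambda filter.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup will differ.
html = """
<table id="Other"><tr><td>ignored</td></tr></table>
<table id="Table1">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>LY001</td><td>10:35</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Lambda filter: called once per tag, returns True for the tag we want.
by_lambda = soup.find(lambda tag: tag.name == "table" and tag.get("id") == "Table1")

# Keyword filter: tag name plus an attribute value.
by_keyword = soup.find("table", id="Table1")

# Both calls return the same Tag object from the parsed tree.
rows = by_keyword.find_all("tr")
print(by_lambda is by_keyword)
print(len(rows))
```

The lambda form is mostly worth its extra length when the condition cannot be expressed as a plain attribute match.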
PiperWarrior
Ofri Raviv
  • 3
    You can even chain find commands inside the lambda (to better filter, since I had multiple tables but they didn't have IDs) ! `table = soup.find(lambda tag: tag.name=='table' and tag.find(lambda ttag: ttag.name=='th' and ttag.text=='Common Name'))` – nmz787 Dec 08 '17 at 20:04
  • 3
    FYI, "has_key" is now deprecated. Use has_attr("id") instead. I will edit the original response as well. – PiperWarrior Mar 10 '18 at 04:10
19

Here is a working example for a generic <table>. (It does not use your page, because that page needs JavaScript execution to load the table data.)

The example extracts the table data from the page GDP (Gross Domestic Product) by countries.

from bs4 import BeautifulSoup
html = ... # read your html with urllib/requests etc.
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package

htmltable = soup.find('table', { 'class' : 'table table-striped' })
# where the dictionary specifies unique attributes for the 'table' tag

The function below parses an HTML segment that starts with a <table> tag followed by multiple <tr> (table row) tags with inner <td> (table data) tags. It returns a list of rows, each a list of column values. <th> (table header) cells are read only from the first row.

def tableDataText(table):    
    """Parses an HTML segment that starts with a <table> tag followed
    by multiple <tr> (table row) tags with inner <td> (table data) tags.
    Returns a list of rows, each a list of column values.
    <th> (table header) cells are read only from the first row.
    """
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td') ) # data row       
    return rows

Using it, we get the following (first two rows shown):

list_table = tableDataText(htmltable)
list_table[:2]

[['Rank',
  'Name',
  "GDP (IMF '19)",
  "GDP (UN '16)",
  'GDP Per Capita',
  '2019 Population'],
 ['1',
  'United States',
  '21.41 trillion',
  '18.62 trillion',
  '$65,064',
  '329,064,917']]

This list can easily be turned into a pandas.DataFrame for more advanced manipulation.

import pandas as pd

dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)


imbr
18
from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")  # HTML holds the page source as a string

# the first argument to find tells it what tag to search for
# the second is an optional dict of attr->value pairs that filters
# the tags matching the first argument
table = soup.find("table", {"title": "TheTitle"})

rows = list(table.find_all("tr"))

# now rows contains each tr in the table (as a BeautifulSoup Tag object)
# and you can search them to pull out the times
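As a rough sketch of that last step, the rows can be reduced to plain text cell by cell. The table contents below are invented for illustration, and the assumption that the time sits in the last column is mine, not something known about the airport's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the title "TheTitle" comes from the snippet above.
html = """
<table title="TheTitle">
  <tr><th>Flight</th><th>From</th><th>Time</th></tr>
  <tr><td>LY001</td><td>New York</td><td>10:35</td></tr>
  <tr><td>BA163</td><td>London</td><td>11:10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"title": "TheTitle"})

flights = []
for row in table.find_all("tr"):
    # Collect the stripped text of every data cell in this row.
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields no data
        flights.append(cells)

# Assumption: the time is the last column of each data row.
times = [cells[-1] for cells in flights]
print(flights)
print(times)
```

If the real table mixes <th> and <td> cells in data rows, the `find_all("td")` call would need to be widened, for example to `find_all(["th", "td"])`.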
goggin13
  • 1
    Any ideas how to get to a specific table when there is no id or title to differentiate it? For example, I want the third table within the HTML file (there are no other indicators). – ihightower Jun 08 '12 at 12:11
  • 7
    @ihightower: `soup.find_all('table')[2]` would get you the third `table`. (You'd want to check the length before doing this though, just to be safe.) – hamstu Sep 13 '13 at 17:27
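Picking a table by position, with the length check suggested above, might look like the sketch below. The helper name `nth_table` and the sample markup are hypothetical. `find_all` returns a list-like result, so it can be indexed, whereas `find` returns a single tag.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three tables and no ids or titles.
html = """
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
<table><tr><td>third</td></tr></table>
"""

def nth_table(soup, index):
    """Return the table at zero-based position `index`, or None if absent."""
    tables = soup.find_all("table")
    return tables[index] if 0 <= index < len(tables) else None

soup = BeautifulSoup(html, "html.parser")
third = nth_table(soup, 2)    # third table in document order
missing = nth_table(soup, 5)  # out of range, so None instead of IndexError
print(third.get_text(strip=True))
print(missing)
```

Note that `find_all` searches recursively by default, so tables nested inside other tables are counted as well; positions can shift if the page layout changes.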