Scraping with BeautifulSoup: want to scrape entire column including header and title rows

Question

I'm trying to get hold of the data under the columns whose code is "SVENYXX", where "XX" is the number that follows (e.g. 01, 02, etc.), on the site below using Python.

With the code below I can get the first row of data for all the columns I want. However, is there a way to include the column headers and row titles as well?

I know I already have the headers, but is there a way to include them in the output? Also, how could I include all the rows?

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for index, th in enumerate(headers):
    # th.string is None when a header contains nested tags; getText() always returns a string
    if 'SVENY' in th.getText():
        desired_columns.append(index)

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)

score 1 · Accepted Answer · answered Jun 10 '15 at 05:36

How's this?

I used th.getText() to store each column's name next to its index in desired_columns, and row.findNext('th') to read each row's title (the date).

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
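for index, th in enumerate(headers):
    # th.string is None when a header contains nested tags; getText() always returns a string
    column_name = th.getText()
    if 'SVENY' in column_name:
        desired_columns.append([index, column_name])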

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

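for row in rows[1:]:
    cells = row.findAll('td')
    # findNext('th') returns None when no <th> follows, so skip such rows
    row_header = row.findNext('th')
    if row_header is None:
        continue
    row_name = row_header.getText()
    for index, column_name in desired_columns:
        print(cells[index].text, row_name, column_name)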
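
If you would rather collect everything (header line, row titles and all rows) into one list instead of printing cell by cell, something along these lines might work. This is only a sketch under assumptions: `SAMPLE_HTML` is a made-up miniature of the table (only the 0.3487 value comes from the real output), `extract_columns` is a hypothetical helper name, and it assumes the first `tr` holds the column names and the first cell of each row is the row title.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real table; values are illustrative only.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>BETA0</th><th>SVENY01</th><th>SVENY02</th></tr>
  <tr><th>2015-06-05</th><td>3.9</td><td>0.3487</td><td>0.7412</td></tr>
  <tr><th>2015-06-04</th><td>3.8</td><td>0.3301</td><td>0.7005</td></tr>
</table>
"""


def extract_columns(html, prefix):
    """Return a header line plus one line per data row for matching columns."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = table.find_all("tr")
    # Read the names from the first row only, so each position lines up
    # with the cells (th and td together) of every later row.
    names = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    wanted = [i for i, name in enumerate(names) if name.startswith(prefix)]
    result = [[names[0]] + [names[i] for i in wanted]]
    for row in rows[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) < len(names):
            continue  # skip rows that do not have a cell for every column
        result.append([cells[0]] + [cells[i] for i in wanted])
    return result


table_rows = extract_columns(SAMPLE_HTML, "SVENY")
for line in table_rows:
    print(", ".join(line))
```

Counting `th` and `td` cells together keeps the header positions and the data positions in the same numbering, which is the main design choice here; the real page may need a different table index or row layout.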

answered Jun 10 '15 at 05:36 by double_j

Thanks. However, I tried the code you suggested and an Error pops up at `row_name = row.findNext('th').getText()` saying `'NoneType' object has no attribute 'getText'`? – user131983 Jun 10 '15 at 12:07
@user131983 That's strange.. Code works perfect for me, example: `0.3487 2015-06-05 SVENY01`. You have the code copied exactly from what I wrote? – double_j Jun 10 '15 at 12:13
Sorry, you're right. I just realized that I changed my code to use `urllib2` rather than `urllib`, as I am now running this on my laptop with Python 2.7. So now I do `url = "http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html"; content = urllib2.urlopen(url).read(); soup = BeautifulSoup(content)`. However, this results in the output `(u'0.3487', u'2015-06-05', u'SVENY01')`. Would you know why this is? Thanks! – user131983 Jun 10 '15 at 13:47
@user131983 Here's the answer to that: http://stackoverflow.com/questions/11279331/what-does-the-u-symbol-mean-in-front-of-string-values - The reason it's now showing is cause you're switching from Python 3.x (where it won't show) to 2.7 (where it is now showing). – double_j Jun 10 '15 at 13:55
Thanks a lot. One further question: did you get an `IndexError: list index out of range` for the line `cells[column[0]].text.encode('ascii', 'ignore'), row_name.encode('ascii', 'ignore'), column[1].encode('ascii', 'ignore')`? I appear to be getting that. – user131983 Jun 10 '15 at 14:11
I didn't get that... Did the code change at all beside you adding `.encode('ascii', 'ignore')` to each var? – double_j Jun 10 '15 at 16:27
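
The thread never pins down the cause of that `IndexError`. One possible explanation, which is an assumption and not something confirmed above, is a row with fewer `td` cells than the largest column index (for example an empty or spacer row). A small guard like the hypothetical `pick_cells` below, shown on plain lists of already-extracted cell text, would skip such rows instead of raising:

```python
def pick_cells(cells, column_indexes):
    """Return the cells at column_indexes, or None if the row is too short."""
    if any(i >= len(cells) for i in column_indexes):
        return None
    return [cells[i] for i in column_indexes]


# Illustrative rows: one complete, one empty, one too short.
rows = [
    ["0.3487", "0.7412", "1.1020"],
    [],
    ["0.3301"],
]
picked = [pick_cells(row, [0, 2]) for row in rows]
print(picked)
```

Rows that come back as `None` can then be logged or dropped before printing, which also makes it easier to see which rows of the page are the unexpected ones.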
