Scraping PFR Football Data with Python for a Beginner

Question

background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon is lost on me, and despite trying to work out how to solve the issue, I can't figure it out.

specific issue: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I've got so far. It's from this tutorial, which uses a page from the same site, but one that only has a single table.

sample code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

#url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"

#html from the given url
html=urlopen(url)

# make soup object of html
soup = BeautifulSoup(html)

# we see that soup is a beautifulsoup object
type(soup) 

#
column_headers = [th.getText() for th in 
                  soup.findAll('table', {"id": "defense").findAll('th')]

column_headers #our column headers

attempts made: I realized that the tutorial's method would not work for me, so I attempted to change the soup.findAll portion to target the specific table. But I repeatedly get an error saying:

AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

When I change it to find, the error becomes:

AttributeError: 'NoneType' object has no attribute 'find'

I'll be absolutely honest that I have no idea what I'm doing or what these mean. I'd appreciate any help in figuring out how to target that data and then scrape it.

Thank you,

Does this answer your question? [Beautiful Soup: 'ResultSet' object has no attribute 'find\_all'?](https://stackoverflow.com/questions/24108507/beautiful-soup-resultset-object-has-no-attribute-find-all) — AMC, Mar 22 '20 at 22:38

score 0 · Answer 1 · answered Jan 12 '18 at 01:17


You're missing a "}" in the dict after the word "defense". Try the line below and see if it works.

column_headers = [th.getText() for th in soup.findAll('table', {"id": "defense"}).findAll('th')]


ZF007


Unfortunately that doesn't solve the issue; I'm still seeing the same error responses. – Edward Gorelik Jan 12 '18 at 01:23
Check Nathan's answer for the rest. He beat me to it ;-) – ZF007 Jan 12 '18 at 01:26

score 0 · Answer 2 · answered Jan 12 '18 at 01:17

First off, you want to use soup.find('table', {"id": "defense"}).findAll('th') - find one table, then find all of its 'th' tags.
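To make the difference concrete, here is a minimal sketch on a made-up HTML snippet (the tables and their contents are invented for illustration and are not taken from the PFR page). It uses `find_all` and `get_text`, which are the current bs4 spellings of `findAll` and `getText`:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, only for illustration
html = """
<table id="defense">
  <tr><th>Player</th><th>Int</th></tr>
</table>
<table id="other">
  <tr><th>Foo</th></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet (a list of tags); a list has no find_all of
# its own, which is what the first AttributeError is complaining about
tables = soup.find_all("table", {"id": "defense"})

# find returns a single Tag, or None when nothing matches; calling a method
# on that None is what produces the 'NoneType' error
table = soup.find("table", {"id": "defense"})
missing = soup.find("table", {"id": "no_such_table"})

# With a single Tag in hand, searching inside it works
column_headers = [th.get_text() for th in table.find_all("th")]
```

Checking the result of `find` for `None` before chaining another call is a cheap way to tell "the table isn't there" apart from a typo elsewhere.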

The other problem is that the table with id "defense" is commented out in the html on that page:

<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_defense">
  <table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense &amp; Fumbles Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

etc. I assume that JavaScript is un-hiding it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page as in this answer, look for one with id="defense" in it, and then feed the text of that comment into BeautifulSoup.

Like this:

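from bs4 import BeautifulSoup, Comment
# collect every HTML comment on the page
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
# pick the first comment that contains the defense table
defenseComment = next(c for c in comments if 'id="defense"' in c)
# parse the comment's text as its own document
defenseSoup = BeautifulSoup(str(defenseComment), "html.parser")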
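For anyone who wants to try the approach without hitting the site, here is a self-contained sketch of the same idea run against a small made-up page (the HTML, player name, and numbers are invented). It assumes the table sits inside an HTML comment, as described above, and uses `string=`, the newer name for the `text=` argument:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: the table lives inside an HTML comment
html = """
<div class="placeholder"></div>
<!--
<table id="defense">
  <thead><tr><th>Player</th><th>Int</th></tr></thead>
  <tbody><tr><th>Some Player</th><td>2</td></tr></tbody>
</table>
-->
"""

soup = BeautifulSoup(html, "html.parser")

# A normal search cannot see the table because it is inside a comment
table_before = soup.find("table", {"id": "defense"})

# Collect all comments, then keep the one mentioning the table id
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defense_comment = next((c for c in comments if 'id="defense"' in c), None)
if defense_comment is None:
    raise ValueError("no comment containing the defense table was found")

# Parse the comment text as its own little document
defense_soup = BeautifulSoup(str(defense_comment), "html.parser")
table = defense_soup.find("table", {"id": "defense"})
column_headers = [th.get_text() for th in table.find("thead").find_all("th")]
```

Passing a default of `None` to `next` avoids a bare `StopIteration` if the page layout changes and no matching comment exists.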

Hi, thanks for this response. So I ran what you said, and after checking what is in defenseSoup it gives a very hard to read text. I'm assuming that's because all the HTML was turned into text, right? My original plan was to turn this into a data frame with pandas using the instructions outlined in the original tutorial, but in this scenario it doesn't look like that would work. I tried running the original column_headers = soup.find statement on defenseSoup.find, but that's giving me the NoneType error with this output, so I'm unsure what my path would be from here. Any advice? – Edward Gorelik, Jan 12 '18 at 02:07
You're going to have to do some more work to turn an HTML table into a dataframe. At the very least you need to do something like `defenseSoup.findAll('tr')` to find all the rows and then, for each of those, `tr.findAll('td')` to get the cells. It takes some figuring out, but it's worth learning :) – Nathan Vērzemnieks, Jan 12 '18 at 05:12
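A rough sketch of that row-by-row step, again on invented table markup standing in for the text pulled out of the comment. It assumes the first cell of each body row is a `<th>` (a row label), so it collects both `th` and `td` cells; adjust this if the real table is laid out differently:

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; names and numbers are made up
table_html = """
<table id="defense">
  <thead><tr><th>Player</th><th>Int</th><th>FF</th></tr></thead>
  <tbody>
    <tr><th>Player A</th><td>2</td><td>1</td></tr>
    <tr><th>Player B</th><td>0</td><td>3</td></tr>
  </tbody>
</table>
"""

defense_soup = BeautifulSoup(table_html, "html.parser")
table = defense_soup.find("table", {"id": "defense"})

# Header cells from the table head
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# One list of cell strings per body row
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)
```

If pandas is installed, `pandas.DataFrame(rows, columns=headers)` should turn these lists into a data frame. Every value is still a string at this point, so numeric columns would need converting afterwards.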
