
So, I'm attempting to parse a table from a webpage using BeautifulSoup4. It is able to get the webpage and parse the content, but when I move on to looking for the table to put into a pandas DataFrame I get an attribute error: 'NoneType' object has no attribute 'find_all'

I tried this same process for another webpage and it was able to work just fine, and I'm just trying to figure out what I'm doing incorrectly here where one works and the other does not.

#Imports
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

#Load data
url = 'https://gisopendata.siouxfalls.org/datasets/7b0407feca3e4f47bfe54559b9c1dd5d_13/data'

#Get request
web_data = requests.get(url)

#Parse Content
soup = BeautifulSoup(web_data.text, 'lxml')
#print(soup.prettify())


table = soup.find('table', {'class':'table table-striped table-bordered table-hover'})

headers = []

for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
  • Data is dynamically pulled from a POST request to a different endpoint. – QHarr May 13 '21 at 02:35
  • Does this answer your question? [pandas read\_html ValueError: No tables found](https://stackoverflow.com/questions/53398785/pandas-read-html-valueerror-no-tables-found) – rpanai May 13 '21 at 03:41

2 Answers


Each table usually has a thead and a tbody (and possibly a tr) which you need to access before you can use find_all on the th tags.

If you check the html on the source page this is indeed the case, you have

<table class="table table-striped table-bordered table-hover" role="grid">
    <thead role="rowgroup">
      <tr role="row">
          <th id="ember123" class="ember-view">

So after the table tag, you have to access the thead tag, then the tr tag, and only then can you use find_all to gather all the th elements.

Can you try and see whether something like this works:

for i in table.find('thead').find('tr').find_all('th'):
    title = i.text.strip()
    headers.append(title)

The giveaway here is to inspect the data in the source page carefully: the AttributeError tells you that BeautifulSoup could not find a tag matching the instructions you specified, hence the NoneType reference.
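As a quick sanity check, you can guard against find returning None before chaining find_all. A minimal sketch with hypothetical HTML standing in for a page whose table is rendered by JavaScript (so the static HTML has no table at all):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal HTML with no <table>, mimicking what requests
# returns when the table is built client-side by JavaScript.
html = '<div><p>no table in the static HTML</p></div>'
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', {'class': 'table table-striped table-bordered table-hover'})

# find() returns None when nothing matches; guard before chaining
# find_all() instead of crashing with AttributeError.
if table is None:
    headers = []
    print('table not found in static HTML')
else:
    headers = [th.text.strip() for th in table.find_all('th')]
```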

mmasulli

Data is dynamically pulled from a POST request. However, the page shows you an API endpoint you can use. The following is one way you can make a request to that API and generate a dataframe from the response.

Simplest is to use the API with json specified as the output format:

import requests
import pandas as pd

r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?where=1%3D1&outFields=*&outSR=4326&f=json').json()
print(pd.DataFrame([i['attributes'] for i in r['features']]))
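To see how that list comprehension maps the response onto rows without hitting the network, here is a minimal sketch using a hypothetical stand-in for the ArcGIS JSON shape (a 'features' list of dicts, each with a flat 'attributes' dict):

```python
import pandas as pd

# Hypothetical two-feature stand-in for the ArcGIS REST json response;
# the real endpoint nests data the same way: features -> attributes.
r = {
    'features': [
        {'attributes': {'OBJECTID': 1, 'DESCRIP': 'Example site A'}},
        {'attributes': {'OBJECTID': 2, 'DESCRIP': 'Example site B'}},
    ]
}

# Each feature's flat 'attributes' dict becomes one DataFrame row.
df = pd.DataFrame([i['attributes'] for i in r['features']])
print(df.shape)  # (2, 2)
```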

Otherwise,

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://gis2.siouxfalls.org/arcgis/rest/services/Data/Community/MapServer/13/query?outFields=*&where=1%3D1')
soup = bs(r.content, 'lxml')
headers = ['OBJECTID', 'Id', 'DESCRIP', 'LOCATION', 'YEARBUILT', 'LOCAL_REGISTER', 'LOCAL_REG_DATE', 
           'NATIONAL_REGISTER', 'NATIONAL_REG_DATE', 'GlobalID', 'Shape_Length', 'Shape_Area']

data = {}

for header in headers:
    if header == 'OBJECTID':      
        data[header] = [i.next_sibling.next_sibling.text for i in soup.select(f'i:contains("{header}")')]
    else:
        data[header] = [i.next_sibling for i in soup.select(f'i:contains("{header}")')]

df = pd.DataFrame(zip(*data.values()), columns = headers)

print(df)
QHarr