Extra HTML tag causing problems with bs4

Question

I am trying to grab some information from a table on the site http://www.house.gov/representatives/. Specifically, I want to get information on representatives from the "Representative Directory By Last Name" tables. So far, I am able to download the HTML from the site and write it to a file, but when I use bs4 to parse it and select the specific tables I want, it only returns the first row of each table.

This is because there is an extra tag in each row of the HTML table:

<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>

That last </td> tag is somehow causing bs4 to drop the rest of the rows. When I manually deleted some of the extra tags, I got all the rows back, so I know the extra tag is the problem. Here is my Python code so far:

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'html.parser')
table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

I've also tried using prettify(), but that did not help either. Any ideas on how to clean up the HTML so I can use bs4 (or something else) to parse and extract the tables I need?

Thanks!

Tiny.D · Accepted Answer · 2017-04-29T16:24:05.720

You could use the lxml parser (a separate package, e.g. pip install lxml) instead of html.parser in your code:

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
# Save the page; the with-block closes (and flushes) the file before it is re-read
with open('HouseReps.html', 'wb') as file:
    for chunk in res.iter_content(100000):
        file.write(chunk)
# Reopen as bytes so bs4 can detect the encoding itself
with open('HouseReps.html', 'rb') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')  # use the `lxml` parser instead of `html.parser`
# find_all is the current spelling of findAll
table = soup.find_all("table", {"title": "Representative Directory By Last Name"})
print(table[0])  # print first table

The output will show you the full first table with "title" = "Representative Directory By Last Name":

<table class="directory" title="Representative Directory By Last Name">
<colgroup>
<col class="name"></col>
<col class="dist2"></col>
<col class="part"></col>
<col class="room"></col>
<col class="phone2"></col>
<col class="comm2"></col>
</colgroup>
<thead>
<tr>
<th>Name</th>
<th>District</th>
<th>Party</th>
<th>Room</th>
<th>Phone</th>
<th>Committee Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="http://adams.house.gov">
Adams, Alma </a>
</td>
<td>North Carolina 12th District</td>
<td>D</td>
<td>222 CHOB</td>
<td>202-225-1510</td>
<td>Agriculture<br/>Education and the Workforce<br/>Small Business</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://aguilar.house.gov/">
Aguilar, Pete </a>
</td>
<td>California 31st District</td>
<td>D</td>
<td>1223 LHOB</td>
<td>202-225-3201</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="http://allen.house.gov">
Allen, Rick </a>
</td>
<td>Georgia 12th District</td>
<td>R</td>
<td>426 CHOB</td>
<td>202-225-2823</td>
<td>Agriculture<br/>Education and the Workforce</td>
</tr>
<tr>
<td><a href="https://amash.house.gov/">
Amash, Justin </a>
</td>
<td>Michigan 3rd District</td>
<td>R</td>
<td>114 CHOB</td>
<td>202-225-3831</td>
<td>Oversight and Government</td>
</tr>
<tr>
<td><a href="https://amodei.house.gov">
Amodei, Mark </a>
</td>
<td>Nevada 2nd District</td>
<td>R</td>
<td>332 CHOB</td>
<td>202-225-6155</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://arrington.house.gov">
Arrington, Jodey  </a>
</td>
<td>Texas 19th District</td>
<td>R</td>
<td>1029 LHOB</td>
<td>202-225-4005</td>
<td>Agriculture<br/>the Budget<br/>Veterans' Affairs</td>
</tr>
</tbody>
</table>
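
Once the table parses completely, the rows can be turned into plain records. Here is a minimal sketch, assuming bs4 and lxml are installed. SAMPLE_HTML is a trimmed inline copy of the markup from the question (stray </td> included) rather than the live page, and table_to_records is a hypothetical helper name, not part of bs4:

```python
import bs4

# Trimmed inline sample mirroring the malformed markup from the question
# (note the stray </td> before each </tr>); it is not fetched from the site.
SAMPLE_HTML = """
<table title="Representative Directory By Last Name">
<thead><tr><th>Name</th><th>District</th><th>Party</th></tr></thead>
<tbody>
<tr><td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td><td>Louisiana 5th District</td><td>R</td></td></tr>
<tr><td><a href="http://adams.house.gov">
Adams, Alma </a>
</td><td>North Carolina 12th District</td><td>D</td></td></tr>
</tbody>
</table>
"""


def table_to_records(table):
    """Return one dict per body row, keyed by the header cell text."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for row in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records


soup = bs4.BeautifulSoup(SAMPLE_HTML, "lxml")
table = soup.find("table", {"title": "Representative Directory By Last Name"})
records = table_to_records(table)
for record in records:
    print(record)
```

The same helper could be applied to each element of the list returned by find_all in the code above, one table per letter of the alphabet.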

Thanks, that worked! What is the difference between these parsers? Is it generally a better idea to use the lxml parser? — sparks11, Apr 30 '17 at 14:00
For the differences, you can refer to this answer, which gives more detail: http://stackoverflow.com/questions/25714417/beautiful-soup-and-table-scraping-lxml-vs-html-parser — Tiny.D, Apr 30 '17 at 15:01
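
To see how the parsers differ on this kind of markup, a rough sketch like the following runs the same malformed snippet through whichever parsers happen to be installed and counts the rows each one keeps inside the table. SAMPLE_HTML and count_rows are made up for illustration, and the results may depend on the installed bs4 and parser versions, so no expected output is shown:

```python
import bs4

# Two rows, each ending with the stray </td> described in the question.
SAMPLE_HTML = """
<table title="Representative Directory By Last Name">
<tr><td>Abraham, Ralph</td><td>R</td></td></tr>
<tr><td>Adams, Alma</td><td>D</td></td></tr>
</table>
"""


def count_rows(html, parser):
    """Count the <tr> elements that end up inside the first <table>."""
    soup = bs4.BeautifulSoup(html, parser)
    table = soup.find("table")
    return len(table.find_all("tr")) if table else 0


row_counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        row_counts[parser] = count_rows(SAMPLE_HTML, parser)
    except bs4.FeatureNotFound:
        # This parser is not installed; skip it.
        continue

print(row_counts)
```

A parser that reports fewer than two rows here has closed the table early because of the unmatched end tag, which is the symptom described in the question.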
