I am trying to grab some information from a table on the site http://www.house.gov/representatives/. Specifically, I want to get information on representatives from the "Representative Directory By Last Name" tables. So far, I am able to download the HTML from the site and write it to a file, but when I use bs4 to parse it and select the specific tables I want, it only grabs the first row of each table.
I believe this is because each row of the HTML table contains an extra closing tag:
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>
That last </td> tag is somehow causing bs4 to not grab the rest of the rows. I tested this by manually deleting some of the extra tags, and I then got back all the rows, so I know the extra tag is the problem. Here is my Python code so far:
import bs4
import requests

# Download the page and save the raw bytes to disk.
res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
with open('HouseReps.html', 'wb') as f:
    for chunk in res.iter_content(100000):
        f.write(chunk)

# Reopen the saved file (now flushed and closed) and parse it.
with open('HouseReps.html') as f:
    soup = bs4.BeautifulSoup(f, 'html.parser')

table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)
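For reference, the manual deletion I tested could probably be scripted along these lines. This is only a rough sketch: the helper name `strip_duplicate_td_close`, the regex, and the `SAMPLE` markup (including the second, made-up row) are my own assumptions for illustration, not the real page. It relies on the fact that two consecutive closing `</td>` tags never occur in valid table markup, so collapsing them should be harmless.

```python
import re

# Hypothetical sample that mimics the malformed rows; not the real page.
SAMPLE = """
<table title="Representative Directory By Last Name">
<tr>
<td><a href="https://abraham.house.gov/">Abraham, Ralph</a></td>
<td>Louisiana 5th District</td>
<td>Agriculture<BR>Armed Services</td>
</td>
</tr>
<tr>
<td><a href="https://example.invalid/">Doe, Jane</a></td>
<td>Placeholder 1st District</td>
<td>Budget</td>
</td>
</tr>
</table>
"""


def strip_duplicate_td_close(html):
    """Collapse a '</td>' that is directly followed by another '</td>'."""
    return re.sub(r'</td>\s*</td>', '</td>', html, flags=re.IGNORECASE)


cleaned = strip_duplicate_td_close(SAMPLE)
print(cleaned)
```

The cleaned string could then be passed to `bs4.BeautifulSoup(cleaned, 'html.parser')` in place of the file object, but patching the markup with a regex feels fragile.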
I've also tried using prettify(), but that did not help either. Any ideas on how to clean up the HTML so I can use bs4 (or something else) to parse it and extract the tables I need?
Thanks!