I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a header row and multiple data rows.
I followed the answer to this question, but it was posted 8 years ago, and changes to Python (or BeautifulSoup) since then seem to require adjustments. I think I have that mostly solved (see below). In addition, the original answer stops just short of actually structuring the data: it outputs a list of header-data pairs instead.
I'd like to use a similar solution because it seems really close to what I need. My data is already parsed using BeautifulSoup, so I'm specifically asking for a solution using that package rather than Pandas.
Reproducible Example
Adapted from the original question by adding a second data row, since my data has many rows.
from bs4 import BeautifulSoup
html = """
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
<tr valign="top">
<th>Tests</th>
<th>Failures</th>
<th>Success Rate</th>
<th>Average Time</th>
<th>Min Time</th>
<th>Max Time</th>
</tr>
<tr valign="top" class="Failure">
<td>103</td>
<td>24</td>
<td>76.70%</td>
<td>71 ms</td>
<td>0 ms</td>
<td>829 ms</td>
</tr>
<tr valign="top" class="Failure">
<td>109</td>
<td>35</td>
<td>82.01%</td>
<td>12 ms</td>
<td>2 ms</td>
<td>923 ms</td>
</tr>
</table>"""
# Name the parser explicitly; otherwise bs4 warns that no parser was specified.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class":"details"})
# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]
datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)
print(datasets)
The result is supposed to look like the following (with multiple rows, I'm not sure precisely what the structure should be).
[[(u'Tests', u'103'),
(u'Failures', u'24'),
(u'Success Rate', u'76.70%'),
(u'Average Time', u'71 ms'),
(u'Min Time', u'0 ms'),
(u'Max Time', u'829 ms')]]
But instead looks like:
[<zip object at 0x7fb06b5efdc0>, <zip object at 0x7fb06b5ef980>]
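As far as I can tell, this comes from Python 3, where zip returns a lazy iterator rather than the list it returned in Python 2. A minimal sketch of that behaviour, using made-up stand-in values for one row of the table above (the names values, pairs, and materialised are just illustrative and not part of my real code):

```python
# Illustrative stand-ins for the first two cells of one table row.
headings = ["Tests", "Failures"]
values = ["103", "24"]

# In Python 3, zip() builds nothing up front; printing it shows the object.
pairs = zip(headings, values)
print(pairs)

# Consuming the iterator (here with list()) materialises the pairs.
materialised = list(zip(headings, values))
print(materialised)  # [('Tests', '103'), ('Failures', '24')]
```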
Attempted Solution
I tried using datasets.append(tuple(dataset)) in place of datasets.append(dataset) in the existing for loop, which resulted in:
[(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')),
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))]
This is closer to the original answer's expected output, but it repeats the header name in every pair rather than creating a data table with a single header row followed by rows of values. So I'm not sure what to do with the data from this point.
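To make the target concrete, here is a rough sketch of the shape I'm after, starting from the tuple output above. I'm not claiming this is the right or idiomatic way to do it. The names header, rows, buffer, and csv_text are my own placeholders, and the in-memory buffer only stands in for a real .csv file.

```python
import csv
import io

# Copied from the output above: one tuple of (heading, value) pairs per row.
datasets = [
    (("Tests", "103"), ("Failures", "24"), ("Success Rate", "76.70%"),
     ("Average Time", "71 ms"), ("Min Time", "0 ms"), ("Max Time", "829 ms")),
    (("Tests", "109"), ("Failures", "35"), ("Success Rate", "82.01%"),
     ("Average Time", "12 ms"), ("Min Time", "2 ms"), ("Max Time", "923 ms")),
]

# Take the header names once, from the first row only.
header = [name for name, _ in datasets[0]]

# Keep just the values for every row, dropping the repeated names.
rows = [[value for _, value in dataset] for dataset in datasets]

# Write a header row followed by the data rows. An in-memory buffer is
# used here; a real file opened with newline="" would work the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Something like csv.DictWriter fed with dict(dataset) for each row might be an alternative, but I haven't settled on either approach.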