Using BeautifulSoup to extract a table in Python 3

Question

I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a header row and multiple data rows.

I followed the answer to this question, but it appears updates to Python (or BeautifulSoup) require adjustments since it was posted 8 years ago. I think I have that mostly solved (see below), but in addition, the original answer seems to stop just short of actually structuring the data, instead outputting a list of header-data pairs.

I'd like to use a similar solution because it seems really close to what I need. My data is already parsed using BeautifulSoup so I'm specifically asking for a solution using that package rather than Pandas.

Reproducible Example

Altered from original question by adding a second row, as my data has many rows.

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
  <tr valign="top" class="Failure">
     <td>109</td>
     <td>35</td>
     <td>82.01%</td>
     <td>12 ms</td>
     <td>2 ms</td>
     <td>923 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print(datasets)

The result is supposed to look like the following (though with multiple rows, I'm not sure precisely the structure).

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

But instead looks like:

[<zip object at 0x7fb06b5efdc0>, <zip object at 0x7fb06b5ef980>]

Attempted Solution

I tried using datasets.append(tuple(dataset)) in the existing for loop, which resulted in:

[(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))]

This is closer to the original answer's expected output, but obviously duplicates the pairs rather than creating a data table with headers and values. So I'm not sure what to do with the data from this point.

So this code works fine and correctly extracts the data. Now you just need to prepend a row with headers and convert each row in such a way that only the last element of each tuple in the form of `('Tests', '103')` remains, right? — ForceBru, Aug 12 '20 at 16:18
Yes, @ForceBru that sounds like what I'm trying to accomplish. — Roxanne Ready, Aug 12 '20 at 17:27

score 2 · Accepted Answer · answered Aug 12 '20 at 18:07

So you already have this:

datasets = [
  (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
  (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]

Here's how you can transform it. Assuming all rows are the same, you can extract headers from the first row:

headers_row = [hdr for hdr, data in datasets[0]]

Now, extract the second field of each tuple like ('Tests', '103') in each row:

processed_rows = [
  [data for hdr, data in row]
  for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]

Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.

A better solution may be to keep your original format and use csv.DictWriter.

Extract the headers into headers_row, as shown above.

Write the data:

import csv

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames= headers_row)

    writer.writeheader()

    for row in datasets: # your original data
        writer.writerow(dict(row))

Here dict(datasets[0]), for example, is:

{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}

Thank you for this detailed answer. Your inclusion of both the "processed_rows" method and the csv.DictWriter method really clarified what's going on with the code. The csv.DictWriter solution works beautifully. — Roxanne Ready, Aug 12 '20 at 19:36

Mike67 · Answer 2 · 2020-08-12T19:10:08.680

At the end, just convert your zip iterator to a list:

for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(list(dataset))  # process iterator to list

print(datasets)

Final Output:

[[('Tests', '103'), 
('Failures', '24'), 
('Success Rate', '76.70%'), 
('Average Time', '71 ms'), 
('Min Time', '0 ms'), 
('Max Time', '829 ms')], 

[('Tests', '109'), 
('Failures', '35'), 
('Success Rate', '82.01%'), 
('Average Time', '12 ms'), 
('Min Time', '2 ms'), 
('Max Time', '923 ms')]]

If you want to convert the dataset to a csv string, use this code:

# convert to csv string

hdrline = ','.join(e[0] for e in datasets[0]) + "\n"
data = ""
for rw in datasets:
    data += ','.join([e[1] for e in rw]) + "\n"
    
csvstr = hdrline + data

print(csvstr)

Output:

Tests,Failures,Success Rate,Average Time,Min Time,Max Time
103,24,76.70%,71 ms,0 ms,829 ms
109,35,82.01%,12 ms,2 ms,923 ms

How do I then process this into a csv with a header row and data rows? — Roxanne Ready, Aug 12 '20 at 17:57

Amael · Answer 3 · 2020-08-12T21:11:39.623

1

If you are using the standard csv module, then you don't need to associate values with their labels

You can do the following, assuming you have a csvwriter which can be obtained via https://docs.python.org/3.8/library/csv.html#csv.writer

import csv
...

with open('file.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile) # You may add options here to format your csv file as needed

    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    csvwriter.writerow(headings)

    for row in table.find_all("tr")[1:]:
        data = (td.get_text() for td in row.find_all("td"))
        csvwriter.writerow(data)

edited Aug 12 '20 at 21:11

answered Aug 12 '20 at 16:29

Amael

36
1
5

I am using ```csv``` but this code doesn't run. The syntax is ```csv.writer``` not ```csvwriter``` but making that adjustment results in other syntax errors. Can you be more explicit with how to make this work? – Roxanne Ready Aug 12 '20 at 17:59

Using BeautifulSoup to extract a table in Python 3

Reproducible Example

Attempted Solution

3 Answers3

Linked