0

I would like to use BeautifulSoup to extract a table from a website and store it as structured data. The final output I require is something that can be exported to a .csv with a header row and multiple data rows.

I followed the answer to this question, but it appears updates to Python (or BeautifulSoup) require adjustments since it was posted 8 years ago. I think I have that mostly solved (see below), but in addition, the original answer seems to stop just short of actually structuring the data, instead outputting a list of header-data pairs.

I'd like to use a similar solution because it seems really close to what I need. My data is already parsed using BeautifulSoup so I'm specifically asking for a solution using that package rather than Pandas.

Reproducible Example

Altered from original question by adding a second row, as my data has many rows.

from bs4 import BeautifulSoup

html = """
  <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%">
    <tr valign="top">
      <th>Tests</th>
      <th>Failures</th>
      <th>Success Rate</th>
      <th>Average Time</th>
      <th>Min Time</th>
      <th>Max Time</th>
   </tr>
   <tr valign="top" class="Failure">
     <td>103</td>
     <td>24</td>
     <td>76.70%</td>
     <td>71 ms</td>
     <td>0 ms</td>
     <td>829 ms</td>
  </tr>
  <tr valign="top" class="Failure">
     <td>109</td>
     <td>35</td>
     <td>82.01%</td>
     <td>12 ms</td>
     <td>2 ms</td>
     <td>923 ms</td>
  </tr>
</table>"""

soup = BeautifulSoup(html)
table = soup.find("table", attrs={"class":"details"})

# The first tr contains the field names.
headings = [th.get_text() for th in table.find("tr").find_all("th")]

datasets = []
for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(dataset)

print(datasets)

The result is supposed to look like the following (though with multiple rows, I'm not sure precisely the structure).

[[(u'Tests', u'103'),
  (u'Failures', u'24'),
  (u'Success Rate', u'76.70%'),
  (u'Average Time', u'71 ms'),
  (u'Min Time', u'0 ms'),
  (u'Max Time', u'829 ms')]]

But instead looks like:

[<zip object at 0x7fb06b5efdc0>, <zip object at 0x7fb06b5ef980>]

Attempted Solution

I tried using datasets.append(tuple(dataset)) in the existing for loop, which resulted in:

[(('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
(('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))]

This is closer to the original answer's expected output, but obviously duplicates the pairs rather than creating a data table with headers and values. So I'm not sure what to do with the data from this point.

Roxanne Ready
  • 121
  • 1
  • 10
  • So this code works fine and correctly extracts the data. Now you just need to prepend a row with headers and convert each row in such a way that only the last element of each tuple in the form of `('Tests', '103')` remains, right? – ForceBru Aug 12 '20 at 16:18
  • Yes, @ForceBru that sounds like what I'm trying to accomplish. – Roxanne Ready Aug 12 '20 at 17:27

3 Answers3

2

So you already have this:

datasets = [
  (('Tests', '103'), ('Failures', '24'), ('Success Rate', '76.70%'), ('Average Time', '71 ms'), ('Min Time', '0 ms'), ('Max Time', '829 ms')), 
  (('Tests', '109'), ('Failures', '35'), ('Success Rate', '82.01%'), ('Average Time', '12 ms'), ('Min Time', '2 ms'), ('Max Time', '923 ms'))
]

Here's how you can transform it. Assuming all rows are the same, you can extract headers from the first row:

headers_row = [hdr for hdr, data in datasets[0]]

Now, extract the second field of each tuple like ('Tests', '103') in each row:

processed_rows = [
  [data for hdr, data in row]
  for row in datasets
]
# [['103', '24', '76.70%', '71 ms', '0 ms', '829 ms'], ['109', '35', '82.01%', '12 ms', '2 ms', '923 ms']]

Now you have the header row and a list of processed_rows. You can write them to a CSV file with the standard csv module.


A better solution may be to keep your original format and use csv.DictWriter.

  1. Extract the headers into headers_row, as shown above.

  2. Write the data:

    import csv
    
    with open('data.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames= headers_row)
    
        writer.writeheader()
    
        for row in datasets: # your original data
            writer.writerow(dict(row))
    

Here dict(datasets[0]), for example, is:

{'Tests': '103', 'Failures': '24', 'Success Rate': '76.70%', 'Average Time': '71 ms', 'Min Time': '0 ms', 'Max Time': '829 ms'}
ForceBru
  • 43,482
  • 10
  • 63
  • 98
  • Thank you for this detailed answer. Your inclusion of both the "processed_rows" method and the csv.DictWriter method really clarified what's going on with the code. The csv.DictWriter solution works beautifully. – Roxanne Ready Aug 12 '20 at 19:36
1

At the end, just convert your zip iterator to a list:

for row in table.find_all("tr")[1:]:
    dataset = zip(headings, (td.get_text() for td in row.find_all("td")))
    datasets.append(list(dataset))  # process iterator to list

print(datasets)

Final Output:

[[('Tests', '103'), 
('Failures', '24'), 
('Success Rate', '76.70%'), 
('Average Time', '71 ms'), 
('Min Time', '0 ms'), 
('Max Time', '829 ms')], 

[('Tests', '109'), 
('Failures', '35'), 
('Success Rate', '82.01%'), 
('Average Time', '12 ms'), 
('Min Time', '2 ms'), 
('Max Time', '923 ms')]]

If you want to convert the dataset to a csv string, use this code:

# convert to csv string

hdrline = ','.join(e[0] for e in datasets[0]) + "\n"
data = ""
for rw in datasets:
    data += ','.join([e[1] for e in rw]) + "\n"
    
csvstr = hdrline + data

print(csvstr)

Output:

Tests,Failures,Success Rate,Average Time,Min Time,Max Time
103,24,76.70%,71 ms,0 ms,829 ms
109,35,82.01%,12 ms,2 ms,923 ms
Mike67
  • 11,175
  • 2
  • 7
  • 15
1

If you are using the standard csv module, then you don't need to associate values with their labels

You can do the following, assuming you have a csvwriter which can be obtained via https://docs.python.org/3.8/library/csv.html#csv.writer

import csv
...

with open('file.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile) # You may add options here to format your csv file as needed

    headings = [th.get_text() for th in table.find("tr").find_all("th")]

    csvwriter.writerow(headings)

    for row in table.find_all("tr")[1:]:
        data = (td.get_text() for td in row.find_all("td"))
        csvwriter.writerow(data)
Amael
  • 36
  • 1
  • 5
  • I am using ```csv``` but this code doesn't run. The syntax is ```csv.writer``` not ```csvwriter``` but making that adjustment results in other syntax errors. Can you be more explicit with how to make this work? – Roxanne Ready Aug 12 '20 at 17:59