2

Suppose I have the following HTML table:

<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>

I'd like to convert this table to JSON, potentially in the following format:

data= [
  { 
    Name: 'John',         
    Age: 28,
    License: 'Y',
    Amount: 12.30
  },
  { 
    Name: 'Kevin',         
    Age: 25,
    License: 'Y',
    Amount: 22.30
  },
  { 
    Name: 'Smith',         
    Age: 38,
    License: 'Y',
    Amount: 52.20
  },
  { 
    Name: 'Stewart',         
    Age: 21,
    License: 'N',
    Amount: 3.80
  }
];

I've seen another example that sort of does the above, which I found here. However, there are a couple of things that I can't get working given that answer. Those are:

  • It is limited to two rows on the table. If I add an additional row, I get an error:

print(json.dumps(OrderedDict(table_data))) ValueError: too many values to unpack (expected 2)

  • The header rows of the table are not taken into account.

This is my code so far:

html_data = """
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>License</th>
    <th>Amount</th>
  </tr>
  <tr>
    <td>John</td>
    <td>28</td>
    <td>Y</td>
    <td>12.30</td>
  </tr>
  <tr>
    <td>Kevin</td>
    <td>25</td>
    <td>Y</td>
    <td>22.30</td>
  </tr>
  <tr>
    <td>Smith</td>
    <td>38</td>
    <td>Y</td>
    <td>52.20</td>
  </tr>
  <tr>
    <td>Stewart</td>
    <td>21</td>
    <td>N</td>
    <td>3.80</td>
  </tr>
</table>
"""

from bs4 import BeautifulSoup
from collections import OrderedDict
import json

table_data = [[cell.text for cell in row("td")]
                         for row in BeautifulSoup(html_data, features="lxml")("tr")]

print(json.dumps(OrderedDict(table_data)))

But I'm getting the following error:

print(json.dumps(OrderedDict(table_data))) ValueError: need more than 0 values to unpack

EDIT The answer below works perfectly if there is only one table in the HTML. What if there are two tables? For example:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>

If I plug this in the below code, only the first table is shown as the JSON output.

Adam
  • 2,384
  • 7
  • 29
  • 66

2 Answers2

4

You can use soup.find_all:

from bs4 import BeautifulSoup as soup
s = soup(html, 'html.parser').table
h, [_, *d] = [i.text for i in s.tr.find_all('th')], [[i.text for i in b.find_all('td')] for b in s.find_all('tr')]
result = [dict(zip(h, i)) for i in d]

Output:

[{'Name': 'John', 'Age': '28', 'License': 'Y', 'Amount': '12.30'}, {'Name': 'Kevin', 'Age': '25', 'License': 'Y', 'Amount': '22.30'}, {'Name': 'Smith', 'Age': '38', 'License': 'Y', 'Amount': '52.20'}, {'Name': 'Stewart', 'Age': '21', 'License': 'N', 'Amount': '3.80'}]
Ajax1234
  • 69,937
  • 8
  • 61
  • 102
  • Thank you so much! This also does exactly what I wanted. I have only "approved" the other answer as the output was shown in a friendlier way. – Adam Jan 29 '20 at 14:27
3

This code does exactly what you want

from bs4 import BeautifulSoup
import json

xml_data = """
[[your xml data]]"""


if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for tr in model.table.find_all('tr', recursive=False):
        for th in tr.find_all('th', recursive=False):
            fields.append(th.text)
    for tr in model.table.find_all('tr', recursive=False):
        datum = {}
        for i, td in enumerate(tr.find_all('td', recursive=False)):
            datum[fields[i]] = td.text
        if datum:
            table_data.append(datum)

    print(json.dumps(table_data, indent=4))
Rugnar
  • 2,894
  • 3
  • 25
  • 29
  • How would you modify this if you had two different tables? If I put the two tables one after the other in the "xml_data" variable, it only returns the JSON output for the first table. – Adam Feb 19 '20 at 14:58
  • 1
    At the beginngin you do a loop `for table in model.find_all("table"):` and then replace every following occurene of `model` with `table` – Rugnar Feb 20 '20 at 18:40
  • Where do you mean "at the beginning"? Should this be directly before the first loop? Should it encompass the both for loops? I have edited my original post for more detail. Perhaps you could edit your answer directly? – Adam Feb 24 '20 at 13:25
  • I have created a new question here: https://stackoverflow.com/q/60379358/3480297 – Adam Feb 24 '20 at 15:41