This question is a follow-up to this answer. I am able to convert a single HTML table into JSON, but when there are multiple tables with different headers, the results do not match.

For example, consider the following HTML content:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name2</th>
                <th>Age2</th>
                <th>License2</th>
                <th>Amount2</th>
                <th>Random</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Lou</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Harry</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Phil</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
                <td>2</td>
            </tr>
        </table>
    </body>
</html>

Notice that there are two tables with different headers, in addition to the heading and paragraph tags. I'd like to convert both tables into JSON. However, with my code below,

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))

I get the following output:

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name": "Rich",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30",
        "Name2": "2"
    },
    {
        "Name": "Lou",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30",
        "Name2": "2"
    },
    {
        "Name": "Harry",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20",
        "Name2": "2"
    },
    {
        "Name": "Phil",
        "Age": "21",
        "License": "N",
        "Amount": "3.80",
        "Name2": "2"
    }
]

The output is incorrect: the two tables have different header columns, yet the rows from the second table are keyed with the first table's headers. The last column of the second table is also wrong, as its value ends up under "Name2" instead of "Random". I'd expect the output to be:

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name2": "Rich",
        "Age2": "28",
        "License2": "Y",
        "Amount2": "12.30",
        "Random": "2"
    },
    {
        "Name2": "Lou",
        "Age2": "25",
        "License2": "Y",
        "Amount2": "22.30",
        "Random": "2"
    },
    {
        "Name2": "Harry",
        "Age2": "38",
        "License2": "Y",
        "Amount2": "52.20",
        "Random": "2"
    },
    {
        "Name2": "Phil",
        "Age2": "21",
        "License2": "N",
        "Amount2": "3.80",
        "Random": "2"
    }
]
Adam

2 Answers

I had to clear the "th" fields list at the start of each table iteration, so that headers from earlier tables don't carry over:

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        fields.clear()
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))
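
To see why the clear matters, here is a minimal sketch that leaves BeautifulSoup out and works on plain lists instead. The lists below are copied by hand from the tables in the question, and the names row, datum and fixed are only for illustration:

```python
# Headers as they end up in the shared list when it is never reset:
# the first table's four headers, followed by the second table's five.
fields = ["Name", "Age", "License", "Amount"]
fields += ["Name2", "Age2", "License2", "Amount2", "Random"]

# Cell texts of the first data row of the second table.
row = ["Rich", "28", "Y", "12.30", "2"]

# Same lookup as in the question: i always counts from the start of fields,
# so the first table's headers are used and the fifth cell lands on "Name2".
datum = {fields[i]: cell for i, cell in enumerate(row)}
print(datum)
# {'Name': 'Rich', 'Age': '28', 'License': 'Y', 'Amount': '12.30', 'Name2': '2'}

# After clearing, the list holds only the current table's headers.
fields.clear()
fields += ["Name2", "Age2", "License2", "Amount2", "Random"]
fixed = {fields[i]: cell for i, cell in enumerate(row)}
print(fixed)
# {'Name2': 'Rich', 'Age2': '28', 'License2': 'Y', 'Amount2': '12.30', 'Random': '2'}
```

The first dictionary matches the wrong rows in the question's output, including the stray "Name2": "2" entry.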
Adam

The problem lies with the line

datum[fields[i]] = td.text

i is just the position of the cell within its row, and fields is declared once, outside the outer loop, so the headers of every table accumulate in it. fields[i] therefore always picks from the start of the list, where the first table's headers are. You need a separate fields list for each table, which you can get simply by moving the declaration of fields inside the outer loop, like so:

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    table_data = []
    for table in model.find_all("table"):
        fields = []
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))

This should produce the desired output.
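
As a side note, the index lookup can be avoided altogether by pairing each table's headers with each row's cells using zip. This is only a sketch: it assumes the header and cell texts have already been pulled out of the soup into plain lists, and the function name rows_to_records and the shortened sample data are made up for illustration.

```python
import json


def rows_to_records(headers, rows):
    """Pair each row's cells with the headers of the table it belongs to."""
    return [dict(zip(headers, row)) for row in rows]


# One (headers, rows) pair per table; a shortened copy of the question's data.
tables = [
    (["Name", "Age", "License", "Amount"],
     [["John", "28", "Y", "12.30"], ["Kevin", "25", "Y", "22.30"]]),
    (["Name2", "Age2", "License2", "Amount2", "Random"],
     [["Rich", "28", "Y", "12.30", "2"]]),
]

table_data = []
for headers, rows in tables:
    # headers is rebound on every iteration, so nothing carries over.
    table_data.extend(rows_to_records(headers, rows))

print(json.dumps(table_data, indent=4))
```

Because the headers travel together with their own rows, there is no shared list to reset. Note that zip stops at the shorter of its two inputs, so a row with more cells than headers is silently truncated, whereas fields[i] would raise an IndexError.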

dylan0d