This question is a follow-up to this answer. I am able to convert a single HTML table into JSON, but when there are multiple tables with different headers, the output does not match what I expect.
For example, consider the following HTML content:
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
<table>
<tr>
<th>Name2</th>
<th>Age2</th>
<th>License2</th>
<th>Amount2</th>
<th>Random</th>
</tr>
<tr>
<td>Rich</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
<td>2</td>
</tr>
<tr>
<td>Lou</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
<td>2</td>
</tr>
<tr>
<td>Harry</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
<td>2</td>
</tr>
<tr>
<td>Phil</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
<td>2</td>
</tr>
</table>
</body>
</html>
Notice that there are two tables with different headers, in addition to the heading and paragraph tags. I'd like to convert these tables into JSON. However, with my code below,
from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)
    print(json.dumps(table_data, indent=4))
I get the following output:
[
{
"Name": "John",
"Age": "28",
"License": "Y",
"Amount": "12.30"
},
{
"Name": "Kevin",
"Age": "25",
"License": "Y",
"Amount": "22.30"
},
{
"Name": "Smith",
"Age": "38",
"License": "Y",
"Amount": "52.20"
},
{
"Name": "Stewart",
"Age": "21",
"License": "N",
"Amount": "3.80"
},
{
"Name": "Rich",
"Age": "28",
"License": "Y",
"Amount": "12.30",
"Name2": "2"
},
{
"Name": "Lou",
"Age": "25",
"License": "Y",
"Amount": "22.30",
"Name2": "2"
},
{
"Name": "Harry",
"Age": "38",
"License": "Y",
"Amount": "52.20",
"Name2": "2"
},
{
"Name": "Phil",
"Age": "21",
"License": "N",
"Amount": "3.80",
"Name2": "2"
}
]
This output is incorrect: the two tables have different headers, yet the rows from the second table are keyed with the first table's headers. The last column of the second table is also wrong, since its value ends up under "Name2" instead of "Random". I'd expect the output to be:
[
{
"Name": "John",
"Age": "28",
"License": "Y",
"Amount": "12.30"
},
{
"Name": "Kevin",
"Age": "25",
"License": "Y",
"Amount": "22.30"
},
{
"Name": "Smith",
"Age": "38",
"License": "Y",
"Amount": "52.20"
},
{
"Name": "Stewart",
"Age": "21",
"License": "N",
"Amount": "3.80"
},
{
"Name2": "Rich",
"Age2": "28",
"License2": "Y",
"Amount2": "12.30",
"Random": "2"
},
{
"Name2": "Lou",
"Age2": "25",
"License2": "Y",
"Amount2": "22.30",
"Random": "2"
},
{
"Name2": "Harry",
"Age2": "38",
"License2": "Y",
"Amount2": "52.20",
"Random": "2"
},
{
"Name2": "Phil",
"Age2": "21",
"License2": "N",
"Amount2": "3.80",
"Random": "2"
}
]
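A likely cause, as far as I can tell, is that `fields` is created once before the loop over tables, so the second table's headers are appended after the first table's and the index `i` keeps pointing at the old names. The following is only a minimal sketch of rebuilding the header list per table. The function name `tables_to_records` and the shortened `sample` markup are hypothetical, `html.parser` is swapped in for `lxml` just to avoid the extra dependency, and it assumes the tables are not nested.

```python
import json
from bs4 import BeautifulSoup


def tables_to_records(html):
    """Return one dict per data row, keyed by the headers of that row's own table."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        # Rebuild the header list for every table so headers never leak across tables.
        fields = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # the header row has no <td> cells, so it is skipped
                records.append(dict(zip(fields, cells)))
    return records


# Shortened, hypothetical version of the markup above: two tables, different headers.
sample = """
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>28</td></tr>
</table>
<table>
<tr><th>Name2</th><th>Age2</th><th>Random</th></tr>
<tr><td>Rich</td><td>28</td><td>2</td></tr>
</table>
"""

print(json.dumps(tables_to_records(sample), indent=4))
```

With this shape each row is paired with the headers of the table it belongs to, so a different column count in the second table should no longer shift values onto the wrong keys.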