Suppose I have the following HTML table:
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
I'd like to convert this table to JSON, potentially in the following format:
data = [
{
Name: 'John',
Age: 28,
License: 'Y',
Amount: 12.30
},
{
Name: 'Kevin',
Age: 25,
License: 'Y',
Amount: 22.30
},
{
Name: 'Smith',
Age: 38,
License: 'Y',
Amount: 52.20
},
{
Name: 'Stewart',
Age: 21,
License: 'N',
Amount: 3.80
}
];
I've seen another example that does roughly the above, which I found here. However, there are a couple of things I can't get working with that answer:
- It is limited to two rows on the table. If I add an additional row, I get an error:
print(json.dumps(OrderedDict(table_data)))
ValueError: too many values to unpack (expected 2)
- The header rows of the table are not taken into account.
This is my code so far:
html_data = """
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
"""
from bs4 import BeautifulSoup
from collections import OrderedDict
import json
table_data = [[cell.text for cell in row("td")]
for row in BeautifulSoup(html_data, features="lxml")("tr")]
print(json.dumps(OrderedDict(table_data)))
But I'm getting the following error:
print(json.dumps(OrderedDict(table_data)))
ValueError: need more than 0 values to unpack
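As far as I can tell, both errors come from the same place: OrderedDict expects each item to be a (key, value) pair, but the comprehension yields an empty list for the header row (it has no td cells) and four-element lists for the data rows. A minimal sketch of what I think is needed instead is to pair each data row with the header names. The `headers` and `table_data` literals below are stand-ins for what BeautifulSoup would return for the table above, and `convert` is a hypothetical helper for turning numeric strings into numbers:

```python
import json

# Stand-ins (assumptions) for the parsed table: with BeautifulSoup the headers
# would come from the <th> cells of the first row, the rest from the <td> cells.
headers = ["Name", "Age", "License", "Amount"]
table_data = [
    [],  # the header <tr> has no <td>, so the comprehension yields an empty list
    ["John", "28", "Y", "12.30"],
    ["Kevin", "25", "Y", "22.30"],
    ["Smith", "38", "Y", "52.20"],
    ["Stewart", "21", "N", "3.80"],
]


def convert(text):
    """Return an int or float when the text parses as one, else the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


# Skip empty rows, then zip each row with the header names to get one dict per row.
records = [
    dict(zip(headers, (convert(cell) for cell in row)))
    for row in table_data
    if row
]

data = json.dumps(records)
print(data)
```

Note that json.dumps writes double-quoted keys and drops trailing zeros from floats, so the output will not match the format above character for character.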
EDIT: The answer below works perfectly when the HTML contains only one table. What if there are two tables? For example:
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>Rich</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
</body>
</html>
If I plug this into the code from that answer, only the first table appears in the JSON output.
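One possible way to handle several tables is to collect the rows per table and build one list of records for each. The sketch below swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs; `TableCollector` and `tables_to_records` are hypothetical names, and it assumes tables are not nested and that the first row of each table holds the headers. Cell values are left as strings here.

```python
import json
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("th", "td") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def tables_to_records(html):
    """Return one list of dicts per table, keyed by that table's first row."""
    parser = TableCollector()
    parser.feed(html)
    result = []
    for rows in parser.tables:
        if not rows:
            continue
        headers, *body = rows
        result.append([dict(zip(headers, row)) for row in body])
    return result


# Shortened version of the two-table document above.
html_doc = """
<h1>My Heading</h1>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>28</td></tr>
</table>
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Rich</td><td>28</td></tr>
</table>
"""

all_tables = tables_to_records(html_doc)
print(json.dumps(all_tables, indent=2))
```

With BeautifulSoup the same idea would presumably be a loop over every table element, rather than reading all tr rows of the document at once.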