I've seen several examples online of how to convert HTML content to JSON, but I haven't been able to get a working result.
Suppose I have the following html_content:
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
</body>
</html>
As you can see, this contains a heading, a paragraph and a table. I am trying to convert the above to JSON and write the result to a separate file, correctly formatted. This is my code:
import json

# html_content holds the HTML shown above as a single string
jsonD = json.dumps(html_content, sort_keys=True, indent=4)

# Write the serialized result to output.json
with open("output.json", "w") as f:
    f.write(jsonD + "\n")
The result is:
"\n<html>\n\t<body>\n\t\t<h1>My Heading</h1>\n\t\t<p>Hello world</p>\n\t\t<table>\n\t\t\t<tr>\n\t\t\t\t<th>Name</th>\n\t\t\t\t<th>Age</th>\n\t\t\t\t<th>License</th>\n\t\t\t\t<th>Amount</th>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>John</td>\n\t\t\t\t<td>28</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>12.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Kevin</td>\n\t\t\t\t<td>25</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>22.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Smith</td>\n\t\t\t\t<td>38</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>52.20</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Stewart</td>\n\t\t\t\t<td>21</td>\n\t\t\t\t<td>N</td>\n\t\t\t\t<td>3.80</td>\n\t\t\t</tr>\n\t\t</table>\n\t</body>\n</html>\n"
As you can see, the newline and tab characters show up as literal \n and \t escape sequences, and the whole document comes out as one long string. How can I fix this so that the output is properly structured, correctly formatted JSON?
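For context on the behaviour above: when `json.dumps` is given a Python `str`, it emits a single JSON string literal, so the newlines and tabs have to be escaped and `indent` and `sort_keys` have nothing to act on. To get structured JSON, the HTML would first need to be parsed into dicts and lists. Below is a minimal sketch of that idea using only the standard library's `html.parser`. The `PageToDict` class and the `heading` / `paragraph` / `table` key names are hypothetical choices for illustration, the sample is trimmed to two data rows, and the sketch assumes one `h1`, one `p` and one simple table with a header row.

```python
import json
from html.parser import HTMLParser

# Trimmed-down stand-in for the html_content shown in the question.
html_content = """
<html><body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr><th>Name</th><th>Age</th><th>License</th><th>Amount</th></tr>
<tr><td>John</td><td>28</td><td>Y</td><td>12.30</td></tr>
<tr><td>Kevin</td><td>25</td><td>Y</td><td>22.30</td></tr>
</table>
</body></html>
"""


class PageToDict(HTMLParser):
    """Hypothetical helper: collects the h1, p and table cells into plain Python data."""

    def __init__(self):
        super().__init__()
        self.result = {"heading": None, "paragraph": None, "table": []}
        self._tag = None     # tag whose text is currently being read
        self._headers = []   # text of the <th> cells
        self._row = None     # text of the <td> cells in the current <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("h1", "p", "th", "td"):
            self._tag = tag

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return  # skip whitespace between tags
        if self._tag == "h1":
            self.result["heading"] = text
        elif self._tag == "p":
            self.result["paragraph"] = text
        elif self._tag == "th":
            self._headers.append(text)
        elif self._tag == "td":
            self._row.append(text)

    def handle_endtag(self, tag):
        if tag == "tr":
            # The header row has no <td> cells, so it leaves _row empty.
            if self._row:
                self.result["table"].append(dict(zip(self._headers, self._row)))
            self._row = None
        elif tag == self._tag:
            self._tag = None


parser = PageToDict()
parser.feed(html_content)
data = parser.result

# json.dump now receives a dict, so indent=4 produces nested, readable output.
with open("output.json", "w") as f:
    json.dump(data, f, indent=4)
```

In this sketch every cell value stays a string (for example `"28"` and `"12.30"`); converting `Age` and `Amount` to numbers would be an extra step. A third-party parser such as BeautifulSoup could replace the hand-written class if installing a dependency is acceptable.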