I've seen several examples online of how to convert HTML content to JSON, but I haven't been able to get a working result.

Suppose I have the following html_content:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>

As you can see, this contains a heading, paragraph and table elements. I am trying to convert the above to JSON and output the result to a separate file, with correct formatting. This is my code:

import sys
import json
jsonD = json.dumps(html_content, sort_keys=True, indent=4)

sys.stdout=open("output.json","w")
print (jsonD)
sys.stdout.close()

The result is:

"\n<html>\n\t<body>\n\t\t<h1>My Heading</h1>\n\t\t<p>Hello world</p>\n\t\t<table>\n\t\t\t<tr>\n\t\t\t\t<th>Name</th>\n\t\t\t\t<th>Age</th>\n\t\t\t\t<th>License</th>\n\t\t\t\t<th>Amount</th>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>John</td>\n\t\t\t\t<td>28</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>12.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Kevin</td>\n\t\t\t\t<td>25</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>22.30</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Smith</td>\n\t\t\t\t<td>38</td>\n\t\t\t\t<td>Y</td>\n\t\t\t\t<td>52.20</td>\n\t\t\t</tr>\n\t\t\t<tr>\n\t\t\t\t<td>Stewart</td>\n\t\t\t\t<td>21</td>\n\t\t\t\t<td>N</td>\n\t\t\t\t<td>3.80</td>\n\t\t\t</tr>\n\t\t</table>\n\t</body>\n</html>\n"

As you can see, the newline and tab characters show up as literal \n and \t escape sequences, and the whole document is written as one long string. How can I fix this so that the output is properly formatted JSON?

Adam
  • What output are you expecting? – Alex W Feb 19 '20 at 16:04
  • This might be a helpful example to look at: http://www.xavierdupre.fr/blog/2013-10-27_nojs.html – Zachary Blackwood Feb 19 '20 at 16:10
  • @ZacharyBlackwood I've seen this example, but how do you import the HTMLtoJSONParser module? – Adam Feb 19 '20 at 16:13
  • @AlexW Similar to the output I've posted, but without the "\n" and "\t" between each element. Instead, it should actually break onto a new line or indent, as the HTML is written. – Adam Feb 19 '20 at 16:14
  • @Adam In the case of that blog post, he actually created the HTMLtoJSONParser, it's not something he imported from somewhere else – Zachary Blackwood Feb 19 '20 at 16:15
  • @ZacharyBlackwood right of course. I'm not sure why I'm getting this error in the very first line though: 'class HTMLtoJSONParser(html.parser.HTMLParser): NameError: name 'html' is not defined' – Adam Feb 19 '20 at 16:19
  • Have you tried [this](https://stackoverflow.com/questions/1885181/how-to-un-escape-a-backslash-escaped-string) ? – Alex W Feb 19 '20 at 16:20
  • @Adam `import html.parser` should solve that – Zachary Blackwood Feb 19 '20 at 16:24

2 Answers

You first need to decide what you want your JSON output to look like. If you want the names to be the keys and the remaining cells of each row to be the values, I would do something like:

from bs4 import BeautifulSoup
import json

html_content = """
<table>
    <tr>
        <td>John</td>
        <td>28</td>
        <td>Y</td>
        <td>12.30</td>
    </tr>
    <tr>
        <td>Kevin</td>
        <td>25</td>
        <td>Y</td>
        <td>22.30</td>
    </tr>
    <tr>
        <td>Smith</td>
        <td>38</td>
        <td>Y</td>
        <td>52.20</td>
    </tr>
    <tr>
        <td>Stewart</td>
        <td>21</td>
        <td>N</td>
        <td>3.80</td>
    </tr>
</table>
<h1> hello world </h1>
<table>
    <tr>
        <td>Jack</td>
        <td>1</td>
    </tr>
    <tr>
        <td>Joe</td>
        <td>2</td>
    </tr>
    <tr>
        <td>Bill</td>
        <td>3</td>
    </tr>
    <tr>
        <td>Sam</td>
        <td>4</td>
    </tr>
</table>
"""

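# Collect the text of every <td> cell, one list per <tr> row, across all tables
html_content_parsed = [[cell.text for cell in row("td")]
                       for row in BeautifulSoup(html_content, features="html.parser")("tr")]

# Key each row by its first cell; skip rows without <td> cells (e.g. <th> header rows)
html_content_dictionary = {element[0]: element[1:] for element in html_content_parsed if element}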

print(json.dumps(html_content_dictionary, indent=4))

As you can see, this ignores non-table elements and puts the rows of all tables into a single JSON object.

You can try out the program here: https://repl.it/@Mandawi/htmltojson
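
If you need each table kept separate (the dictionary above merges the rows of every table into one object, and a repeated first cell would overwrite an earlier row), one option is to collect the tables first and convert them one at a time. The following is only a minimal sketch of that idea, using the standard library's html.parser in place of BeautifulSoup. The names TableParser, tables_to_records and save_tables are made up for this illustration, and the sketch assumes well-formed markup with explicit closing tags and no nested tables.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per <table>
        self._row = None    # cells of the row being read, or None
        self._cell = None   # (tag, text parts) of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = (tag, [])

    def handle_data(self, data):
        # Only text inside a <td>/<th> is kept; everything else is ignored
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            cell_tag, parts = self._cell
            self._row.append((cell_tag, "".join(parts).strip()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row, self._cell = None, None


def tables_to_records(html):
    """Return one list per table: dicts if the first row is all <th>, else lists."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    result = []
    for rows in parser.tables:
        rows = [row for row in rows if row]  # drop empty rows
        if rows and all(tag == "th" for tag, _ in rows[0]):
            header = [text for _, text in rows[0]]
            records = [dict(zip(header, [text for _, text in row])) for row in rows[1:]]
        else:
            records = [[text for _, text in row] for row in rows]
        result.append(records)
    return result


def save_tables(html, path):
    """Write the converted tables to a file as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(tables_to_records(html), fh, indent=4)
```

Calling save_tables(html_content, "output.json") would then write indented JSON with real line breaks, because json.dump receives lists and dictionaries rather than the raw HTML string.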

oamandawi
  • Thank you. I have seen the same response here: https://stackoverflow.com/a/59968204/3480297 but this doesn't work when there are multiple tables or different elements other than "table" in the html. Do you know how to resolve that? – Adam Feb 19 '20 at 16:23
  • Yes, same idea! – oamandawi Feb 19 '20 at 16:24
  • What if there are multiple tables in the html_content? That only displays the first table for me. – Adam Feb 19 '20 at 16:28
  • I don't know what you mean by elements other than table. Do you want to put these elements in json as well? If you don't, then this will simply ignore them. If you do, then parse them the way you want them to look in json. – oamandawi Feb 19 '20 at 16:32
  • Sorry, I think the other elements as you mentioned can be formatted. But if there are multiple tables, the JSON outputs only the first table. Could you try that and see if the same happens to you? – Adam Feb 19 '20 at 16:33
  • No, it does not. See here: https://repl.it/@Mandawi/htmltojson – oamandawi Feb 19 '20 at 16:39

There is a library to convert html to json here (full disclosure: I am the author of this library). This library can convert HTML to JSON and has a specific function to convert only HTML tables to JSON (you give it HTML and it will find all tables and convert them to JSON).

For your specific use-case you can install the html-to-json library (see instructions here) and then run this:

import html_to_json
s = '''<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
    </body>
</html>'''

html_to_json.convert_tables(s)

As you can see in the output below, the html-to-json library uses the <th> elements (if available) as the keys for the output JSON:

[
  [
    {
      "Name": "John",
      "Age": "28",
      "License": "Y",
      "Amount": "12.30"
    },
    {
      "Name": "Kevin",
      "Age": "25",
      "License": "Y",
      "Amount": "22.30"
    },
    {
      "Name": "Smith",
      "Age": "38",
      "License": "Y",
      "Amount": "52.20"
    },
    {
      "Name": "Stewart",
      "Age": "21",
      "License": "N",
      "Amount": "3.80"
    }
  ]
]

If you want to convert the entire HTML (and not just the table), you can replace html_to_json.convert_tables(s) with html_to_json.convert(s).
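
For comparison, the whole-document idea can also be sketched with only the standard library's html.parser. This is not the html-to-json implementation and it does not reproduce that library's output format. The TreeBuilder class, the html_to_tree function and the "tag" / "attrs" / "children" layout are assumptions chosen purely for illustration.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Build a nested dict/list structure that mirrors the element tree."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "document", "children": []}
        self._stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        if attrs:
            node["attrs"] = dict(attrs)
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost matching open element; ignore stray closing tags
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between elements
            self._stack[-1]["children"].append(text)


def html_to_tree(html):
    """Parse an HTML string and return the nested structure."""
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root
```

Passing the result to json.dumps(html_to_tree(s), indent=4) gives indented JSON that includes the heading and paragraph as well as the table cells, at the cost of a more verbose structure than the table-only output.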

Floyd