Unusual conversion from HTML to a JSON file

Question

I have to deliver a proof of concept (POC) that converts HTML content into a JSON file. The JSON must follow a specific format, and I can't figure out how to produce it. In particular, I haven't worked out how to format the child nodes as requested, so I need help with this.

This is the HTML content:

<body>
    <style>
        .myclass{padding-top:50px; left:0;}
    </style>

    <div id="maincontent">
      <div id="myid">
          <p class="myclass">
              This is a paragraph
          </p>
      </div>
    </div>
</body>

And this is the JSON format I need to generate from the HTML content:

"t" stands for "type", "s" stands for "style" and "h" stands for "html".

[
{
    "t": "s",
    "s": ".myclass{padding-top:50px; left:0;}"
},
{
    "t": "h",
    "h": "<div id='myid'><p class='myclass'>This is a paragraph</p></div>"
}]

At the moment the generated file looks like this. I need all the content in a single string rather than split into separate nodes.

{
    "t": "DIV",
    "content": [{
        "t": "DIV",
        "content": ["This is a paragraph"],
        "s": {
            "class": "myclass"
        }
    }]
}

Any help will be appreciated.

I second the above comment. I don't know what your end goal is, but wouldn't it make it easier if you used an actual JSON representation of the DOM nodes and their children, like this library does? [node-html-parser](https://www.npmjs.com/package/node-html-parser) — blex, Feb 14 '20 at 11:01
If the style were inside the paragraph, it would be placed inline in the "h" value along with the paragraph. At the moment I can save a .json file, but the format is not what I expect, because the childNodes are all split into separate tags and content. — Fernando Fas, Feb 14 '20 at 11:03
Ok. Also, do you do this in a browser, or server-side with NodeJS? That's a big difference. If it's in the browser, you can take advantage of it to traverse the DOM, otherwise, you'll want to use a DOM parsing library, since [Regex is not the best solution](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — blex, Feb 14 '20 at 11:07
I'm doing this with NodeJS: mapping the DOM elements, parsing them as a string and then passing that through a function that creates a list of childNodes and nodeValues. At the moment my file looks like this: { "t": "DIV", "content": [{ "t": "SPAN", "content": ["Watch how it works"], "s": {"class": "cta_l" }}], "s": {"id": "cta", "data-bind": "text:atomk_cta_1.Value" }}. ------> I need to convert it all into a single type and content. — Fernando Fas, Feb 14 '20 at 11:15

blex · Accepted Answer · 2020-02-20T18:22:03.863

Here is one way to do it, using the node-html-parser library and Array.prototype.reduce():

// Don't forget to `npm i -S node-html-parser`
const HTMLParser = require("node-html-parser");

const root = HTMLParser.parse(
  `<!DOCTYPE html>
              <html>
                <head> </head>
                <body>
                  <style>
                    .myclass {
                      padding-top: 50px;
                      left: 0;
                    }
                  </style>

                  <script>
                    var name = 'world';
                    console.log('Hello ' + name);
                  </script>

                  <div id="maincontent">
                    <div id="myid">
                      <p class="myclass">
                        This is a paragraph
                      </p>
                    </div>
                  </div>
                </body>
              </html>`,
  {
    style: true, // Keep styles
    script: true // Keep scripts
  }
).querySelector("body");

// Clean up whitespace text nodes
root.removeWhitespace();

const result = root.childNodes.reduce((res, node) => {
  // Get the last group (`h`, `s` or `j`)
  const previousGroup = res[res.length - 1];
  // Get the type and content for this node
  const { type, propertyName, content } = getProperties(node);

  // If previousGroup exists and it's of the same type
  if (previousGroup && previousGroup.t === type) {
    // Add the content to it
    previousGroup[propertyName] += content;
  } else {
    // Otherwise, create a new group
    res.push({ t: type, [propertyName]: content });
  }

  return res;
}, []);

function getProperties(node) {
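  // Lowercase the tag name: newer node-html-parser releases return it uppercased
  switch ((node.tagName || "").toLowerCase()) {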
    case "style":
      return { type: "s", propertyName: "s", content: node.rawText.replace(/\s+/g, " ") };
    case "script":
      return { type: "j", propertyName: "s", content: node.rawText };
    default:
      return { type: "h", propertyName: "h", content: node.innerHTML };
  }
}

console.log(result);
// Returns:
// [
//   { t: 's', s: '.myclass { padding-top: 50px; left: 0; }' },
//   { t: 'j', s: 'var name = \'world\';\n            console.log(\'Hello \' + name);' },
//   { t: 'h', h: '<div id="myid"><p class="myclass">This is a paragraph</p></div>' }
// ]

edited Feb 20 '20 at 18:22

answered Feb 14 '20 at 11:43

@Fernando Fas after posting this answer, I noticed you did not include the `maincontent` div in your example. So, do you actually not want any of the _top level_ nodes' outerHTML? Just their innerHTML? Also, if there is a ` – blex Feb 14 '20 at 11:51
Hi blex, I just need the innerHTML. The styles will always be grouped together at the top, as in your amazing code, followed by all the divs grouped together. I noticed that the content to parse is part of "const root". What if I need to parse the content of an external HTML file, targeting the body tag? – Fernando Fas Feb 14 '20 at 12:11
@FernandoFas Ok, I edited my answer a bit. It now uses `.querySelector('body')` to target the body inside other HTML. And it uses `innerHTML` instead of `toString`. I noticed a bug in the library, however, it keeps whitespaces in CSS. Does it fit your needs, though? – blex Feb 14 '20 at 12:47
Hi blex, I can remove the whitespace in the CSS myself, but I will be happy to use your solution if you want to post it. Thank you. – Fernando Fas Feb 14 '20 at 14:00
I did :) I edited the answer above to reflect these changes – blex Feb 14 '20 at 14:08
I'm just testing it out and cleaning up the \n and white spaces. Also, I'm trying to pull the content of an external html file. I will let you know how it goes in a bit. – Fernando Fas Feb 14 '20 at 14:24
Hi blex, I still can't get rid of the `\n`, `\` and whitespace in the JSON file. I guess I need to work a little more on that. If you have time, we could do this together. Also, I use `fs.writeFile` with `JSON.stringify`, and the file looks good apart from the `\n`, `\` and whitespace. We are getting there. I'm not an experienced NodeJS dev, but so far this is amazing progress for this POC. – Fernando Fas Feb 14 '20 at 15:33
I managed to remove the `\n` and multiple whitespaces from the CSS ([See edit](https://stackoverflow.com/posts/60225676/revisions), I added `.replace(/\s+/g, " ")` after `node.rawText`). But you don't have to remove the `\"`. They're just here so that quotes inside the string are not interpreted at the end of the string in JSON. If you read the value with JS or any other language, you won't actually see these `\\` – blex Feb 14 '20 at 16:39
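To illustrate the point about `\"` in the comment above, here is a minimal sketch of a round trip through `JSON.stringify` and `JSON.parse`. The `result` array is hypothetical sample data in the same shape as the answer's output, not the real parser output.

```javascript
// Hypothetical sample data in the same shape as the answer's `result` array
const result = [
  { t: "h", h: '<div id="myid"><p class="myclass">This is a paragraph</p></div>' }
];

// Serializing escapes the inner double quotes as \" so the JSON text stays valid
const json = JSON.stringify(result);
console.log(json);

// Parsing restores the original string; the backslashes are not part of the value
const parsed = JSON.parse(json);
console.log(parsed[0].h === result[0].h); // true
console.log(parsed[0].h.includes("\\")); // false
```

The backslashes exist only in the serialized file, so any consumer that parses the JSON gets the plain HTML string back.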
Morning blex, I have improved the code a bit, but now I have run into an issue with the types, which is getting more complicated because I need to add the type "j", the JavaScript type. I tried to add another const type, but I'm not getting the expected result. So, when it finds the " – Fernando Fas Feb 17 '20 at 11:16
Hello @FernandoFas I edited my answer to do this as well, using a `switch`. Note that I did not remove line breaks and spaces in JS content because it may alter the given scripts, if, for example, they use template literals with desired line breaks. But feel free to do it if that's not a problem for you – blex Feb 17 '20 at 11:38
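The risk described in the comment above can be shown with a small sketch. The `script` string and the `run` helper are hypothetical, and `new Function` is used only to evaluate trusted sample text for the demonstration.

```javascript
// Hypothetical script body that relies on a line break inside a template literal
const script = "return `line one\nline two`;";

// The same whitespace collapse the answer applies to CSS
const collapsed = script.replace(/\s+/g, " ");

// Hypothetical helper: evaluates trusted sample source and returns its value
const run = (src) => new Function(src)();

const original = run(script);
const altered = run(collapsed);

console.log(original.includes("\n")); // true
console.log(altered.includes("\n")); // false
```

Collapsing whitespace is safe for CSS, but here it changes the value the script produces, which is why the answer leaves script content untouched.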
Hi blex, I made some improvements to the function and it's working fine now. Thank you for your inputs and help on this one. – Fernando Fas Feb 19 '20 at 16:53
Some improvements to the function: `function getTypeAndContent(node) { switch (node.tagName) { case "style": return { type: 's', content: node.rawText.replace(/\s+/g, " ") }; case "script": return { type: 'j', content: Buffer.from(node.rawText).toString('base64') }; case "frames": return { type: 'f', content: node.rawText.replace(/\s+/g, " ") }; default: return { type: 'h', content: node.innerHTML }; } }` and then `fs.writeFile('./orca/json/orca.json', JSON.stringify(result, null, '\t'), err => { if (err) { console.error(err); return; } });` – Fernando Fas Feb 19 '20 at 17:11
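The base64 step in the comment above sidesteps the whitespace and escaping questions for scripts, because the encoded text contains no quotes or line breaks. A minimal sketch of the round trip, with `rawScript` as a hypothetical stand-in for `node.rawText`:

```javascript
// Hypothetical script text standing in for node.rawText
const rawScript = "var name = 'world';\nconsole.log('Hello ' + name);";

// Encode for storage in the JSON file
const encoded = Buffer.from(rawScript, "utf8").toString("base64");

// Decode on the consuming side to get the exact original text back
const decoded = Buffer.from(encoded, "base64").toString("utf8");

console.log(encoded);
console.log(decoded === rawScript); // true
```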
Hey blex, I'm sorry to bother you again. I'm wondering if you could help me with something in the functions. Do you think it's possible to make the switch statement use a property name that differs from the type? For example, when the type is "j" the content property is "s", giving "t": "j", "s": content. – Fernando Fas Feb 20 '20 at 09:23
Hi @FernandoFas sure, I added a `propertyName`, so you can have one different than the `type`. Check the updated code to see if it's ok – blex Feb 20 '20 at 18:26
Hey blex, that works perfectly. You are a NodeJS master. – Fernando Fas Feb 28 '20 at 08:48
Hi @blex, It's me again. Do you know if I can parse the data from an external file? – Fernando Fas Mar 25 '20 at 13:46
Hi @FernandoFas, sure, is the file on the same machine? If so, you can use `fs` to read the file. [Example here](https://pastebin.com/embed_iframe/gUzsBDun) – blex Mar 25 '20 at 14:18
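Reading the HTML from a local file with `fs` might look like the sketch below. It is not a copy of the linked example; `loadHtml`, the temp directory and the sample file are hypothetical names used only for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: returns the file's contents as a UTF-8 string
function loadHtml(filePath) {
  return fs.readFileSync(filePath, "utf8");
}

// Demo: write a small sample file to a temp directory, then read it back
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "html-to-json-"));
const samplePath = path.join(dir, "sample.html");
fs.writeFileSync(samplePath, "<body><p>Hello</p></body>");

const html = loadHtml(samplePath);
console.log(html);
// `html` can then be passed to HTMLParser.parse() in place of the inline template literal
```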
Bless you. The file is in the same machine yes and it works like a charm. Now I just have to separate all the scripts that are joined together in separate groups. I owe you a few beers. – Fernando Fas Mar 25 '20 at 14:30
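One possible way to keep consecutive scripts in separate groups is to skip the merge step for the "j" type. This is a sketch, assuming items shaped like the return value of `getProperties()`; `KEEP_SEPARATE` and `groupItems` are hypothetical names and the input is sample data.

```javascript
// Types whose consecutive nodes stay in separate groups (assumption: only scripts)
const KEEP_SEPARATE = new Set(["j"]);

// `items` are objects shaped like the return value of getProperties()
function groupItems(items) {
  return items.reduce((res, { type, propertyName, content }) => {
    const previousGroup = res[res.length - 1];
    const canMerge =
      previousGroup && previousGroup.t === type && !KEEP_SEPARATE.has(type);

    if (canMerge) {
      // Same type as the previous group and merging is allowed: append
      previousGroup[propertyName] += content;
    } else {
      // Otherwise start a new group
      res.push({ t: type, [propertyName]: content });
    }
    return res;
  }, []);
}

// Hypothetical input: two stylesheets, two scripts, one HTML fragment
const grouped = groupItems([
  { type: "s", propertyName: "s", content: "a{}" },
  { type: "s", propertyName: "s", content: "b{}" },
  { type: "j", propertyName: "s", content: "one();" },
  { type: "j", propertyName: "s", content: "two();" },
  { type: "h", propertyName: "h", content: "<p>x</p>" }
]);
console.log(grouped);
```

With this input the two stylesheets merge into one "s" group while each script keeps its own "j" group.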
