77

I'm attempting map HTML into JSON with structure intact. Are there any libraries out there that do this or will I need to write my own? I suppose if there are no html2json libraries out there I could take an xml2json library as a start. After all, html is only a variant of xml anyway right?

UPDATE: Okay, I should probably give an example. What I'm trying to do is the following. Parse a string of html:

<div>
  <span>text</span>Text2
</div>

into a json object like so:

{
  "type" : "div",
  "content" : [
    {
      "type" : "span",
      "content" : [
        "Text2"
      ]
    },
    "Text2"
  ]
}

NOTE: In case you didn't notice the tag, I'm looking for a solution in Javascript

Damjan Pavlica
  • 31,277
  • 10
  • 71
  • 76
Julian Krispel-Samsel
  • 7,512
  • 3
  • 33
  • 40
  • 2
    what are you trying to achieve in general? – Tom Oct 19 '12 at 18:55
  • 1
    What's your environment? Browser? Server? – I Hate Lazy Oct 19 '12 at 18:55
  • @zzzzBov you'll need to do a whole lot more than 'just iterating' through the dom to be a good html2json parser I assume. the idea of this question is to see if somebody did this job already and whether I can use it/learn from it... – Julian Krispel-Samsel Oct 19 '12 at 20:56
  • @nimrod, HTML elements contain nodes, nodes can be either text, comments, or elements, elements have attributes, elements have namespaces, elements have names. Start at ``, recurse through each child node. Done. – zzzzBov Oct 19 '12 at 21:19
  • @zzzzBov as I said I'm trying to parse a string of html, I'm not reading from the dom... – Julian Krispel-Samsel Oct 20 '12 at 01:28
  • 3
    @nimrod, create a document fragment using your HTML string, and let the DOM do the work for you. It doesn't have to be appended to the page for you to take advantage of the web browser's HTML parsing abilities. – zzzzBov Oct 20 '12 at 02:56
  • So it's browser, not server. You do understand that JavaScript runs on the server, right? Was such a simple question. – I Hate Lazy Oct 21 '12 at 05:01
  • @user1689607 it doesn't really matter whether the environment is browser or server... I just thought it was silly that you made the same comment twice... – Julian Krispel-Samsel Oct 21 '12 at 14:03
  • Of course it matters. Browsers have a built-in HTML parser. You're asking about HTML parsing. If you knew if it mattered or not, you probably would have no need to ask the question in the first place. – I Hate Lazy Oct 21 '12 at 14:43
  • @user1689607 if I knew that it mattered I would have asked the question... you're right of course. – Julian Krispel-Samsel Oct 21 '12 at 20:19
  • you could try to use e-json from EHTML: https://github.com/Guseyn/EHTML – Guseyn Ismayylov Nov 22 '19 at 14:57

5 Answers5

91

I just wrote this function that does what you want; try it out let me know if it doesn't work correctly for you:

// Test with an element.
var initElement = document.getElementsByTagName("html")[0];
var json = mapDOM(initElement, true);
console.log(json);

// Test with a string.
initElement = "<div><span>text</span>Text2</div>";
json = mapDOM(initElement, true);
console.log(json);

function mapDOM(element, json) {
    var treeObject = {};
    
    // If string convert to document Node
    if (typeof element === "string") {
        if (window.DOMParser) {
              parser = new DOMParser();
              docNode = parser.parseFromString(element,"text/xml");
        } else { // Microsoft strikes again
              docNode = new ActiveXObject("Microsoft.XMLDOM");
              docNode.async = false;
              docNode.loadXML(element); 
        } 
        element = docNode.firstChild;
    }
    
    //Recursively loop through DOM elements and assign properties to object
    function treeHTML(element, object) {
        object["type"] = element.nodeName;
        var nodeList = element.childNodes;
        if (nodeList != null) {
            if (nodeList.length) {
                object["content"] = [];
                for (var i = 0; i < nodeList.length; i++) {
                    if (nodeList[i].nodeType == 3) {
                        object["content"].push(nodeList[i].nodeValue);
                    } else {
                        object["content"].push({});
                        treeHTML(nodeList[i], object["content"][object["content"].length -1]);
                    }
                }
            }
        }
        if (element.attributes != null) {
            if (element.attributes.length) {
                object["attributes"] = {};
                for (var i = 0; i < element.attributes.length; i++) {
                    object["attributes"][element.attributes[i].nodeName] = element.attributes[i].nodeValue;
                }
            }
        }
    }
    treeHTML(element, treeObject);
    
    return (json) ? JSON.stringify(treeObject) : treeObject;
}

Working example: http://jsfiddle.net/JUSsf/ (Tested in Chrome, I can't guarantee full browser support - you will have to test this).

​It creates an object that contains the tree structure of the HTML page in the format you requested and then uses JSON.stringify() which is included in most modern browsers (IE8+, Firefox 3+ .etc); If you need to support older browsers you can include json2.js.

It can take either a DOM element or a string containing valid XHTML as an argument (I believe, I'm not sure whether the DOMParser() will choke in certain situations as it is set to "text/xml" or whether it just doesn't provide error handling. Unfortunately "text/html" has poor browser support).

You can easily change the range of this function by passing a different value as element. Whatever value you pass will be the root of your JSON map.

TylerH
  • 20,799
  • 66
  • 75
  • 101
George Reith
  • 13,132
  • 18
  • 79
  • 148
22

htlm2json

Representing complex HTML documents will be difficult and full of corner cases, but I just wanted to share a couple techniques to show how to get this kind of program started. This answer differs in that it uses data abstraction and the toJSON method to recursively build the result

Below, html2json is a tiny function which takes an HTML node as input and it returns a JSON string as the result. Pay particular attention to how the code is quite flat but it's still plenty capable of building a deeply nested tree structure – all possible with virtually zero complexity

const Elem = e => ({
  tagName: 
    e.tagName,
  textContent:
    e.textContent,
  attributes:
    Array.from(e.attributes, ({name, value}) => [name, value]),
  children:
    Array.from(e.children, Elem)
})

const html2json = e =>
  JSON.stringify(Elem(e), null, '  ')
  
console.log(html2json(document.querySelector('main')))
<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>

In the previous example, the textContent gets a little butchered. To remedy this, we introduce another data constructor, TextElem. We'll have to map over the childNodes (instead of children) and choose to return the correct data type based on e.nodeType – this gets us a littler closer to what we might need

const TextElem = e => ({
  type:
    'TextElem',
  textContent:
    e.textContent
})

const Elem = e => ({
  type:
    'Elem',
  tagName: 
    e.tagName,
  attributes:
    Array.from(e.attributes, ({name, value}) => [name, value]),
  children:
    Array.from(e.childNodes, fromNode)
})

const fromNode = e => {
  switch (e?.nodeType) {
    case 1: return Elem(e)
    case 3: return TextElem(e)
    default: throw Error(`unsupported nodeType: ${e.nodeType}`) 
  }
}

const html2json = e =>
  JSON.stringify(Elem(e), null, '  ')
  
console.log(html2json(document.querySelector('main')))
<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>

Anyway, that's just two iterations on the problem. Of course you'll have to address corner cases where they come up, but what's nice about this approach is that it gives you a lot of flexibility to encode the HTML however you wish in JSON – and without introducing too much complexity

In my experience, you could keep iterating with this technique and achieve really good results. If this answer is interesting to anyone and would like me to expand upon anything, let me know ^_^

Related: Recursive methods using JavaScript: building your own version of JSON.stringify


json2html

Above we go from HTML to JSON and now we can go from JSON to HTML. When we can convert between two data types without losing data, this is called an isomorphism. All we are essentially doing here is writing the inverses of each function above -

const HtmlNode = (tagName, attributes = [], children = []) => {
  const e = document.createElement(tagName)
  for (const [k, v] of attributes) e.setAttribute(k, v)
  for (const child of children) e.appendChild(toNode(child))
  return e
}

const TextNode = (text) => {
  return document.createTextNode(text)
}
  
const toNode = t => {
  switch (t?.type) {
    case "Elem": return HtmlNode(t.tagName, t.attributes, t.children)
    case "TextElem": return TextNode(t.textContent)
    default: throw Error("unsupported type: " + t.type)
  }
}

const json2html = json =>
  toNode(JSON.parse(json))

const parsedJson =
  {"type":"Elem","tagName":"MAIN","attributes":[],"children":[{"type":"TextElem","textContent":"\n  "},{"type":"Elem","tagName":"H1","attributes":[["class","mainHeading"]],"children":[{"type":"TextElem","textContent":"Some heading"}]},{"type":"TextElem","textContent":"\n  "},{"type":"Elem","tagName":"UL","attributes":[["id","menu"]],"children":[{"type":"TextElem","textContent":"\n    "},{"type":"Elem","tagName":"LI","attributes":[],"children":[{"type":"Elem","tagName":"A","attributes":[["href","/a"]],"children":[{"type":"TextElem","textContent":"a"}]}]},{"type":"TextElem","textContent":"\n    "},{"type":"Elem","tagName":"LI","attributes":[],"children":[{"type":"Elem","tagName":"A","attributes":[["href","/b"]],"children":[{"type":"TextElem","textContent":"b"}]}]},{"type":"TextElem","textContent":"\n    "},{"type":"Elem","tagName":"LI","attributes":[],"children":[{"type":"Elem","tagName":"A","attributes":[["href","/c"]],"children":[{"type":"TextElem","textContent":"c"}]}]},{"type":"TextElem","textContent":"\n  "}]},{"type":"TextElem","textContent":"\n  "},{"type":"Elem","tagName":"P","attributes":[],"children":[{"type":"TextElem","textContent":"some text"}]},{"type":"TextElem","textContent":"\n"}]}

document.body.appendChild(toNode(parsedJson))
Mulan
  • 129,518
  • 31
  • 228
  • 259
  • hey this is great, how do you convert your JSON back to HTML? – minigeek Dec 14 '21 at 07:05
  • i am trying to make html2json and json2html converter for my project, didn't find any npm package for it. reason for choosing this way is to build scalable and dragndrop platform :(. can you provide a json2html too please – minigeek Dec 14 '21 at 07:23
  • @minigeek sure, i added `json2html`. please note this post is a bit old and i updated `html2json` to simplify it even more. as mentioned there are corner cases that may arise depending on your particular structure. and there is nothing here that handles event listeners or other data you may have attached to your nodes. you will have to address those things on your own. – Mulan Dec 14 '21 at 16:09
  • wow thanks, man, you are a genius! – minigeek Dec 16 '21 at 04:32
1

I got few links sometime back while reading on ExtJS full framework in itself is JSON.

http://www.thomasfrank.se/xml_to_json.html

http://camel.apache.org/xmljson.html

online XML to JSON converter : http://jsontoxml.utilities-online.info/

UPDATE BTW, To get JSON as added in question, HTML need to have type & content tags in it too like this or you need to use some xslt transformation to add these elements while doing JSON conversion

<?xml version="1.0" encoding="UTF-8" ?>
<type>div</type>
<content>
    <type>span</type>
    <content>Text2</content>
</content>
<content>Text2</content>
Nas
  • 887
  • 1
  • 11
  • 21
1

Thank you @Gorge Reith. Working off the solution provided by @George Reith, here is a function that furthers (1) separates out the individual 'hrefs' links (because they might be useful), (2) uses attributes as keys (since attributes are more descriptive), and (3) it's usable within Node.js without needing Chrome by using the 'jsdom' package:

const jsdom = require('jsdom') // npm install jsdom provides in-built Window.js without needing Chrome


// Function to map HTML DOM attributes to inner text and hrefs
function mapDOM(html_string, json) {
    treeObject = {}

    // IMPT: use jsdom because of in-built Window.js
    // DOMParser() does not provide client-side window for element access if coding in Nodejs
    dom = new jsdom.JSDOM(html_string)
    document = dom.window.document
    element = document.firstChild

    // Recursively loop through DOM elements and assign attributes to inner text object
    // Why attributes instead of elements? 1. attributes more descriptive, 2. usually important and lesser
    function treeHTML(element, object) {
        var nodeList = element.childNodes;
        if (nodeList != null) {
           if (nodeList.length) {
               object[element.nodeName] = []  // IMPT: empty [] array for non-text recursivable elements (see below)
               for (var i = 0; i < nodeList.length; i++) {
                   // if final text
                   if (nodeList[i].nodeType == 3) {
                       if (element.attributes != null) {
                           for (var j = 0; j < element.attributes.length; j++) {
                                if (element.attributes[j].nodeValue !== '' && 
                                    nodeList[i].nodeValue !== '') {
                                    if (element.attributes[j].name === 'href') { // separate href
                                        object[element.attributes[j].name] = element.attributes[j].nodeValue;
                                    } else {
                                        object[element.attributes[j].nodeValue] = nodeList[i].nodeValue;
                                    }

                                }
                           }
                       }
                   // else if non-text then recurse on recursivable elements
                   } else {
                       object[element.nodeName].push({}); // if non-text push {} into empty [] array
                       treeHTML(nodeList[i], object[element.nodeName][object[element.nodeName].length -1]);
                   }
               }
           }
        }
    }
    treeHTML(element, treeObject);

    return (json) ? JSON.stringify(treeObject) : treeObject;
}
Yi Xiang Chong
  • 744
  • 11
  • 9
1

I had a similar issue where I wanted to represent HTML as JSON in the following way:

  • For HTML text nodes, use a string
  • For HTML elements, use an array with:
    • The (tag) name of the element
    • An object, mapping attribute keys to attribute values
    • The (inlined) list of children nodes

Example:

<div>
  <span>text</span>Text2
</div>

becomes

[
   'div',
   {},
   ['span', {}, 'text'],
   'Text2'
]

I wrote a function which handles transforming a DOM Element into this kind of JS structure. You can find this function at the end of this answer. The function is written in Typescript. You can use the Typescript playground to convert it to clean JavaScript.


Furthermore, if you need to parse an html string into DOM, assign to .innerHtml:

let element = document.createElement('div')
element.innerHtml = htmlString

Also, this one is common knowledge but if you need a JSON string output, use JSON.stringify.


/**
 * A NodeDescriptor stands for either an (HTML) Element, or for a text node
 */
export type NodeDescriptor = ElementDescriptor | string

/**
 * Array representing an HTML Element. It consists of:
 *
 * - The (tag) name of the element
 * - An object, mapping attribute keys to attribute values
 * - The (inlined) list of children nodes
 */
export type ElementDescriptor = [
   string,
   Record<string, string>,
   ...NodeDescriptor[]
]

export let htmlToJs = (element: Element, trim = true): ElementDescriptor => {
   let convertElement = (element: Element): ElementDescriptor => {
      let attributeObject: Record<string, string> = {}
      for (let { name, value } of element.attributes) {
         attributeObject[name] = value
      }

      let childArray: NodeDescriptor[] = []
      for (let node of element.childNodes) {
         let converter = htmlToJsDispatch[node.nodeType]
         if (converter) {
            let descriptor = converter(node as any)
            let skip = false

            if (trim && typeof descriptor === 'string') {
               descriptor = descriptor.trim()
               if (descriptor === '') skip = true
            }

            if (!skip) childArray.push(descriptor)
         }
      }

      return [element.tagName.toLowerCase(), attributeObject, ...childArray]
   }

   let htmlToJsDispatch = {
      [element.ELEMENT_NODE]: convertElement,
      [element.TEXT_NODE]: (node: Text): string => node.data,
   }

   return convertElement(element)
}
Mathieu CAROFF
  • 1,230
  • 13
  • 19