How can I build a tree of headings based on the DOM

Question

I have a DOM of headings. I want to output a tree from it.

const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A</h1>
    <h2>B1</h2>
    <h2>C</h2>
    <h3>D</h3>
    <h3>E</h3>
    <h4>F</h3>
    <h2>B2</h2>
  </body>`,
  "text/html"
);

This is the desired output, a hierarchical table of content tree:

const output = {
  A: {
    tag: "H1",
    children: [
      {
        B1: { tag: "H2" },
        C: {
          tag: "H2",
          children: [
            {
              D: { tag: "H3" },
              E: { tag: "H3" },
              children: [{ F: { tag: "H4" } }],
            },
          ],
        },
        B2: { tag: "H2" },
      },
    ],
  },
};

This is my attempt:

const headerTagSet = new Set(["h1", "h2", "h3", "h4"]);
const headerTags = ["h1", "h2", "h3", "h4"];

function buildTree(nodes) {
  const tree = {};
  let currentNode = tree;
  let prevLevel = 0;
  const stack = [];

  for (const node of nodes) {
    if (node.tagName && headerTagSet.has(node.tagName.toLowerCase())) {
      const currentLevel = headerTags.findIndex(
        (tag) => tag === node.tagName.toLowerCase()
      );
      const isNested = currentLevel > prevLevel;
      if (isNested) {
        const newNode = {
          [node.textContent]: {
            tag: node.tagName,
          },
        };
        stack.push(newNode);
        if (currentNode.childre) {
          currentNode.childre.push(newNode);
        } else {
          currentNode.children = [newNode];
        }
        prevLevel = currentLevel;
        currentNode = newNode;
      } else {
        currentNode[node.textContent] = { tag: node.tagName };
        stack.length = 0;
      }
    }
  }

  return tree;
}

buildTree(Array.from(doc.body.childNodes));

However, my current solution is not correct. I can't seem to figure out what the current algorithm is.

If I may make a suggestion that might not necessarily work for your use case: if at all possible, make your objects in your tree truly self-similar. Right now, `children` is an array of 1 element, an object with keys corresponding to the text contents of each header. I'd at least remove the extra array layer, or better yet, keep the array and move the object keys into a `content: "B1"` property per object. — ggorlen, Jul 25 '22 at 03:26
Being forced to do `Object.keys()` or `Object.entries()` to traverse a structure strikes me as an antipattern 6 times out of 7. — ggorlen, Jul 25 '22 at 03:39
@ggorlen do you mind writing up an answer explaining why that is an anti pattern and how you defined "being forced"? — Joji, Jul 25 '22 at 05:41
I would but that's not really your original question. You're forced to do it because of the structure. Arrays that always have 1 element might as well not be arrays and objects (nodes) in trees are much easier to use when you can rely on a steady set of keys--I think you're sort of conflating arrays and objects. It's not the end of the word, just a suggestion that'd make the code easier by a bit, and I suspect the structure more useful. Maybe you have reasons to keep it as is. — ggorlen, Jul 25 '22 at 14:27
you mean like I should've used array instead of object somewhere in the data structure? could you point out exactly where I did that? it's still a little unclear to me. — Joji, Jul 25 '22 at 17:39
Two smells. First, do you see `children`? It is always an array of exactly 1 element. So what's the point of that? Second smell is that `children` seems like it should be an array of the headers in order, represented as objects with `text: "B1"` (or whatever) as a property on each object. Rather than an object keyed based on the text contents. Many reasons for this, hard to explain. One is that if any headers happen to have the same text, they'll overwrite each other. Other issue is lack of self-similarity--objects aren't really meant to be iterated or ordered by design. Structure hard to use. — ggorlen, Jul 25 '22 at 17:51
Check out [Recursive tree search in a nested object structure in JavaScript](https://stackoverflow.com/questions/52066403/recursive-tree-search-in-a-nested-object-structure-in-javascript/) for a cleaner and more typical tree data structure, notwithstanding the smell that they should call their arrays "children" rather than "child" since it's an array. At least the structure is easy to work with and doesn't have redundant nesting and objects where arrays would probably be better. — ggorlen, Jul 25 '22 at 17:54
If your goal is fast access, key lookups by text content (or anything else)--that's what maps/objects do best--then probably keep it flat rather than a tree. Trees are for walking--arrays for children in n-ary trees. Unless you happen to guarantee text contents are unique and you need to be able to pick out the exact node in O(1) per level based on text, then it's justified, but this seems like a narrow use case and potentially premature optimization (size would have to be significant for this to be worth the pain). — ggorlen, Jul 25 '22 at 17:59

ggorlen · Accepted Answer · 2022-07-27T02:38:18.573

First of all, there seems to be an error in the structure. The expected shape for one of your nodes looks like ths:

{
  E: { tag: "H3" },
  children: [{ F: { tag: "H4" } }],
}

That is, a loose children tag conected to the root single object in C's children array, but I suspect this was intended:

{
  E: { tag: "H3", children: [{ F: { tag: "H4" } }] },
}

That is, the children property connected to E. I'm going to proceed under the latter assumption for a consistent structure.

Secondly, as described at length in the comments, the desired output structure seems suboptimal to me.

The first issue is that all children arrays have a single item always, which seems pointless. A guaranteed-one-item array might as well be stripped out and the lone element moved outward to replace the unnecessary wrapper. This saves an extra loop or [0] when it comes time to use this structure.

Secondly, when working with recursive structures, self-similarity of nodes is a desirable property which makes it easier to use the object. But the expected structure has arbitrary keys based on node text contents, which would generally be long strings in a typical DOM. Having to excessively use for ... in/Object.keys/Object.values/Object.entries or assign properties dynamically with {[key]: val} tend to be antipatterns, indicating abuse of objects as arrays.

Trees are usually used in traversals (looping over children arrays), not key lookups. Flat maps/objects are better for key lookups if that's how you ultimately aim to use the structure.

For a long time in JS' history, object key order wasn't guaranteed and there was no syntactical shorthand for dynamically assigning keys (i.e. {[key]: "val"}). Many other languages still don't provide these "features" for objects because they're easy to abuse to create suboptimal structures, as I'd argue is the case here.

Furthermore, with the desired structure, two sibling headers with the same text content will overwrite each other, resulting in potentially lost data and confusing bugs.

Another red flag is when the code to create the structure is fussy and difficult to write. That suggests the code to consume the structure will also be fussy and difficult to write. The point of programming is to make things as easy as possible, so I would choose the easier structure until forced to use something harder. And even then, if a third-party API requires it, I'd generally keep using the good structure and make the transformation to (and from) the uglier one at the last moment as a wrapper on the API.

To remedy these issues, consider the following array-based, self-similar structure. Easy to build and use.

const treeifyHeaders = dom => {
  const stack = [{tag: "H0", children: []}];
  const headers = "h1, h2, h3, h4, h5, h6";

  for (const header of dom.querySelectorAll(headers)) {
    const {tagName: tag, textContent: text} = header;
    const node = {tag, text};
    let last = stack.at(-1);

    while (last.tag >= node.tag) {
      stack.pop();
      last = stack.at(-1);
    }
    
    last.children = last.children || [];
    last.children.push(node);
    stack.push(node);
  }

  return stack[0].children;
};

const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A1</h1>
    <h2>B1</h2>
    <h2>B2</h2>
    <h2>B3</h2>
    <h2>B4</h2>
    <h3>C1</h3>
    <h3>C2</h3>
    <h4>D1</h4>
    <h6>F1</h6>
    <h4>D2</h4>
    <h2>B5</h2>
    <h3>C3</h3>
    <h1>A2</h1>
    <h2>B6</h2>
  </body>`,
  "text/html"
);

console.log(treeifyHeaders(dom));

.as-console-wrapper{max-height:100%!important}

Note that the output when a header skips nesting during a pop might be handled differently than expected. For example, <h1></h1><h3></h3><h2></h2> will set both <h2> and <h3> as direct children of <h1>--probably the most reasonable result given the circumstances. Since such a case isn't in your example, I assume it doesn't need to be handled yet.

Now, if you're certain that you "accept the terms and conditions" and you want to (or have to) use the original structure, here's the code to create it. Writing the code directly was annoying, so I just transformed the easier-to-use structure into the harder-to-use-one.

If you have huge trees, the weird tree can be produced directly without the nice tree step, but until the motivation exists, I'd avoid it as premature optimization.

const makeWeirdTree = roots => {
  const o = {};

  for (const child of roots) {
    o[child.text] = {tag: child.tag};

    if (child.children) {

      // remove these brackets to get rid of the extra array wrapper
      //                       V                             V
      o[child.text].children = [makeWeirdTree(child.children)];
    }
  }

  return o;
};

const treeifyHeaders = dom => {
  const stack = [{tag: "H0", children: []}];
  const headers = "h1, h2, h3, h4, h5, h6";

  for (const header of dom.querySelectorAll(headers)) {
    const {tagName: tag, textContent: text} = header;
    const node = {tag, text};
    let last = stack.at(-1);

    while (last.tag >= node.tag) {
      stack.pop();
      last = stack.at(-1);
    }

    last.children = last.children || [];
    last.children.push(node);
    stack.push(node);
  }

  return stack[0].children;
};

const expected = {
  A: {
    tag: "H1",
    children: [{
      B1: {tag: "H2"},
      C: {
        tag: "H2",
        children: [{
          D: {tag: "H3"},
          E: {
            tag: "H3",
            children: [{
              F: {tag: "H4"}
            }]
          },
        }],
      },
      B2: {tag: "H2"}
    }]
  }
};
const dom = new DOMParser().parseFromString(
  `<!DOCTYPE html>
  <body>
    <h1>A</h1>
    <h2>B1</h2>
    <h2>C</h2>
    <h3>D</h3>
    <h3>E</h3>
    <h4>F</h3>
    <h2>B2</h2>
  </body>`,
  "text/html"
);
console.log(makeWeirdTree(treeifyHeaders(dom)));
console.log(_.isEqual(expected, makeWeirdTree(treeifyHeaders(dom))));

.as-console-wrapper {max-height: 100%!important}

<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.4/lodash.min.js"></script>

Just to drive the point home, here's what basic traversal code looks like for the two structures. The version with children arrays feels a fair bit more natural than the one with objects.

const traverseNormal = (roots, depth=0) => {
  roots.forEach(({tag, text, children}) => {
    console.log("__".repeat(depth) + `${tag} => ${text}`);
    children && traverseNormal(children, depth + 1);
  });
};

const traverseWeird = (roots, depth=0) => {
  Object.entries(roots).forEach(([text, {tag, children}]) => {
    console.log("__".repeat(depth) + `${tag} => ${text}`);

    if (children && children[0]) {
      traverseWeird(children[0], depth + 1);
    }
  });
};

traverseNormal(treeifyHeaders(dom));
console.log("-".repeat(30));
traverseWeird(makeWeirdTree(treeifyHeaders(dom)));

Notice how if we'd made the node keys fully identical by adding empty arrays to leaves, the consumer's code becomes even simpler, avoiding the children && condition. The tradeoff is extra object allocations, which might be prohibitively expensive, but the point stands: prefer self-similarity in designing tree nodes.

Hi thanks for the answer! A quick question about the last bit where you DFS the tree. Why is that needed? Or are you saying that we can use it to traverse the output tree? — Joji, Jul 26 '22 at 19:24
It's not needed, it's just there to show that the structure you want is more annoying to work with than the structure I'm suggesting. Often times, people pick data structures without fully understanding what they're signing up for, leading to a lot of future pain. I'm trying to steer you away from what looks like a bad pattern. At the very least, I would remove the extra array. — ggorlen, Jul 26 '22 at 19:31
got it. Yea the structure I described in the question might not be ideal and I can certainly use the structure you proposed. Btw do you happen to know if there is any other library that does this `treeifyHeaders` for us? — Joji, Jul 26 '22 at 19:34
Not that I know of or I'd have used it. That said, I wouldn't be surprised if one existed, so doing some research is not a bad idea. Feel free to add an answer if you find one. — ggorlen, Jul 26 '22 at 19:37

How can I build a tree of headings based on the DOM

1 Answers1