0

Context

I'm building a set of 'extractor' functions whose purpose is to extract what looks like components from a page (using jsdom and nodejs). The final result should be these 'component' objects ordered by where they originally appeared in the page.

Problem

The last part of this process is a bit problematic. As far as I can see, there's no easy way to tell where a given element sits in the source code of a given DOM document.

The numeric depth or css/xpath-like path also doesn't feel helpful in this case.

Example

With the given extractors...

const extractors = [

  // Extract buttons
  dom => 
    Array.from(dom.window.document.querySelectorAll('button'))
    .map(elem => ({
      type: 'button',
      name: elem.name,
      position: undefined, /* this part needs to be computed from elem */
    })),

  // Extract links
  dom => 
    Array.from(dom.window.document.querySelectorAll('a'))
    .map(elem => ({
      type: 'link',
      name: elem.textContent,
      position: undefined, /* this part needs to be computed from elem */
      link: elem.href,
    })),

];

...and the given document (I know, it's an ugly and un-semantic example..):

<html>
  <body>
    <a href="/">Home</a>
    <button>Login</button>
    <a href="/about">About</a>
...

I need something like:

[
  { type: 'button', name: 'Login', position: 45, ... },
  { type: 'link', name: 'Home', position: 20, ... },
  { type: 'link', name: 'About', position: 72, ... },
]

(which can be later ordered by item.position)

For example, 45 is the position/offset of the <button within the example HTML string.
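To make the intended final step concrete, here is a minimal sketch of merging and ordering the extractor results once `position` is filled in. `collectComponents` and `stubExtractors` are hypothetical names, and the stubs simply return the offsets from the example above instead of reading a real DOM.

```javascript
// Hypothetical helper: run every extractor and order the merged results
// by their numeric `position` (ascending, i.e. source order).
function collectComponents(dom, extractors) {
  return extractors
    .flatMap(extract => extract(dom))
    .sort((a, b) => a.position - b.position);
}

// Stub extractors standing in for the real ones; the positions are the
// offsets from the example document above.
const stubExtractors = [
  () => [{ type: 'button', name: 'Login', position: 45 }],
  () => [
    { type: 'link', name: 'Home', position: 20 },
    { type: 'link', name: 'About', position: 72 },
  ],
];

const ordered = collectComponents(null, stubExtractors);
console.log(ordered.map(c => c.name)); // [ 'Home', 'Login', 'About' ]
```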

Christian
  • What exactly do you mean by "position"? Does it have to be the position of a name in the string representation of the dom or can it be its logical position in the dom hierarchy? Using your sample html, for example, that logical position for the first `<a>` element would be 3 (or 2, counting from zero) including the root element. Would that be enough? – Jack Fleeting Jul 30 '22 at 10:49
  • Assuming jsdom implements the complete DOM API, you could just [compare the nodes](https://developer.mozilla.org/en-US/docs/Web/API/Node/compareDocumentPosition) directly – Bergi Jul 30 '22 at 11:25
  • @JackFleeting I meant position in the source code. ie, a naive approach would be `document.body.parentElement.outerHTML.indexOf(elemToFind.outerHTML)`. – Christian Jul 30 '22 at 13:31
  • @Bergi I thought about that but then I need to keep track of every extracted component's topmost element so that I can do the comparison later. It feels like too much work for something that feels very static and immutable (unless the DOM changes). – Christian Jul 30 '22 at 13:35
  • @Christian You don't need to keep track of the "topmost element", just of the extracted element itself. But yes, especially if the DOM doesn't change, walking the DOM once and assigning every element an index will be the simplest and most efficient solution. – Bergi Jul 30 '22 at 16:56

3 Answers

1

You could just iterate all the elements in the DOM and assign them an index, given your DOM doesn't change:

const pos = Symbol('document position');
for (const [index, element] of document.querySelectorAll('*').entries()) {
    element[pos] = index;
}

Then your extractor can just use that:

dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: elem[pos],
  link: elem.href,
})),
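If adding a symbol property to every element is undesirable, the same indexing idea can be sketched with a `WeakMap` instead. This is a variation, not part of the approach above; `indexByDocumentOrder` is a hypothetical helper, and plain objects stand in for DOM elements so the snippet runs on its own.

```javascript
// Hypothetical helper: map each element to its index in document order
// without adding properties to the elements themselves.
function indexByDocumentOrder(elements) {
  const positions = new WeakMap();
  let index = 0;
  for (const element of elements) {
    positions.set(element, index++);
  }
  return positions;
}

// With jsdom this would be called as
//   indexByDocumentOrder(dom.window.document.querySelectorAll('*'))
// Plain objects stand in for elements here.
const html = { tagName: 'HTML' };
const body = { tagName: 'BODY' };
const link = { tagName: 'A' };
const positions = indexByDocumentOrder([html, body, link]);
console.log(positions.get(link)); // 2
```

An extractor would then read `positions.get(elem)` where the version above reads `elem[pos]`.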

Alternatively, jsdom can attach the source position in the parsed HTML text to every node; see the includeNodeLocations option. The startOffset values are in document order as well, so if you parse the input with that option enabled, you can use:

dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: dom.nodeLocation(elem).startOffset,
  link: elem.href,
})),
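Putting the pieces together, a minimal sketch of the `includeNodeLocations` route might look like the following. `parseWithLocations`, `makeExtractor` and `locatedExtractors` are hypothetical names, and using `textContent` as the button name is an assumption made to match the expected output in the question. jsdom is required lazily, so the extractors work with any object exposing `window.document` and `nodeLocation`.

```javascript
// Parse with source locations enabled; without this option
// dom.nodeLocation() has no location data to return.
function parseWithLocations(html) {
  const { JSDOM } = require('jsdom'); // loaded only when parsing
  return new JSDOM(html, { includeNodeLocations: true });
}

// Hypothetical factory: builds an extractor for a selector and fills in
// `position` from dom.nodeLocation(elem).startOffset.
function makeExtractor(selector, describe) {
  return dom =>
    Array.from(dom.window.document.querySelectorAll(selector), elem => ({
      ...describe(elem),
      position: dom.nodeLocation(elem).startOffset,
    }));
}

const locatedExtractors = [
  // Assumption: the button's text is used as its name.
  makeExtractor('button', elem => ({ type: 'button', name: elem.textContent })),
  makeExtractor('a', elem => ({
    type: 'link',
    name: elem.textContent,
    link: elem.href,
  })),
];
```

With a document parsed by `parseWithLocations(html)`, the results of `locatedExtractors.flatMap(extract => extract(dom))` can then be sorted by `position` to recover source order.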
Bergi
0

I'm not sure this is exactly (or even close to) what you are after, but it may get you closer:

const extracted = [];

const elems = [...document.querySelectorAll('*')];
for (const elem of elems) {
  // Each entry: [tag name, text, index in document order]
  extracted.push([elem.tagName, elem.innerText, elems.indexOf(elem)]);
}

Then if you want to look up a specific element (assuming the DOM didn't change...), something like this should work:

extracted.filter(x =>
  x[0] == "A")
Jack Fleeting
  • Don't use `elems.indexOf(elem)`, that's horribly inefficient! Just keep track of a counter, or use `for (const [index, elem] of elems.entries())` – Bergi Jul 30 '22 at 16:59
-1

One possible rough way I can think of is something like:

function findPos(elem){
  elem.setAttribute('data-pf', '1');
  try {
    return elem.ownerDocument.documentElement.outerHTML.indexOf('data-pf');
  } finally {
    elem.removeAttribute('data-pf');
  }
}

see also: https://github.com/jsdom/jsdom#serializing-the-document-with-serialize

However, besides being imprecise, this feels like overkill and may perform badly (though unless it's extremely slow, that's not a big problem, since this task is a one-time job).

Christian
  • No, don't do `.outerHTML.indexOf()`. Just get the [`nodeLocation`](https://github.com/jsdom/jsdom#getting-the-source-location-of-a-node-with-nodelocationnode) if you want to follow this approach – Bergi Jul 30 '22 at 11:26
  • @Bergi nice find, in essence that would do what I mentioned in zer00ne's answer. You can also add it as answer, by the way.. – Christian Jul 30 '22 at 13:41
  • @Bergi actually please do add it as an answer, I've taken this approach since it's much faster to set up and easy to use. – Christian Jul 30 '22 at 16:12