How can I use JavaScript to get all HTML elements containing text, and filter out specified elements and their children?

Question

In simple terms

I want to get all the elements in the HTML that contain text, and filter out specific elements (such as 'pre' or 'script' tags) along with their children.

According to my Google search, querySelectorAll is inefficient, and TreeWalker is the most efficient, isn't it?

The problem with my code is that it filters out specific elements, but not the children of those elements.

I implemented a feature that uses JavaScript to get all the text elements in the HTML.

I want to exclude some elements from the results, such as "pre", or "div" elements with special class names.

I can filter out these elements themselves, but their children are still retrieved, and I can't get rid of them.

What should I do?

I got my inspiration from this page: getElementsByTagName() equivalent for textNodes

Documentation for document.createTreeWalker:
https://developer.mozilla.org/en-US/docs/Web/API/Document/createTreeWalker#parameters

<!DOCTYPE html>
<html>
<head>
<script>
function nativeTreeWalker() {
    var walker = document.createTreeWalker(
        document.body, 
        NodeFilter.SHOW_TEXT,
        {acceptNode: function(node) {

          // ===========================
          // Filter out text whose direct parent is one of these elements.
          // But this doesn't filter out text inside their child elements??????
          if (['STYLE', 'SCRIPT', 'PRE'].includes(node.parentElement?.nodeName)) {
            return NodeFilter.FILTER_REJECT;
          }
          // ===========================

          // Accept only text that is not empty or whitespace-only
          if (! /^\s*$/.test(node.data) ) {
            return NodeFilter.FILTER_ACCEPT;
          }
          return NodeFilter.FILTER_SKIP;
        }
        },
        true  // Meant to skip child elements, but it has no effect
    );

    var node;
    var textNodes = [];
    while(node = walker.nextNode()){
        textNodes.push(node.nodeValue);
    }
    return textNodes
}

window.onload = function(){
  console.log(nativeTreeWalker())
}
</script>
</head>
<body>
get the text
<p> </p>
<div>This is text, get</div>
<p>This is text, get too</p>

<pre>
  This is code,Don't get
  <p>this is code too, don't get</p>
</pre>

<div class="this_is_code">
  This is className is code, Don't get
  <span>this is code too, don't get</span>
</div>
</body></html>

The expected output of the code above is:

0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
length: 3

Instead, the actual output is:

0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
3: "this is code too, don't get"
4: "\n This is className is code, Don't get\n "
5: "this is code too, don't get"
length: 6

What are you **actually** trying to accomplish here? The `TreeWalker` API is... kinda old-and-busted, so there's probably a more modern, better approach to traversing the DOM. For example, if you want to match/select/extract elements (not nodes), have you tried `querySelectorAll` instead? — Dai, Jul 03 '22 at 18:58
I want to get all the elements in the HTML that contain text and filter out specific elements (such as tags like ‘Pre’ or ‘Script’). According to my Google search, querySelectorAll is inefficient, and TreeWalker is the most efficient, isn't it? The problem with my code is that it filters out specific elements, but not the children of those elements. — dong, Jul 03 '22 at 19:34
What articles/resources are claiming that `querySelectorAll` is "inefficient", exactly? (Yes, it's _imperceivably_ slower than `getElementById` or `getElementsByTagName`, but those are very specific use-cases; in the general case `querySelectorAll` is a first-class, first-rate DOM function.) — Dai, Jul 03 '22 at 19:36
This answer shows that querySelectorAll is about 6 times slower than TreeWalker: https://stackoverflow.com/questions/2579666/getelementsbytagname-equivalent-for-textnodes — dong, Jul 03 '22 at 19:46
Those numbers are from 2012 (10 years ago, back when IE10 was popular) and are irrelevant today because all modern browsers have heavily optimized CSS selector matching engines. Please find a _recent_ article or benchmark from the past ~3 years. — Dai, Jul 03 '22 at 19:48
More importantly, querySelectorAll doesn't do what I need, because I need to get "all elements with text" rather than elements selected by class name — dong, Jul 03 '22 at 19:49
I have revised the question in response to your question. Thank you — dong, Jul 03 '22 at 19:53
@Dai `TreeWalker` is old (so are HTML, CSS, JS, etc.), but what is "busted" about it? — jsejcksn, Jul 03 '22 at 20:49

score 1 · Accepted Answer · answered Jul 03 '22 at 20:36

Your expectations don't quite match the code you've shown in your question. For example, the top-level text node containing "Don't get code:" is a valid node according to your criteria.

You can use the TreeWalker API to achieve the desired results. Part of the solution to your problem is to look up the closest ancestor of the text node that matches one of your exclusion criteria (using `Element.closest()`), and to reject the node if such an ancestor exists:


<!doctype html>
<html>
<head>
<script type="module">
function filterTextNode (textNode) {
  if (!textNode.textContent?.trim()) return NodeFilter.FILTER_REJECT;
  const ancestor = textNode.parentElement?.closest('pre,script,style,.this_is_code');
  if (ancestor) return NodeFilter.FILTER_REJECT;
  return NodeFilter.FILTER_ACCEPT;
}

function getFilteredTexts (textNodeFilterFn) {
  const walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT,
    {acceptNode: textNodeFilterFn},
  );
  const results = [];
  let node = walker.nextNode();
  while (node) {
    results.push(node.textContent);
    node = walker.nextNode();
  }
  return results;
}

function main () {
  const texts = getFilteredTexts(filterTextNode);
  console.log(texts);
}

main();
</script>
</head>
<body>
  <p> </p>
  
  get text:
  <div>This is text, get</div>
  <p>This is text, get too</p>
  
  Don't get code:
  <pre>
    This is code,Don't get
    <p>this is code too, don't get</p>
  </pre>
  
  <div class="this_is_code">
    This is className is code, Don't get
    <span>this is code too, don't get</span>
  </div>
</body>
</html>
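
A variation on the same idea, offered as a sketch rather than as part of the original answer: with `NodeFilter.SHOW_TEXT` alone, element nodes never reach `acceptNode`, so returning `FILTER_REJECT` for a text node only drops that one text node. If the walker is also asked to show elements, rejecting an excluded element prunes its entire subtree, and no ancestor lookup is needed for each text node. The names `pruningFilter`, `getTextsByPruning`, and `EXCLUDED` are hypothetical, and the constants are written out as numbers only so the filter can also be exercised outside a browser.

```javascript
// Numeric values of the NodeFilter constants from the DOM standard.
const FILTER_ACCEPT = 1;  // NodeFilter.FILTER_ACCEPT
const FILTER_REJECT = 2;  // NodeFilter.FILTER_REJECT: skip the node and its subtree
const FILTER_SKIP = 3;    // NodeFilter.FILTER_SKIP: skip the node, still visit children
const SHOW_ELEMENT = 0x1; // NodeFilter.SHOW_ELEMENT
const SHOW_TEXT = 0x4;    // NodeFilter.SHOW_TEXT

// Same selector as in the answer's code.
const EXCLUDED = 'pre,script,style,.this_is_code';

// Hypothetical filter: elements are never returned, only used for pruning.
function pruningFilter(node) {
  if (node.nodeType === 1) { // Node.ELEMENT_NODE
    // Rejecting an excluded element removes everything beneath it.
    return node.matches(EXCLUDED) ? FILTER_REJECT : FILTER_SKIP;
  }
  // Text node: keep it only if it has non-whitespace content.
  return node.data.trim() ? FILTER_ACCEPT : FILTER_REJECT;
}

// Browser-only: collects the accepted text nodes' contents under `root`.
function getTextsByPruning(root = document.body) {
  const walker = document.createTreeWalker(
    root,
    SHOW_ELEMENT | SHOW_TEXT,
    { acceptNode: pruningFilter },
  );
  const results = [];
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    results.push(node.data);
  }
  return results;
}
```

In a page, `console.log(getTextsByPruning())` would be called once the body has been parsed, for example from a module script as in the answer.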

Wow, it works! You are incredible. I don't understand how it works, but it works. I just put `main();` inside `window.onload = function(){}`. Thank you. — dong, Jul 03 '22 at 21:25
@dong Glad it’s working for you. [Does it answer your question?](https://stackoverflow.com/help/someone-answers) (Also note that execution of [modules](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules) is automatically deferred and therefore generally doesn’t need to be scheduled in response to the window’s `load` event.) — jsejcksn, Jul 03 '22 at 22:36
Yes, I've marked the answer as solving the problem. I have a question: how does this line of code filter out the element and all its children? It even covers children of child elements. `textNode.parentElement?.closest('script,style,pre,title,code,.this_is_code');` — dong, Jul 04 '22 at 12:37
@dong I assume you already read the documentation that I linked to in the answer for [`Element.closest()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/closest): starting from the parent element of the text node, it searches for the closest ancestor element which matches the provided selector. If such an element is found, then that means the text node is within a tree that you want to avoid, so it is rejected as a candidate result on the following line. — jsejcksn, Jul 04 '22 at 16:53
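
The ancestor lookup described in the comment above can be pictured as a simple loop. The following is only an illustrative sketch of roughly what `Element.closest()` does, not the browser's implementation; `closestSketch` and `isInsideExcluded` are hypothetical names introduced here.

```javascript
// Hypothetical re-implementation of the idea behind element.closest(selector):
// test the element itself, then each ancestor element in turn.
function closestSketch(element, selector) {
  for (let current = element; current; current = current.parentElement) {
    if (current.matches(selector)) return current; // first match wins
  }
  return null; // no element on the way up matched
}

// A text node is "inside" an excluded tree if its parent element, or any
// ancestor of that parent, matches the selector. Depth does not matter,
// which is why children of children are filtered out as well.
function isInsideExcluded(textNode, selector) {
  const parent = textNode.parentElement;
  return parent ? closestSketch(parent, selector) !== null : false;
}
```

Because the loop keeps climbing until it runs out of ancestors, text inside `<pre><p>…</p></pre>` is caught by the `pre` part of the selector even though its direct parent is the `p` element.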
