In simple terms:
I want to get all the text in an HTML document, excluding specific elements (such as 'pre' or 'script') and all of their descendants.
From what I found on Google, querySelectorAll is inefficient for this and TreeWalker is the most efficient option. Is that right?
The problem with my code is that it filters out the text directly inside the specified elements, but not the text inside their children.
I implemented a function that uses JavaScript to collect all the text nodes in an HTML page.
I want to exclude some elements from the results, such as "pre", or "div" elements with particular class names.
My filter removes the text directly inside these elements, but the text inside their child elements is still returned, and I can't find a way to exclude it.
What should I do?
I got my inspiration from this page: getElementsByTagName() equivalent for textNodes
Documentation for document.createTreeWalker:
https://developer.mozilla.org/en-US/docs/Web/API/Document/createTreeWalker#parameters
<!DOCTYPE html>
<html>
<head>
<script>
function nativeTreeWalker() {
  var walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT,
    {
      acceptNode: function (node) {
        // Reject text whose direct parent is one of these elements.
        // Problem: text inside their child elements is still returned.
        if (['STYLE', 'SCRIPT', 'PRE'].includes(node.parentElement?.nodeName)) {
          return NodeFilter.FILTER_REJECT;
        }
        // Accept only text nodes that are not empty or whitespace-only.
        if (!/^\s*$/.test(node.data)) {
          return NodeFilter.FILTER_ACCEPT;
        }
        // Whitespace-only text: leave it out of the results.
        return NodeFilter.FILTER_SKIP;
      }
    },
    true // I hoped this would skip child elements, but it has no effect
  );
  var node;
  var textNodes = [];
  while ((node = walker.nextNode())) {
    textNodes.push(node.nodeValue);
  }
  return textNodes;
}
window.onload = function () {
  console.log(nativeTreeWalker());
};
</script>
</head>
<body>
get the text
<p> </p>
<div>This is text, get</div>
<p>This is text, get too</p>
<pre>
This is code,Don't get
<p>this is code too, don't get</p>
</pre>
<div class="this_is_code">
This is className is code, Don't get
<span>this is code too, don't get</span>
</div>
</body></html>
The output I expect from the code above is:
0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
length: 3
But the actual output is:
0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
3: "this is code too, don't get"
4: "\n This is className is code, Don't get\n "
5: "this is code too, don't get"
length: 6
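One direction that might work, as a rough sketch rather than a confirmed solution: with NodeFilter.SHOW_TEXT alone, the filter only ever sees text nodes, so returning FILTER_REJECT for a text node cannot prune the elements next to it. If the walker is also shown elements (SHOW_ELEMENT | SHOW_TEXT), FILTER_REJECT on an element skips its whole subtree, while FILTER_SKIP leaves the element out of the results but still visits its children. The names acceptNode, collectTextNodes, EXCLUDED_TAGS and EXCLUDED_CLASSES below are made up for illustration, and the class name is taken from the sample page. The constants are the standard numeric values of the NodeFilter and Node constants, written out so the filter can also be tried outside a browser.

```javascript
// Standard DOM constant values, spelled out so this also loads outside a browser.
const FILTER_ACCEPT = 1;  // NodeFilter.FILTER_ACCEPT
const FILTER_REJECT = 2;  // NodeFilter.FILTER_REJECT
const FILTER_SKIP = 3;    // NodeFilter.FILTER_SKIP
const SHOW_ELEMENT = 0x1; // NodeFilter.SHOW_ELEMENT
const SHOW_TEXT = 0x4;    // NodeFilter.SHOW_TEXT
const ELEMENT_NODE = 1;   // Node.ELEMENT_NODE

// Hypothetical configuration, based on the sample page in the question.
const EXCLUDED_TAGS = ['STYLE', 'SCRIPT', 'PRE'];
const EXCLUDED_CLASSES = ['this_is_code'];

function acceptNode(node) {
  if (node.nodeType === ELEMENT_NODE) {
    const excluded =
      EXCLUDED_TAGS.includes(node.nodeName) ||
      EXCLUDED_CLASSES.some((name) => node.classList.contains(name));
    // REJECT drops the element and its entire subtree;
    // SKIP omits the element itself but still walks into its children.
    return excluded ? FILTER_REJECT : FILTER_SKIP;
  }
  // Text node: keep it unless it is empty or whitespace-only.
  return /^\s*$/.test(node.data) ? FILTER_SKIP : FILTER_ACCEPT;
}

function collectTextNodes(root) {
  const walker = document.createTreeWalker(
    root,
    SHOW_ELEMENT | SHOW_TEXT,
    { acceptNode }
  );
  const texts = [];
  let node;
  // Only text nodes are ever accepted, so nextNode() returns text nodes only.
  while ((node = walker.nextNode())) {
    texts.push(node.nodeValue);
  }
  return texts;
}

// Browser-only usage; the guard keeps the snippet harmless elsewhere.
if (typeof document !== 'undefined') {
  console.log(collectTextNodes(document.body));
}
```

On the sample page, this should leave out the text inside pre and div.this_is_code, including the nested p and span, because those elements are rejected before the walker descends into them.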