How to find all strings on a page?

Question

I almost managed to do what I want, but there is a small flaw.

I have this HTML

<body>
  <div>
    <div>div</div>
  </div>

  <h1>
    <h2>
      <p>p1</p>

      <p>
        <p>p2</p>
      </p>
    </h2>

    <h3>
      <h2>h2</h2>
      <h2>h2</h2>
    </h3>
  </h1>

  <span>span</span>
  <h6>
    <h6>h6</h6>
  </h6>
</body>

And my last attempt gives me almost the array I want

var elements = Array.from(document.body.getElementsByTagName("*"));
var newStrings = [];

for (var i = 0; i < elements.length; i++) {
  const el = elements[i];
  if (el.innerText.length !== 0) {
    newStrings.push(el.innerText);
  }
}

console.log(newStrings); //  ['div', 'div', 'p1\n\np2', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']

but as a result I need ['div', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']

I will be very grateful for your help!

score 2 · Answer 1 · answered Jan 08 '23 at 22:28

A reliable way to get all the strings on the page is to select every text node and read the text content of each. Because each text node is visited exactly once, you avoid the duplicates that appear when you read the innerText of both a parent and its child.

Here is one way to select all the text nodes in a page (adapted from https://stackoverflow.com/a/10730777/19461620):

const textNodes = [];
// Walk only the text nodes under <body>.
const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
let n;
while ((n = walker.nextNode())) textNodes.push(n);
// Drop whitespace-only nodes (indentation and line breaks between tags).
const newStrings = textNodes
  .map(textNode => textNode.textContent)
  .filter(text => text.trim() !== '');
console.log(newStrings); // outputs: ['div', 'p1', 'p2', 'h2', 'h2', 'span', 'h6']
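
For comparison, the same idea can be written as a plain recursive walk instead of a TreeWalker. This is only a minimal sketch, not the linked answer's code: collectTextStrings, mockEl and mockText are hypothetical names, and the walk relies on nothing but the standard nodeType, nodeValue and childNodes properties, so it can be tried on plain objects outside a browser. Unlike the snippet above, it also trims each string.

```javascript
// Hypothetical helper: gathers trimmed, non-empty strings from every text node
// under `node`, in document order.
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM

function collectTextStrings(node, out = []) {
  if (node.nodeType === TEXT_NODE) {
    const value = node.nodeValue.trim();
    if (value !== '') out.push(value);
  } else {
    for (const child of Array.from(node.childNodes || [])) {
      collectTextStrings(child, out);
    }
  }
  return out;
}

// Plain-object stand-ins (assumption, for illustration only) for
// <div><div>div</div></div> followed by <span>span</span>.
const mockEl = (...childNodes) => ({ nodeType: 1, childNodes });
const mockText = (nodeValue) => ({ nodeType: 3, nodeValue, childNodes: [] });
const mockBody = mockEl(
  mockEl(mockText('\n  '), mockEl(mockText('div')), mockText('\n')),
  mockText('\n'),
  mockEl(mockText('span'))
);

console.log(collectTextStrings(mockBody)); // ['div', 'span']
```

In a browser, calling collectTextStrings(document.body) should work the same way, though it would also pick up text inside <script> and <style> elements unless those are skipped.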

score 1 · Accepted Answer · answered Jan 08 '23 at 22:30

Try this; you should get the desired output:

function getInnerText() {
  const elements = document.querySelectorAll("*");

  const innerTexts = [];

  for (let element of elements) {
    const innerText = element.innerText;

    if (innerText && innerText.length > 0 && innerText.trim().length > 0) {
      innerTexts.push(innerText);
    }
  }

  // innerTexts[0] belongs to the root <html> element and already contains
  // all rendered text, so split it into lines and drop the empty ones.
  return innerTexts[0].split('\n').filter(function (el) {
    return el != "";
  });
}

const innerTexts = getInnerText();

console.log(innerTexts);
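
The last step of getInnerText does most of the work: the first element matched by querySelectorAll("*") is the root <html> element, whose innerText already holds all the rendered text, so the function only has to split that one string on newlines. A small sketch of that step on a hard-coded string (the sample value is an assumption modelled on the 'p1\n\np2' entry in the question's output, not captured from a real page):

```javascript
// Assumed sample of what innerText might look like for a few block elements.
const sampleInnerText = 'div\n\np1\n\np2\n\nspan';

// Split on line breaks and drop the empty strings left between blocks.
const lines = sampleInnerText.split('\n').filter((line) => line !== '');

console.log(lines); // ['div', 'p1', 'p2', 'span']
```

Note that this relies on every string being rendered on its own line. Inline siblings that share a line come back as one string, and a single element whose text wraps over explicit line breaks is split into several.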

score 1 · Answer 3 · answered Jan 09 '23 at 03:56

First off, the HTML is invalid. <h1> to <h6> and <p> may only contain phrasing content, which does NOT include <h1> to <h6> or <p>. In the example below, the HTML has been corrected.

Details are commented in the example.

/**
 * Using a nodeIterator to extract all textNodes of a given DOM element.
 * @param {string<selector>|Object<DOM>} tag - Either a CSS selector
 *        or a DOM Object of an element to extract text from. If nothing or
 *        something invalid is passed, @default is document.body.
 * @returns {array} - An array of strings
 */
function getText(tag = document.body) {
  /**
   * If a string is passed, reference with .querySelector().
   * If a valid DOM Object is passed, reference it.
   * If neither, then use @default document.body.
   */
  let root = typeof tag === "string" ? document.querySelector(tag) : tag;
  if (!(root instanceof Node)) root = document.body;
  let result = [], current;
  /**
   * Create a nodeIterator.
   * For details go to: 
   * https://javascript.plainenglish.io/what-is-the-javascript-nodeiterator-api-c4443b79b492
   * @param {Object<DOM>} root - Start extracting text from this node.
   * @param {Object<NodeFilter>} whatToShow - Built-in filter.
   * @param {function} filter - A custom filter.
   * @returns {NodeIterator} - An iterator over the matching nodes.
   */
  const itr = document.createNodeIterator(
    root, 
    NodeFilter.SHOW_TEXT, // Filters in text.
    (node) => {
      // Filter out <script> and <style> tags.
      if (node.parentElement.tagName === "SCRIPT" || node.parentElement.tagName === "STYLE") {
        return NodeFilter.FILTER_SKIP;
      }
      return NodeFilter.FILTER_ACCEPT;
    }
  );
  // Add each textNode to array
  while (current = itr.nextNode()) {
    result.push(current.nodeValue);
  }
  // Return the trimmed strings, dropping whitespace-only entries.
  return result.flatMap(node => node.trim() || []);
}
console.log(JSON.stringify(getText()));

<!-- Comments show unfiltered results for each textNode with at least one
word character -->
<div>
  <div>div</div> <!-- "div" -->
</div>

<header>
  <h1>
    <i>h1 </i> <!-- "h1 " space after text-->
  </h1>
  <h2>
    <u>h2</u> <!-- "h2" zero width spaces can hinder matching -->
  </h2>
  <h3>
    <i>h3
    <!-- "h3\n    " new line and tab after text--> 
    </i>
    <q> h3 </q> <!-- " h3 " space before and after text -->
  </h3>
</header>

<span>  span</span> <!-- "  span" tab before text -->
<h6>
  <u>
  h6</u> <!-- "\n  h6" new line and tab before text -->
</h6>
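
The final line of getText leans on a compact idiom: flatMap flattens any array returned by its callback, so returning an empty array for a blank string removes that entry, while a non-empty trimmed string is kept as is. A minimal sketch on hard-coded values (the sample strings are assumptions modelled on the unfiltered results noted in the HTML comments above):

```javascript
// Assumed raw nodeValue strings, including whitespace-only text nodes.
const rawValues = ['\n  ', 'h1 ', '\n    ', ' h3 ', '  span', '\n  h6'];

// trim() yields '' for whitespace-only strings; '' is falsy, so `|| []`
// swaps in an empty array, which flatMap flattens away.
const cleaned = rawValues.flatMap((value) => value.trim() || []);

console.log(cleaned); // ['h1', 'h3', 'span', 'h6']
```

The same result could be written as .map(v => v.trim()).filter(Boolean), which some readers may find easier to follow.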
