Get text from HTML with appropriate whitespace

Question

It is easy to extract the text from HTML using the jQuery .text() method...

$("<p>This <b>That</b> Other</p>").text() == "This That Other"

But if there is no whitespace between the words/elements, the text gets concatenated...

$("<p>This <b>That</b><br/>Other</p>").text() == "This ThatOther"
Desired: "This That Other"

$("<div><h1>Title</h1><p>Text</p></div>").text() == "TitleText"
Desired: "Title Text"

Is there any way to get all the text from the HTML (using .text() or any other method) so that the examples above come out as desired?

@Shawn - there just needs to be some whitespace... the amount is irrelevant — freefaller, Dec 02 '19 at 18:01
What if you format your HTML with an IDE formatter? Or you could read the HTML in as a string, then run it through an HTML formatter library? — Adam, Dec 02 '19 at 18:18

score 4 · Accepted Answer · answered Dec 02 '19 at 18:15

You can traverse the DOM tree looking for nodes with a nodeType of 3 (text nodes). When you find one, add its trimmed text to an array. For any other node, pass it back into the function to keep looking through its children. Finally, join the collected pieces with a single space.

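function innerText(element) {
  // Recursively collect the trimmed contents of every text node under `element`.
  function getTextLoop(element) {
    const texts = [];
    Array.from(element.childNodes).forEach(node => {
      if (node.nodeType === 3) { // 3 === Node.TEXT_NODE
        const text = node.textContent.trim();
        // Skip whitespace-only nodes so the join does not produce double spaces.
        if (text) texts.push(text);
      } else {
        // Not a text node: descend into its children.
        texts.push(...getTextLoop(node));
      }
    });
    return texts;
  }
  // Put one space between consecutive text nodes.
  return getTextLoop(element).join(' ');
}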

/* EXAMPLES */
const div = document.createElement('div');
div.innerHTML = `<p>This <b>That</b><br/>Other</p>`;
console.log(innerText(div)); // "This That Other"

const div2 = document.createElement('div');
div2.innerHTML = `<div><h1>Title</h1><p>Text</p></div>`;
console.log(innerText(div2)); // "Title Text"
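
One trade-off of joining every text node with a space is that inline markup inside a word gets split: `<p><b>Th</b>is</p>` comes out as "Th is". If that matters, a possible variation is to add a space only around elements that visually break the line. The sketch below is not part of the original answer; `blockAwareText` and the `BREAKING_TAGS` list are hypothetical names, and the tag list is an assumed, incomplete selection that would need extending for real pages.

```javascript
// Assumed, deliberately short list of tags that should act as separators.
const BREAKING_TAGS = new Set([
  'BR', 'DIV', 'P', 'H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'LI', 'TR', 'TD', 'TH',
]);

// Hypothetical helper: concatenates text nodes as-is and inserts a space
// only before and after elements listed in BREAKING_TAGS.
function blockAwareText(root) {
  let out = '';
  (function walk(node) {
    for (const child of Array.from(node.childNodes)) {
      if (child.nodeType === 3) {            // 3 === Node.TEXT_NODE
        out += child.textContent;
      } else if (child.nodeType === 1) {     // 1 === Node.ELEMENT_NODE
        const breaks = BREAKING_TAGS.has(child.nodeName.toUpperCase());
        if (breaks) out += ' ';
        walk(child);
        if (breaks) out += ' ';
      }
      // Other node types (comments etc.) are ignored.
    }
  })(root);
  // Collapse every whitespace run to one space and trim the ends.
  return out.replace(/\s+/g, ' ').trim();
}
```

In a browser this would be called the same way as the examples above, for instance `blockAwareText(div)` after setting `div.innerHTML`. Inline elements such as `<b>` add no space, so a word that is only partly bold stays in one piece.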

We have a winner... that's exactly what I needed. Was in the middle of trying to write something similar, but you got there faster with a much nicer bit of code. Thanks — freefaller, Dec 02 '19 at 18:19

score 0 · Answer 2 · answered Dec 02 '19 at 18:05

If you are just worried about br tags, you can replace them with a text node.

var elem = document.querySelector("#text")
var clone = elem.cloneNode(true)
clone.querySelectorAll("br").forEach( function (br) {
  var space = document.createTextNode(' ')
  br.replaceWith(space)
})
var cleanedText = clone.textContent.trim().replace(/\s+/g,' ');
console.log(cleanedText)

<div id="text">
  <p>This <b>That</b><br/>Other</p>
</div>
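
One detail worth checking in the snippet above is the whitespace-collapsing step: `String.prototype.replace` with a non-global regex only replaces the first match, so collapsing every run of whitespace needs the `g` flag. A minimal illustration on a plain string (the sample value `raw` is made up for this example):

```javascript
// Made-up sample with several whitespace runs, including a newline.
const raw = '  This   That \n Other  ';

// Without the g flag only the first whitespace run is replaced.
const firstOnly = raw.trim().replace(/\s+/, ' ');

// With the g flag every run collapses to a single space.
const allRuns = raw.trim().replace(/\s+/g, ' ');

console.log(JSON.stringify(firstOnly)); // "This That \n Other"
console.log(JSON.stringify(allRuns));   // "This That Other"
```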

epascarello

It's not just `<br>` tags... it's anything where elements are "joined"... such as the extra example I added: `<h1>Title</h1><p>Text</p>`. Sarvesh has got me thinking recursion over children is potentially the way to go – freefaller Dec 02 '19 at 18:07
basic tree walker – epascarello Dec 02 '19 at 18:08
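
Whatever the "basic tree walker" comment originally pointed to is not preserved here. One possible reading is the DOM `TreeWalker` API, which visits nodes in document order without hand-written recursion. The sketch below is an assumption along those lines, not epascarello's code; `textViaTreeWalker` is a hypothetical name.

```javascript
// Hypothetical helper: gathers trimmed text nodes using a DOM TreeWalker.
function textViaTreeWalker(root) {
  // SHOW_TEXT makes the walker visit text nodes only, in document order.
  const walker = root.ownerDocument.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const parts = [];
  while (walker.nextNode()) {
    const text = walker.currentNode.textContent.trim();
    if (text) parts.push(text); // skip whitespace-only nodes
  }
  return parts.join(' ');
}

// Browser-only usage; skipped where no DOM is available.
if (typeof document !== 'undefined') {
  const div = document.createElement('div');
  div.innerHTML = '<div><h1>Title</h1><p>Text</p></div>';
  console.log(textViaTreeWalker(div)); // "Title Text"
}
```

Like the recursive version, this joins every text node with a space, so it shares the same behaviour for inline elements inside a word.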
