68

I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:

function extractContent(value) {
  var content_holder = "";

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder += value.charAt(i);
      }
    }

  }
  console.log(content_holder);
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");

The problem is that nothing gets printed on the console (*content_holder* stays empty). I think the problem is caused by the === operator.

Penny Liu
Toshkuuu
  • 3
    Your `while` loop is never reached due to the `continue` instruction. – Arnaud Christ Mar 06 '15 at 13:13
  • Try tracing through your code with a "debugger"--did you do that? –  Mar 06 '15 at 13:22
  • Possible duplicate of [JS: Extract text from a string without jQuery](https://stackoverflow.com/questions/17776680/js-extract-text-from-a-string-without-jquery) – Rehan Haider May 11 '18 at 07:42
  • also similar: https://stackoverflow.com/questions/10585029/parse-an-html-string-with-js – Akber Iqbal Jun 14 '18 at 06:30
  • Does this answer your question? [Get the pure text without HTML element by javascript?](https://stackoverflow.com/questions/6743912/get-the-pure-text-without-html-element-by-javascript) – KyleMit Jan 11 '20 at 20:08
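
As the first comment notes, the `continue` makes the `while` loop unreachable, so nothing is ever appended. A minimal sketch of a character scan that avoids this is shown here. It assumes the input has no `<` or `>` inside attribute values or text, and the name `extractContentByScan` is made up for illustration:

```javascript
// Hypothetical helper: walks the string once and tracks whether the
// current character sits inside a tag.
function extractContentByScan(value) {
  var content = "";
  var insideTag = false;

  for (var i = 0; i < value.length; i++) {
    var ch = value.charAt(i);
    if (ch === '<') {
      insideTag = true;       // a tag starts here
    } else if (ch === '>') {
      insideTag = false;      // the tag ends here
    } else if (!insideTag) {
      content += ch;          // plain text between tags
    }
  }
  return content;
}

console.log(extractContentByScan("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C
```

Unlike the DOM-based answers below, this sketch does not decode entities such as `&amp;`.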

11 Answers

127

Create an element, store the HTML in it, and get its textContent:

function extractContent(s) {
  var span = document.createElement('span');
  span.innerHTML = s;
  return span.textContent || span.innerText;
};
    
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));

Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:

function extractContent(s, space) {
  var span= document.createElement('span');
  span.innerHTML= s;
  if(space) {
    var children= span.querySelectorAll('*');
    for(var i = 0 ; i < children.length ; i++) {
      if(children[i].textContent)
        children[i].textContent+= ' ';
      else
        children[i].innerText+= ' ';
    }
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g,' ');
};
    
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>"));

console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>.  Nice to <em>see</em><strong><em>you!</em></strong>",true));
thedayturns
Rick Hitchcock
75

One line (more precisely, one statement) version:

function extractContent(html) {
    return new DOMParser()
        .parseFromString(html, "text/html")
        .documentElement.textContent;
}
DollarAkshay
  • 1
    nice answer +1, but what is the difference between your answer and `Rick Hitchcock` answer – Sharique Ansari Mar 06 '15 at 14:04
  • 1
    @shariqueansari, `DOMParser` is "experimental technology" but likely to be added to the spec. Its HTML support works in IE10+. My original answer worked in IE9+, but I've now updated it to support IE8. – Rick Hitchcock Mar 06 '15 at 14:53
  • 1
    DOMParser now has wide support, see https://caniuse.com/#search=domparser – Optimae Jun 29 '18 at 01:02
  • 2
    hoped this would work on nodejs but it doesnt. ended up using https://www.npmjs.com/package/html2plaintext – Flion Jan 14 '19 at 08:57
  • Can we use this method to extract some content by id, as with document.getElementById? – Hamid Araghi Mar 29 '19 at 09:03
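
On the last question: `parseFromString` returns a full `Document`, so the usual lookup methods such as `getElementById` are available on it. This is a hedged sketch for browsers only; `extractTextById` and its `id` parameter are hypothetical names, not part of the answer above:

```javascript
// Hypothetical helper: returns the text of one element, looked up by id,
// from an HTML string. Requires a browser (DOMParser is not built into Node.js).
function extractTextById(html, id) {
  // Parse into a detached Document; nothing is added to the page
  var doc = new DOMParser().parseFromString(html, "text/html");
  var el = doc.getElementById(id);
  // Return null when no element with that id exists
  return el ? el.textContent : null;
}
```

In a browser, `extractTextById("<p id='greeting'>Hello</p><a href='http://w3c.org'>W3C</a>", "greeting")` would give back only the text of the `p` element.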
36

textContent is a very good technique for achieving the desired result, but sometimes we don't want to load the DOM. A simple workaround is the following regular expression:

let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
Mubeen Khan
  • I know this is a very old comment, but could you please explain the meaning of the expression /<[^>]+>/g ? I'm having trouble understanding what each individual character means. – Kelly Jul 30 '19 at 13:59
  • @Kelly The symbols you are referring to are a *regular expression*. It's kind of like a mini-programming language for parsing text. Here's a link to where you can learn more about each symbol: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions – Kade Dec 27 '19 at 20:46
  • It essentially says to find and remove each *<* that has stuff that is not a *>* between it and a *>*. – Kade Dec 27 '19 at 20:54
  • most helpful, regex, one of the best tool/mini-language for coders. – GD- Ganesh Deshmukh May 06 '20 at 11:52
  • Different technique for different cases, and this is the right approach for my case, Telegram's bot development that require no innerHTML or something that required in web development. – hanism Dec 09 '20 at 01:11
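
To make the explanation in the comments concrete, here is the same expression with each part annotated. It is only a sketch; `stripTags` is a name chosen for illustration. The second call shows one limitation worth knowing: because no DOM is involved, character entities are left as written.

```javascript
// Hypothetical helper wrapping the regular expression from the answer.
function stripTags(html) {
  //  <       a literal opening angle bracket
  //  [^>]+   one or more characters that are NOT '>'
  //  >       the closing angle bracket
  //  g       global flag: replace every match, not just the first
  return html.replace(/<[^>]+>/g, '');
}

console.log(stripTags("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // HelloW3C
console.log(stripTags("<p>Tom &amp; Jerry</p>"));                       // Tom &amp; Jerry
```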
8

Use this regex to remove the HTML tags and keep only the inner text. For the example in the question it yields HelloW3C:

var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
Rana Ahmer Yasin
  • please give me a reason please? – Rana Ahmer Yasin Mar 06 '15 at 13:23
  • 2
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 –  Mar 06 '15 at 13:25
  • 1
    If you are going to use regexp, then a simpler version would be `/<[\s\S]*?>/`, or `/<[^]*?>/`. Your `m` flag accomplishes nothing; it relates to the behavior of `^` and `$`. –  Mar 06 '15 at 14:03
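
The point in the last comment can be checked with a short sketch. The sample string is made up and contains a tag that spans two lines. All three patterns give the same result, which suggests that the `m` flag plays no role here:

```javascript
// A tag that spans two lines (illustrative input, not from the question)
var html = "<p\n   class='intro'>Hello</p>";

var original = html.replace(/<(?:.|\n)*?>/gm, ''); // pattern from the answer
var withoutM = html.replace(/<(?:.|\n)*?>/g, '');  // same pattern, no m flag
var simpler  = html.replace(/<[\s\S]*?>/g, '');    // variant from the comment

console.log(original === withoutM && withoutM === simpler); // true
console.log(simpler);                                       // Hello
```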
2

Try this:

<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
function extractContent(value){
        var div = document.createElement('div')
        div.innerHTML=value;
        var text= div.textContent;            
        return text;
}
window.onload=function()
{
   alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
};
</script>
</body>
</html>
Sharique Ansari
2

For Node.js

This uses the jsdom library, since Node.js doesn't provide the DOM features a browser does.

import * as jsdom from "jsdom";

const html = "<h1>Testing</h1>";
// textContent is null on a Document node, so read it from the body element
const text = new jsdom.JSDOM(html).window.document.body.textContent;

console.log(text);
Abraham
0

You could temporarily write it out to a block-level element that is positioned off the page, something like this:

HTML:

<div id="tmp" style="position:absolute;top:-400px;left:-400px;">
</div>

JavaScript:

<script type="text/javascript">
function extractContent(value){
        var div=document.getElementById('tmp');
        div.innerHTML=value;
        console.log(div.children[0].innerHTML); // logs the inner HTML of the first child (the p)
}

extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
</script>
Adam MacDonald
  • 1
    Right approach, but you don't need an element in the DOM to do this. Just create an element with `var div = document.createElement('div')` and proceed from there. –  Mar 06 '15 at 13:12
  • Also, this will fail with nested HTML elements, such as a `p` whose text "Hello" is followed by a nested element containing "Bob". It will retain the markup inside the `p` element. –  Mar 06 '15 at 13:23
0

Using jQuery, we can pass a comma-separated list of tag names as the selector:

var readableText = [];
$("p, h1, h2, h3, h4, h5, h6").each(function(){ 
     readableText.push( $(this).text().trim() );
})
console.log( readableText.join(' ') );
Joy
0

Use the match() function to pull out the HTML tags:

const text = `<div>Hello World</div>`;
console.log(text.match(/<[^>]*?>/g));
Deepak Singh
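
Note that `match()` with this pattern returns the tags themselves rather than the text between them. A small sketch of the difference follows. Using `split()` to get at the text is an assumption about what is wanted here, not something the answer states:

```javascript
const text = `<div>Hello World</div>`;

// match() collects every substring that looks like a tag
const tags = text.match(/<[^>]*?>/g);
console.log(tags); // [ '<div>', '</div>' ]

// split() on the same pattern keeps what lies between the tags;
// filter(Boolean) drops the empty strings left at the edges
const pieces = text.split(/<[^>]*?>/).filter(Boolean);
console.log(pieces.join(' ')); // Hello World
```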
0

Based on Rick Hitchcock's answer and KevBot's, this is the best way I found to do it:

function getTextLoop(element: HTMLElement | ChildNode) {
  const texts = [];
  Array.from(element.childNodes).forEach((node) => {
    if (node.nodeType === 3) {
      texts.push(node.textContent.trim());
    } else {
      texts.push(...getTextLoop(node));
    }
  });
  return texts;
}

function innerText(element: HTMLElement) {
  return getTextLoop(element).join(" ");
}

export function extractContent(s, space) {
  var span = document.createElement("span");
  span.innerHTML = s;
  if (space) {
    span.innerHTML = innerText(span);
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, " ");
}

Example:

extractContent("<div>foo<div>bar</div></div>", true); // foo bar
Bardelman
-2

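You need an array to hold the values:

function extractContent(value) {
  var content_holder = new Array();

  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      i++; // step past the '>' (the original `continue` here skipped the loop below)
      // collect characters until the next tag opens or the string ends
      while (i < value.length && value.charAt(i) != '<') {
        content_holder.push(value.charAt(i));
        i++;
      }
    }
  }
  console.log(content_holder.join('')); // HelloW3C
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");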
Dane