14

Let's say I have the following:

var cheerio = require('cheerio');
var $ = cheerio.load('<html><body><ul><li>One</li><li>Two</li></ul></body></html>');

var t = $('html').find('*').contents().filter(function() {
  return this.type === 'text';
}).text(); 

I get:

OneTwo

Instead of:

One Two

This is the same result I get from $('html').text(). What I need is a way to insert a separator, such as a space or \n, between the text nodes.

Note: this is not a jQuery front-end question; it is a Node.js back-end issue involving Cheerio and HTML parsing.

Crisboot

4 Answers

25

This seems to do the trick:

var t = $('html *').contents().map(function() {
    return (this.type === 'text') ? $(this).text() : '';
}).get().join(' ');

console.log(t);

Result:

One Two

Just improved my solution a little bit:

var t = $('html *').contents().map(function() {
    return (this.type === 'text') ? $(this).text()+' ' : '';
}).get().join('');
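
The improved version appends a space after every text node, so its result ends with a trailing space. One possible way to avoid that, and to support other separators such as \n, is to clean the pieces before joining them. The sketch below assumes the pieces are the plain strings that .get() returns; joinText is a hypothetical helper name, not part of Cheerio.

```javascript
// Hypothetical helper: joins text pieces with a separator after
// collapsing internal whitespace and dropping empty pieces.
function joinText(pieces, separator = " ") {
  return pieces
    .map((piece) => piece.replace(/\s+/g, " ").trim())
    .filter((piece) => piece.length > 0)
    .join(separator);
}

// Stand-in for the array produced by .map(...).get() in the answer above.
console.log(joinText(["One", "\n  ", "Two "])); // One Two
console.log(JSON.stringify(joinText(["One", "Two"], "\n"))); // "One\nTwo"
```

With the answer's code, this would presumably be called on the result of .get() in place of .join(' ').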
Crisboot
  • This solution works well until there is inline JavaScript within the page body, which it pulls in for some reason. Any idea how to resolve this? – tremor Sep 11 '20 at 17:43
  • @tremor I think you could use `.prop("innerText")` instead of `.text()`. – Michael Haar Oct 15 '22 at 07:02
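
Regarding the inline JavaScript issue raised in the comments: the script source is itself stored as a text node inside the script element, so a filter on this.type === 'text' picks it up. A rough sketch of one workaround is to walk the tree and skip script and style elements. It assumes nodes expose type, name, data, and children the way Cheerio's parsed nodes do; collectText is a hypothetical helper, and the hand-built tree stands in for a parsed document so the example runs without Cheerio.

```javascript
// Hypothetical helper: collects trimmed, non-empty text from a node tree,
// skipping everything inside <script> and <style> elements.
function collectText(node, out = []) {
  if (node.type === "text") {
    const text = node.data.trim();
    if (text) out.push(text);
  } else if (node.name !== "script" && node.name !== "style") {
    // Nodes without children (comments, directives) fall through harmlessly.
    for (const child of node.children || []) collectText(child, out);
  }
  return out;
}

// Hand-built stand-in for a parsed <body> (assumed node shape).
const body = {
  type: "tag", name: "body", children: [
    { type: "tag", name: "ul", children: [
      { type: "tag", name: "li", children: [{ type: "text", data: "One" }] },
      { type: "tag", name: "li", children: [{ type: "text", data: "Two" }] },
    ] },
    { type: "script", name: "script", children: [{ type: "text", data: "var x = 1;" }] },
  ],
};

console.log(collectText(body).join(" ")); // One Two
```

With Cheerio, the argument would presumably be a parsed node such as $('body')[0] rather than the hand-built object.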
4

You can use the TextVersionJS package to generate a plain-text version of an HTML string. It works both in the browser and in Node.js.

var createTextVersion = require("textversionjs");

var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Install it from npm and, for browser use, bundle it with Browserify, for example.

Balint
3

You can use the following function to extract the text from an HTML string, with the pieces separated by spaces:

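import * as cheerio from 'cheerio';

// CheerioStatic is the type provided by older @types/cheerio typings;
// cheerio 1.x names the equivalent type CheerioAPI.
function extractTextFromHtml(html: string): string {
  const cheerioStatic: CheerioStatic = cheerio.load(html || '');

  // Keep only non-empty text nodes and join them with single spaces.
  return cheerioStatic('html *').contents().toArray()
    .map(element => element.type === 'text' ? cheerioStatic(element).text().trim() : null)
    .filter(text => text)
    .join(' ');
}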
  • 1
    ...content().toArray().map(element => {}) . After applying toArray() to the contents, it worked for me. Thanks! – iconique Apr 30 '20 at 02:39
0

The existing solutions are fairly indirect: they select "*", call .contents(), and filter for text nodes. It is more straightforward to select the elements we want directly, map them to their text, and then optionally join on spaces:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = "<ul><li>One</li><li>Two</li>";
const $ = cheerio.load(html);
const data = $("li").get().map(e => $(e).text().trim()).join(" ");
console.log(data); // => One Two
ggorlen