
I am using cheerio to parse HTML into separate nodes. I can easily do $("*"), but this returns only element nodes, not the text nodes between them. Let's consider three user inputs:

One:

text only

I need: single text node.

Two:

<div>
  text 1
  <div>
    inner text
  </div>
  text 2
</div>

I need: text node + div node + text node in same sequence.

Three:

<div>
  <div>
    inner text 1
    <div>
      inner text 2
    </div>
  </div>
  <div>
    inner text 3
  </div>
</div>

I need: 2 div nodes

Is this possible?

Rehmat
  • For those looking for just the top-level text nodes in a tag, see [How to get a text that's separated by different HTML tags in Cheerio](https://stackoverflow.com/a/73692854/6243352) – ggorlen Sep 12 '22 at 18:11

3 Answers


In the hope that this helps someone: .contents() returns text nodes as well as element nodes, so no extra filtering is needed to see them. I got help from this answer: https://stackoverflow.com/a/6520267/3800042

var cheerio = require("cheerio");
var $ = cheerio.load(tree); // `tree` is the HTML input string shown below
var iterate = function(node, level) {
  if (typeof level === "undefined") level = "--";
  // .contents() returns every child node, including text and comment nodes
  var list = $(node).contents();
  for (var i = 0; i < list.length; i++) {
    var item = list[i];
    console.log(level, "(" + i + ")", item.type, $(item).text());
    iterate(item, level + "--");
  }
};
iterate($.root());

HTML input

<div>
  text 1
  <div>
    inner text
  </div>
  text 2
</div>

Result

-- (0) tag 

  text 1



    inner text



  text 2


---- (0) text 

  text 1



---- (1) tag 

    inner text



------ (0) text 

    inner text



---- (2) text 

  text 2
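
The output above is hard to read because whitespace-only text nodes and untrimmed text are printed as-is. Below is a minimal sketch of the same depth-first walk that trims text and skips whitespace-only nodes. To keep it dependency-free, `tree` is a hand-built stand-in for the parsed input (plain objects with the `type`, `data`, `name` and `children` fields seen on cheerio nodes), and `outline` is a hypothetical helper name, not part of cheerio.

```javascript
// Hand-built stand-in for the parsed "HTML input" above (an assumption:
// real cheerio nodes carry more fields, but these are the ones used here).
const tree = [
  {
    type: "tag",
    name: "div",
    children: [
      { type: "text", data: "\n  text 1\n  " },
      { type: "tag", name: "div", children: [{ type: "text", data: "\n    inner text\n  " }] },
      { type: "text", data: "\n  text 2\n" },
    ],
  },
];

// Walk the nodes depth-first, trimming text and skipping whitespace-only text nodes.
function outline(nodes, level = "--") {
  const lines = [];
  nodes.forEach((node) => {
    if (node.type === "text") {
      const text = node.data.trim();
      if (text) lines.push(`${level} text: ${text}`);
    } else {
      lines.push(`${level} ${node.type}: ${node.name}`);
      lines.push(...outline(node.children || [], level + "--"));
    }
  });
  return lines;
}

const lines = outline(tree);
console.log(lines.join("\n"));
// -- tag: div
// ---- text: text 1
// ---- tag: div
// ------ text: inner text
// ---- text: text 2
```

With cheerio, something like `$.root().contents().toArray()` could presumably be passed in place of `tree`. Depending on the cheerio version, the parsed document may be wrapped in html, head and body elements first.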
Rehmat
  • "`filter` function seems to return text nodes also." - This is NOT true. The filter line in your code does nothing: `.filter(function() { return true; });`. You need to filter out non-text types: `.filter(function() { return this.nodeType == Node.TEXT_NODE; });` – Antoine Dahan Jan 08 '21 at 21:49
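
A rough sketch of the filtering idea from the comment above, for anyone running this under Node.js, where the browser's `Node` global is not defined. It checks the `type` field that the answer's own output prints ("text" vs "tag"). The `children` array is a hand-built stand-in for what `.contents()` would return, and `isTextNode` is an illustrative helper name, not a cheerio API.

```javascript
// Hypothetical stand-ins for cheerio nodes: text nodes carry `type: "text"` and
// `data`; element nodes carry `type: "tag"`, `name` and `children`.
const children = [
  { type: "text", data: "\n  text 1\n  " },
  { type: "tag", name: "div", children: [{ type: "text", data: "\n    inner text\n  " }] },
  { type: "text", data: "\n  text 2\n" },
];

// Keep only text nodes (illustrative helper, not part of cheerio).
const isTextNode = (node) => node.type === "text";

const textOnly = children.filter(isTextNode).map((node) => node.data.trim());
console.log(textOnly); // [ 'text 1', 'text 2' ]
```

With cheerio itself, the same predicate could be passed to its own filter, for example `$(node).contents().filter((_, n) => isTextNode(n))`, since cheerio's filter callback receives the index first and the node second.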

I hope the following code helps.

const cheerio = require("cheerio");
const htmlText = `<ul id="fruits">
  <!--This is a comment.-->
  <li class="apple">Apple</li>
  Peach
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>`;

const $ = cheerio.load(htmlText);
const contents = $('ul#fruits').contents();
console.log(contents.length); // 9, since whitespace-only text nodes like '\n  ' are included
console.log(/^\s*$/.test('\n ')); // true: the regex matches whitespace-only strings

// Returns true only for text nodes that contain nothing but whitespace.
function isWhitespaceTextNode(node) {
    return node.type === 'text' && /^\s*$/.test(node.data);
}

// Note: filter here is the method provided by cheerio, not Array.prototype.filter;
// its callback receives (index, node).
const nonWhitespaceTextContents = contents.filter((_, node) => !isWhitespaceTextNode(node));
console.log(nonWhitespaceTextContents.length); // 5, since whitespace-only text nodes are excluded
nonWhitespaceTextContents.each((_, node) => console.log(node));
//[comment node]
//[li node] apple
//[text node] peach
//[li node] orange
//[li node] pear
TTY112358

If you want all of the immediate children of a node, both text nodes and tag nodes, use .contents() and filter out whitespace-only text nodes.

Here's the code running on your examples:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const tests = [
  // added a div container to make the parent selector consistent
  `<div>text only</div>`,

  `<div>
    text 1
    <div>
      inner text
    </div>
    text 2
  </div>`,

  `<div>
    <div>
      inner text 1
      <div>
        inner text 2
      </div>
    </div>
    <div>
      inner text 3
    </div>
  </div>`
];

tests.forEach(html => {
  const $ = cheerio.load(html);
  const result = [...$("div").first().contents()]
    .filter(e => e.type !== "text" || $(e).text().trim())

    // the following is purely for display purposes
    .map(e => e.type === "text" ? $(e).text().trim() : e.tagName);

  console.log(result);
});

Output:

[ 'text only' ]
[ 'text 1', 'div', 'text 2' ]
[ 'div', 'div' ]

If you only want the text nodes and not the tags, see [How to get a text that's separated by different HTML tags in Cheerio](https://stackoverflow.com/a/73692854/6243352).

ggorlen