4

HTML:

<div class="someclass">
    <h3>First</h3> 
    <strong>Second</strong> 
    <hr>
    Third
    <br>
    Fourth
    <br>
    <em></em>
    ...
</div>

From the div node above, I want to get all child text nodes that come after the hr ("Third", "Fourth", ... and there may be more).

If I do

document.querySelectorAll('div.someclass>hr~*')

I get NodeList [ br, br, em, ... ], with no text nodes.

With the following

document.querySelector('div.someclass').textContent

I get the text of all nodes as a single string.

I can get each text node individually:

var third = document.querySelector('div.someclass').childNodes[6].textContent
var fourth = document.querySelector('div.someclass').childNodes[8].textContent

so I tried

document.querySelector('div.someclass').childNodes[5:]  // SyntaxError

and slice()

document.querySelector('div.someclass').childNodes.slice(5)  // TypeError

So is there a way to get all child text nodes that follow the hr node?

UPDATE

I forgot to mention that this question is about web scraping, not web development, so I cannot change the HTML source code.

BoltClock
Andersson

3 Answers

3

You can read the container's innerHTML, split it on '<hr>' to keep only the markup that follows the hr, and assign that markup to a separate hidden div. You can then loop over that div's child nodes to read the content:

var content = document.querySelector('.someclass').innerHTML;
content = content.split('<hr>');
content = content[1];

document.querySelector('.hide').innerHTML = content;
/**/

var nodes = document.querySelector('.hide').childNodes;
for (var i = 0; i < nodes.length; i++) {
  console.log(nodes[i].textContent);
}

CSS:

.hide {
  display: none;
}

HTML:

<div class="someclass">
  <h3>First</h3>
  <strong>Second</strong>
  <hr> Third
  <br> Fourth
  <br>
  <em></em> ...
</div>
<div class="hide"></div>
Temani Afif
1

.childNodes includes both text and non-text nodes.

The SyntaxError occurs because JavaScript has no [5:] slice syntax.

The TypeError occurs because a NodeList is array-like but is not an actual array, so slice() is not available directly on childNodes.
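The difference between array-like and array can be shown without a browser. The sketch below uses a plain object as a hypothetical stand-in for a NodeList (the name arrayLike is an assumption for illustration); it has indexed entries and a length, but no Array methods, which is the same shape that makes childNodes.slice(5) fail.

```javascript
// Hypothetical stand-in for a NodeList: indices plus length, no Array methods.
const arrayLike = { 0: 'a', 1: 'b', 2: 'c', length: 3 };

console.log(typeof arrayLike.slice); // "undefined", so arrayLike.slice(1) would throw a TypeError

// Borrow Array.prototype.slice; it works on any object with indices and a length.
const viaCall = Array.prototype.slice.call(arrayLike, 1);

// ES2015 alternative: copy into a real array first, then slice.
const viaFrom = Array.from(arrayLike).slice(1);

console.log(viaCall); // [ 'b', 'c' ]
console.log(viaFrom); // [ 'b', 'c' ]
```

Both results are real arrays, so the usual Array methods are available on them afterwards.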

1) get your NodeList

var nodes = document.querySelector('.someclass').childNodes;

2) Convert the NodeList to an actual array

nodes = Array.prototype.slice.call(nodes);

(Note: in browsers that support ES6 you can write nodes = Array.from(nodes) instead. Modern browsers also support .forEach on NodeList objects, so you can call .forEach directly without converting to an array.)

3) Iterate and collect the text nodes you want

This depends on your own logic, but you can iterate over the nodes and check node.nodeType == Node.TEXT_NODE to tell whether a given node is a text node.

var foundHr = false,
    results = [];
nodes.forEach(el => {
    if (el.tagName == 'HR') { foundHr = true; }
    else if (foundHr && el.nodeType == Node.TEXT_NODE) {
        results.push(el.textContent);
    }
});
console.log(results);
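The same idea can be wrapped in a reusable function. This is only a sketch: the names textAfter, el, txt and mockChildren are hypothetical, and the mock objects stand in for the question's child nodes so the code also runs outside a browser. Unlike the loop above, it trims each string and drops whitespace-only text nodes, which is an added assumption about what a scraper wants.

```javascript
const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM

// Return the trimmed, non-empty text of each text node that comes after
// the first child whose nodeName equals tagName (upper case, e.g. 'HR').
function textAfter(childNodes, tagName) {
  const nodes = Array.from(childNodes); // NodeList (or array) -> real array
  const start = nodes.findIndex(n => n.nodeName === tagName);
  if (start === -1) return []; // no such child
  return nodes
    .slice(start + 1)
    .filter(n => n.nodeType === TEXT_NODE)
    .map(n => n.textContent.trim())
    .filter(text => text !== '');
}

// Hypothetical mock of the question's child nodes, so the sketch also runs in Node.js.
const el = (nodeName, textContent = '') => ({ nodeType: 1, nodeName, textContent });
const txt = textContent => ({ nodeType: TEXT_NODE, nodeName: '#text', textContent });
const mockChildren = [
  txt('\n    '), el('H3', 'First'), txt(' \n    '), el('STRONG', 'Second'),
  txt(' \n    '), el('HR'), txt('\n    Third\n    '), el('BR'),
  txt('\n    Fourth\n    '), el('BR'), txt('\n    '), el('EM'), txt('\n    ...\n'),
];

console.log(textAfter(mockChildren, 'HR')); // [ 'Third', 'Fourth', '...' ]
```

In a browser the call would be textAfter(document.querySelector('div.someclass').childNodes, 'HR').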
mattpr
0

You can collect all text nodes under node (the container element, e.g. the div.someclass element) with a TreeWalker:

var walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT, null, false);
var textNode;
var result = [];
while (textNode = walker.nextNode()) {
    result.push(textNode);
}

You now have an Array of text nodes, so you can slice() it as you wish. The index depends on the markup; here, 5 skips the text nodes that precede the hr:

console.log(result.slice(5));
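The index passed to slice() depends on how many text nodes precede the hr, so it can change from page to page. One possible alternative, sketched here under assumptions, is to start at the hr itself and follow nextSibling. The names textSiblingsAfter, linkSiblings and siblings are hypothetical, and the mock objects only imitate DOM nodes so the code runs outside a browser. Note that this visits direct siblings only, whereas a TreeWalker also descends into child elements.

```javascript
// Collect the non-blank text nodes that follow startNode among its siblings.
function textSiblingsAfter(startNode) {
  const found = [];
  for (let n = startNode.nextSibling; n; n = n.nextSibling) {
    // 3 is the numeric value of Node.TEXT_NODE
    if (n.nodeType === 3 && n.textContent.trim() !== '') found.push(n);
  }
  return found;
}

// Hypothetical mock siblings linked through nextSibling, so the sketch runs in Node.js too.
function linkSiblings(nodes) {
  nodes.forEach((n, i) => { n.nextSibling = nodes[i + 1] || null; });
  return nodes;
}
const siblings = linkSiblings([
  { nodeType: 1, nodeName: 'HR', textContent: '' },
  { nodeType: 3, nodeName: '#text', textContent: '\n    Third\n    ' },
  { nodeType: 1, nodeName: 'BR', textContent: '' },
  { nodeType: 3, nodeName: '#text', textContent: '\n    Fourth\n    ' },
  { nodeType: 1, nodeName: 'BR', textContent: '' },
  { nodeType: 3, nodeName: '#text', textContent: '\n    ' },
]);

const texts = textSiblingsAfter(siblings[0]).map(n => n.textContent.trim());
console.log(texts); // [ 'Third', 'Fourth' ]
```

In a browser the starting node would be document.querySelector('div.someclass > hr').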
PRO