
I'm writing a parser that should extract "Extract This Text" from the following HTML:

<div class="a">
    <h1>some random text</h1>
    <div class="clear"></div>
    Extract This Text
    <p></p>
    <h2></h2>
</div>

I've tried to use:

document.querySelector('div.a > :nth-child(3)');

And even by using next sibling:

document.querySelector('div.a > :nth-child(2) + *');

But both of them skip it and return only the "p" element.

The only solution I see here is selecting the previous node and then using nextSibling to access it.

Can querySelector select text nodes at all?
Text node: https://developer.mozilla.org/en-US/docs/Web/API/Text
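
The nextSibling workaround mentioned above could be sketched roughly as follows. The helper name textAfter is hypothetical, and the sketch assumes that only the text directly following the selected element (before the next element) is wanted:

```javascript
// Hypothetical helper: return the trimmed text of the first non-blank text
// node that follows `node`, or null if an element (or nothing) comes first.
function textAfter(node) {
  const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers
  const TEXT_NODE = 3;    // same value as Node.TEXT_NODE in browsers
  for (let sib = node.nextSibling; sib; sib = sib.nextSibling) {
    if (sib.nodeType === ELEMENT_NODE) {
      return null; // reached the next element without finding any text
    }
    if (sib.nodeType === TEXT_NODE && sib.textContent.trim() !== '') {
      return sib.textContent.trim();
    }
  }
  return null;
}
```

In a browser it would be called with the element that precedes the text, e.g. textAfter(document.querySelector('div.a > div.clear')).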

icl7126
  • My workaround is to use `querySelector` to select the element and then extract the `#text` node with `Array.from(element.childNodes).find(node => node.nodeName === '#text')` – Draško Kokić Jan 06 '20 at 12:34
  • In this case, the Text node is the 3rd ChildNode, so you can access its text this way: `element.childNodes[2].textContent` – kol May 11 '21 at 18:02

3 Answers


As already answered, CSS does not provide text node selectors, and therefore document.querySelector doesn't either.

However, the DOM does provide an XPath evaluator through the method document.evaluate, which supports many more node tests, axes and operators than CSS selectors, including the selection of text nodes.

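let result = document.evaluate(
  // first text node that follows <div class="clear"> inside <div class="a">
  '//div[@class="a"]/div[@class="clear"]/following-sibling::text()[1]',
  document,                // context node the expression is evaluated against
  null,                    // namespace resolver (not needed for HTML)
  XPathResult.STRING_TYPE  // return the string value of the first match
).stringValue;

console.log(result.trim()); // "Extract This Text"

<body>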
  <div class="a">
    <h1>some random text</h1>
    <div class="clear"></div>
    Extract This Text
    <p></p>
    But Not This Text
    <h2></h2>
  </div>
</body>

The leading // matches at any depth, i.e. with any number of ancestor nodes above the div.
/html/body/div[@class="a"] would instead address the node by its absolute path.

It should be mentioned that CSS selector queries generally run much faster than the more powerful XPath evaluation. Therefore, avoid overusing document.evaluate where document.querySelectorAll does the job. Reserve it for the cases where you really need complex expressions to query the DOM.

Pinke Helga
  • Amazing! This is exactly what I should have been using from the start. Thanks! [MDN docs for Document.evaluate()](https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate) – icl7126 May 03 '20 at 18:56
  • @icl7126 Thank you! I've added a performance notice. You should decide from case to case which method to use. – Pinke Helga May 09 '20 at 01:02
  • would this be more performant than recursing into an entire DOM structure to find all the Text nodes it contained? – Michael Oct 14 '21 at 23:46
  • @Michael I guess so, since it is a builtin. However, I never have done a performance test. – Pinke Helga Apr 07 '22 at 13:57
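
For reference, the manual alternative raised in the last two comments (recursing through the DOM to find all text nodes) might look roughly like the sketch below. The helper name collectTextNodes is hypothetical, and nothing here measures how it compares with document.evaluate in speed:

```javascript
// Hypothetical helper: collect the trimmed text of every non-blank text
// node below `root`, in document order, by recursing through childNodes.
function collectTextNodes(root, out = []) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers
  for (const child of root.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      const text = child.textContent.trim();
      if (text) out.push(text); // skip whitespace-only text nodes
    } else {
      collectTextNodes(child, out); // descend into elements
    }
  }
  return out;
}
```

Called in a browser as collectTextNodes(document.querySelector('div.a')) on the HTML from this answer, it should return "some random text", "Extract This Text" and "But Not This Text", so the caller still has to pick the entry it wants.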

It can't, though my answer isn't that authoritative. (You may have figured that out already.)

You can check out the related questions "select text node with CSS" and "Is there a CSS selector for text nodes".

A more verbose explanation (maybe useless; English is not my first language, so sorry for any misused words or grammar):

I was learning about ParentNode, and since the querySelectorAll() method returns a NodeList, I was wondering whether it could select text nodes. I tried but failed, then googled and found this post.

The argument of querySelectorAll(selectors) or querySelector(selectors) is a DOMString containing one or more CSS selectors. CSS selectors only apply to elements, not to plain text. (Pseudo-elements are not supported either: with a pseudo-element selector, querySelector returns null and querySelectorAll returns an empty NodeList.)

kiz

Not directly, no. But you can access it from its parent:

const parent = document.querySelector('div.a')

const textNodes = [...parent.childNodes] // childNodes includes text nodes (children does not)
  .filter(child => child.nodeType === Node.TEXT_NODE) // keep only text nodes (nodeType 3)
  .map(textNode => textNode.textContent.trim()) // extract the text without surrounding whitespace
  .filter(text => text) // drop whitespace-only text

console.log(textNodes[0])
// "Extract This Text"

// make it a function
const extractText = (DOMElement) => [...DOMElement.childNodes] // childNodes includes text nodes (children does not)
  .filter(child => child.nodeType === Node.TEXT_NODE) // keep only text nodes (nodeType 3)
  .map(textNode => textNode.textContent.trim()) // extract the text without surrounding whitespace
  .filter(text => text) // drop whitespace-only text

console.log(extractText(document.querySelector('div.a'))[0])
// "Extract This Text"
Bernardo Dal Corno