How to get only TEXT_NODE with kuchiki

Question

I have this example HTML which I want to parse with kuchiki:

<a href="https://example.com"><em>@</em>Bananowy</a>

I want only Bananowy without @.

A similar question for JavaScript: How to get the text node of an element?

What is kuchiki? Can you please expand your question a little bit? E.g. some code example? — hellow, May 27 '19 at 17:27

score 3 · Accepted Answer · edited Jan 07 '20 at 11:01

First, let's start with how a parser will parse:

    <a href="https://example.com"><em>@</em>Bananowy</a>

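into a tree. The <a> element has two children: an <em> element, which itself contains the text node "@", and the text node "Bananowy".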

Now, if you do the obvious thing and call anchor.text_contents(), you get the concatenated contents of every text node that descends from the anchor tag (<a>), which here is "@Bananowy". This is how text_contents behaves; it mirrors the DOM textContent property rather than anything defined by CSS.
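To make the tree and the two kinds of text lookup concrete, here is a toy model in plain Rust. The `Node` enum and the `outline`, `text_contents` and `direct_text` functions are hypothetical stand-ins written for this illustration; they are not kuchiki's types and only mimic the behavior described above.

```rust
// Toy DOM model: NOT kuchiki's API, just an illustration of the tree shape.
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// Renders the tree as an indented outline, one node per line.
fn outline(node: &Node, depth: usize) -> String {
    let pad = "  ".repeat(depth);
    match node {
        Node::Text(t) => format!("{}Text({:?})\n", pad, t),
        Node::Element { tag, children } => {
            let mut out = format!("{}Element({})\n", pad, tag);
            for child in children {
                out.push_str(&outline(child, depth + 1));
            }
            out
        }
    }
}

// Mimics the behavior described for `text_contents()`: concatenate the
// text of every descendant text node, depth first.
fn text_contents(node: &Node) -> String {
    match node {
        Node::Text(t) => t.to_string(),
        Node::Element { children, .. } => children.iter().map(text_contents).collect(),
    }
}

// Keeps only the text nodes that are direct children of `node`.
fn direct_text(node: &Node) -> String {
    match node {
        Node::Text(t) => t.to_string(),
        Node::Element { children, .. } => children
            .iter()
            .filter_map(|child| match child {
                Node::Text(t) => Some(*t),
                Node::Element { .. } => None,
            })
            .collect(),
    }
}

// Hand-built equivalent of <a><em>@</em>Bananowy</a>.
fn sample_anchor() -> Node {
    Node::Element {
        tag: "a",
        children: vec![
            Node::Element { tag: "em", children: vec![Node::Text("@")] },
            Node::Text("Bananowy"),
        ],
    }
}

fn main() {
    let anchor = sample_anchor();
    print!("{}", outline(&anchor, 0));
    println!("all descendants: {}", text_contents(&anchor));
    println!("direct children: {}", direct_text(&anchor));
}
```

The recursive function walks into the <em> element and picks up the "@", while the non-recursive one stops at the anchor's own children.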

However, you want just "Bananowy", and there are a few ways to get it:

    extern crate kuchiki;

    use kuchiki::traits::*;

    fn main() {
        let html = r"<a href='https://example.com'><em>@</em>Bananowy</a>";

        let document = kuchiki::parse_html().one(html);

        let selector = "a";
        let anchor = document.select_first(selector).unwrap();

        // Quick and dirty hack: assumes the wanted text is the last child
        let last_child = anchor.as_node().last_child().unwrap();
        println!("{:?}", last_child.into_text_ref().unwrap());

        // Iterating solution: `as_text()` is `None` for non-text nodes
        for child in anchor.as_node().children() {
            if let Some(text) = child.as_text() {
                println!("{:?}", text);
            }
        }

        // Iterating solution, using the `text_nodes()` iterator adapter
        anchor.as_node().children().text_nodes().for_each(|e| {
            println!("{:?}", e);
        });

        // text1 and text2 show two ways to get an owned `String`
        let text1 = match anchor.as_node().children().text_nodes().last() {
            Some(x) => x.as_node().text_contents(),
            None => String::from(""),
        };

        let text2 = match anchor.as_node().children().text_nodes().last() {
            Some(x) => x.borrow().clone(),
            None => String::from(""),
        };

        println!("{} {}", text1, text2);
    }

The first way is brittle and hackish. All it relies on is that "Bananowy" is the last child of your anchor tag, so you fetch it with anchor.as_node().last_child().unwrap().into_text_ref().unwrap().

The second solution iterates over the anchor tag's children (i.e. [Tag(em), TextNode("Bananowy")]) and keeps only the text nodes. It does this with the as_text() method, which returns None for every node that isn't a text node. This is far less fragile than the first solution, which won't work if, for example, you had <a><em>@</em>Banan<i>!</i>owy</a>.
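The difference between the two approaches can be sketched without the library. In the toy model below, the `Child` enum and the `last_child_text` and `direct_text` helpers are hypothetical names invented for this illustration (they are not kuchiki's API); they only imitate "take the last child" versus "keep every direct text child".

```rust
// Toy stand-in for an element's child list: NOT kuchiki's API.
enum Child {
    Element, // e.g. <em> or <i>; its contents are irrelevant here
    Text(&'static str),
}

// "Quick and dirty" approach: look only at the last child.
fn last_child_text(children: &[Child]) -> Option<&'static str> {
    match children.last() {
        Some(Child::Text(t)) => Some(*t),
        _ => None, // no children, or the last child is an element
    }
}

// Iterating approach: keep every direct text child, skip elements.
fn direct_text(children: &[Child]) -> String {
    children
        .iter()
        .filter_map(|child| match child {
            Child::Text(t) => Some(*t),
            Child::Element => None,
        })
        .collect()
}

// Children of <a><em>@</em>Bananowy</a>
fn simple() -> Vec<Child> {
    vec![Child::Element, Child::Text("Bananowy")]
}

// Children of <a><em>@</em>Banan<i>!</i>owy</a>
fn split() -> Vec<Child> {
    vec![
        Child::Element,
        Child::Text("Banan"),
        Child::Element,
        Child::Text("owy"),
    ]
}

fn main() {
    println!("{:?}", last_child_text(&simple()));
    println!("{:?}", last_child_text(&split()));
    println!("{}", direct_text(&split()));
}
```

On the split markup the last-child lookup sees only the trailing fragment, and it yields nothing at all when the last child is an element, which is where the unwrap() in the hack would panic.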

EDIT:

PREFERRED Solution

After looking around for a bit, I found a much better solution to your problem: the TextNodes iterator.

With that in mind just write anchor.as_node().children().text_nodes().<<ITERATOR CODE GOES HERE>>; and then map or manipulate the entries as you see fit.

Why is this solution better? It's more succinct, and it uses a plain old Iterator, so it's very similar to the JavaScript answer you linked above.

I have found this comment how to navigate documentation: https://www.reddit.com/r/rust/comments/af4ns6/how_to_work_with_refcell/ — rofrol, Jan 06 '20 at 07:01
