I have this example HTML which I want to parse with kuchiki:
<a href="https://example.com"><em>@</em>Bananowy</a>
I want only Bananowy, without the @.
A similar question for JavaScript: How to get the text node of an element?
First, let's look at how a parser turns
<a href="https://example.com"><em>@</em>Bananowy</a>
into a tree. The <a> element ends up with two children: an <em> element (which holds the text node "@") and the text node "Bananowy".
Now, if you try the obvious thing and call anchor.text_contents(), you get the concatenated text of every text node that descends from the anchor tag (<a>), which here is "@Bananowy". This matches how textContent behaves in the DOM.
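To make that behaviour concrete, here is a minimal sketch using a toy tree. The `Node` enum, `text_contents` function and `sample_anchor` helper are made up for illustration and are not kuchiki's types; the sketch only assumes that the text is gathered by walking the whole subtree.

```rust
// Toy model of a parsed HTML tree (hypothetical, NOT kuchiki's types).
#[allow(dead_code)]
#[derive(Debug)]
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// Concatenate every text node in the subtree, depth first.
// This mimics the "all descendants" behaviour described above.
fn text_contents(node: &Node) -> String {
    match node {
        Node::Text(t) => t.to_string(),
        Node::Element { children, .. } => children.iter().map(text_contents).collect(),
    }
}

// Builds the tree for <a><em>@</em>Bananowy</a>.
fn sample_anchor() -> Node {
    Node::Element {
        tag: "a",
        children: vec![
            Node::Element { tag: "em", children: vec![Node::Text("@")] },
            Node::Text("Bananowy"),
        ],
    }
}
```

Calling `text_contents(&sample_anchor())` gives "@Bananowy", because the "@" inside <em> is a descendant of the anchor too.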
However, you want just "Bananowy". There are a few ways to get it:
extern crate kuchiki;

use kuchiki::traits::*;

fn main() {
    let html = r"<a href='https://example.com'><em>@</em>Bananowy</a>";
    let document = kuchiki::parse_html().one(html);
    let selector = "a";
    let anchor = document.select_first(selector).unwrap();

    // Quick and dirty hack: assume the text is the last child
    let last_child = anchor.as_node().last_child().unwrap();
    println!("{:?}", last_child.into_text_ref().unwrap());

    // Iterating solution: keep only children that are text nodes
    for child in anchor.as_node().children() {
        if let Some(a) = child.as_text() {
            println!("{:?}", a);
        }
    }

    // Iterating solution - Using `text_nodes()` iterators
    anchor.as_node().children().text_nodes().for_each(|e| {
        println!("{:?}", e);
    });

    // text1 and text2 are examples how to get `String`
    let text1 = match anchor.as_node().children().text_nodes().last() {
        Some(x) => x.as_node().text_contents(),
        None => String::from(""),
    };
    let text2 = match anchor.as_node().children().text_nodes().last() {
        Some(x) => x.borrow().clone(),
        None => String::from(""),
    };
}
The first way is the brittle, hackish one. All you need to realize is that "Bananowy" is the last_child of your anchor tag, so you can fetch it with anchor.as_node().last_child().unwrap().into_text_ref().unwrap().
The second solution is to iterate over the anchor tag's children (i.e. [Tag(em), TextNode("Bananowy")]) and keep only the text nodes. We do this with the as_text() method, which returns None for every Node that isn't a TextNode. This is far less fragile than the first solution, which won't work if, for example, you had <a><em>@</em>Banan<i>!</i>owy</a>.
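Here is a rough sketch of why the last-child approach breaks on that input. It again uses a made-up toy `Node` enum rather than kuchiki, and `last_child_text`, `own_text` and `split_anchor_children` are hypothetical helper names that exist only for this illustration.

```rust
// Toy model of a parsed HTML tree (hypothetical, NOT kuchiki's types).
#[allow(dead_code)]
#[derive(Debug)]
enum Node {
    Element { tag: &'static str, children: Vec<Node> },
    Text(&'static str),
}

// First approach: assume the wanted text is the last child.
fn last_child_text(children: &[Node]) -> Option<&'static str> {
    match children.last() {
        Some(Node::Text(t)) => Some(*t),
        _ => None,
    }
}

// Second approach: keep only the direct children that are text nodes
// and join them, skipping element children such as <em> and <i>.
fn own_text(children: &[Node]) -> String {
    children
        .iter()
        .filter_map(|child| match child {
            Node::Text(t) => Some(*t),
            Node::Element { .. } => None,
        })
        .collect()
}

// Children of <a><em>@</em>Banan<i>!</i>owy</a>.
fn split_anchor_children() -> Vec<Node> {
    vec![
        Node::Element { tag: "em", children: vec![Node::Text("@")] },
        Node::Text("Banan"),
        Node::Element { tag: "i", children: vec![Node::Text("!")] },
        Node::Text("owy"),
    ]
}
```

With these children, `last_child_text` only sees Some("owy"), while `own_text` yields "Bananowy" because it collects every direct text child.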
EDIT:
After looking around for a bit, I found a much better solution to your problem: the TextNodes iterator.
With that in mind, just write anchor.as_node().children().text_nodes().<<ITERATOR CODE GOES HERE>>;
and then map or manipulate the entries as you see fit.
Why is this solution better? It's more succinct, and it uses the good old-fashioned Iterator, so it's very similar to the JS answer you linked above.