I would like to parse a web page, insert anchors at certain positions and render the modified DOM out again in order to generate docsets for Dash. Is this possible?

From the examples included in html5ever, I can see how to read an HTML file and do a poor man's HTML output, but I don't understand how I can modify the RcDom object I retrieved.

I would like to see a snippet that inserts an anchor element (`<a name="foo"></a>`) into an RcDom.

Note: this is a question regarding Rust and html5ever specifically ... I know how to do it in other languages or simpler HTML parsers.

kesselborn
  • It's easier to parse HTML with the higher-level [scraper](https://github.com/programble/scraper) or [kuchiki](https://crates.io/crates/kuchiki) rather than html5ever directly. – Wilfred Hughes Nov 19 '19 at 17:55

1 Answer

Here is some code that parses a document, appends `#anchor` to the link's `href` attribute, and prints the modified document:

```rust
extern crate html5ever;

use html5ever::{ParseOpts, parse_document};
use html5ever::tree_builder::TreeBuilderOpts;
use html5ever::rcdom::RcDom;
use html5ever::rcdom::NodeEnum::Element;
use html5ever::serialize::{SerializeOpts, serialize};
use html5ever::tendril::TendrilSink;

fn main() {
    // Omit the DOCTYPE from the tree, so <html> is the document's first child.
    let opts = ParseOpts {
        tree_builder: TreeBuilderOpts {
            drop_doctype: true,
            ..Default::default()
        },
        ..Default::default()
    };
    let data = "<!DOCTYPE html><html><body><a href=\"foo\"></a></body></html>".to_string();
    // Parse the UTF-8 input into a reference-counted DOM.
    let dom = parse_document(RcDom::default(), opts)
        .from_utf8()
        .read_from(&mut data.as_bytes())
        .unwrap();

    // Walk down the tree: document -> <html> -> <body>.
    let document = dom.document.borrow();
    let html = document.children[0].borrow();
    let body = html.children[1].borrow(); // Implicit head element at children[0].

    {
        // Mutably borrow the <a> node and append to its first attribute (href).
        let mut a = body.children[0].borrow_mut();
        if let Element(_, _, ref mut attributes) = a.node {
            attributes[0].value.push_tendril(&From::from("#anchor"));
        }
    }

    // Serialize the modified tree back into a string.
    let mut bytes = vec![];
    serialize(&mut bytes, &dom.document, SerializeOpts::default()).unwrap();
    let result = String::from_utf8(bytes).unwrap();
    println!("{}", result);
}
```

This prints the following:

```html
<html><head></head><body><a href="foo#anchor"></a></body></html>
```

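As you can see, you navigate the tree through each node's `children` vector, borrowing each node as you descend.

To modify an element, take a mutable borrow of its node with `borrow_mut()` and match on `Element`; its third field is the vector of attributes, which you can change in place.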
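The code above edits an existing attribute, while the question asks about inserting a new element. The same borrowing pattern covers that case. The following is a minimal sketch using only the standard library: `Node`, `NodeKind`, `Handle`, `element`, `text`, `insert_anchor`, and `serialize` are hypothetical stand-ins written for illustration, not html5ever's types or functions, and the toy serializer does no HTML escaping. It only shows the idea of building a node and inserting its handle into the parent's `children` vector through a mutable borrow.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-ins for an RcDom-like tree; NOT html5ever's actual types.
type Handle = Rc<RefCell<Node>>;

enum NodeKind {
    Text(String),
    Element { name: String, attrs: Vec<(String, String)> },
}

struct Node {
    kind: NodeKind,
    children: Vec<Handle>,
}

// Build an element node with the given attributes and no children.
fn element(name: &str, attrs: &[(&str, &str)]) -> Handle {
    Rc::new(RefCell::new(Node {
        kind: NodeKind::Element {
            name: name.to_string(),
            attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        },
        children: Vec::new(),
    }))
}

// Build a text node.
fn text(content: &str) -> Handle {
    Rc::new(RefCell::new(Node {
        kind: NodeKind::Text(content.to_string()),
        children: Vec::new(),
    }))
}

// Insert <a name="..."></a> as the child at position `index` of `parent`.
fn insert_anchor(parent: &Handle, index: usize, name: &str) {
    let anchor = element("a", &[("name", name)]);
    // The mutable borrow lasts only for this statement.
    parent.borrow_mut().children.insert(index, anchor);
}

// Toy serializer for the sketch (performs no escaping).
fn serialize(node: &Handle, out: &mut String) {
    let node = node.borrow();
    match &node.kind {
        NodeKind::Text(content) => out.push_str(content),
        NodeKind::Element { name, attrs } => {
            out.push('<');
            out.push_str(name);
            for (key, value) in attrs {
                out.push_str(&format!(" {}=\"{}\"", key, value));
            }
            out.push('>');
            for child in &node.children {
                serialize(child, out);
            }
            out.push_str(&format!("</{}>", name));
        }
    }
}

// Build <body><h1>Title</h1></body>, then insert the anchor before the heading.
fn render_example() -> String {
    let body = element("body", &[]);
    let heading = element("h1", &[]);
    heading.borrow_mut().children.push(text("Title"));
    body.borrow_mut().children.push(heading);

    insert_anchor(&body, 0, "foo");

    let mut out = String::new();
    serialize(&body, &mut out);
    out
}

fn main() {
    println!("{}", render_example());
}
```

Running this sketch prints `<body><a name="foo"></a><h1>Title</h1></body>`. With html5ever itself, the new node would have to be built from the crate's own node and attribute types for whichever version you use, so treat this as an illustration of the pattern only.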

antoyo
  • Thanks a lot, exactly what I was hoping for. – kesselborn Aug 10 '16 at 21:36
  • This answer is a year old, but I tried the code today and it fails to compile for me. I am on Rust 1.20.0 and using the latest version of html5ever. The error is `unresolved import html5ever::rcdom::NodeEnum::Element`, and it says it cannot find NodeEnum anymore. Was it deprecated? Did I miss something? – ghlecl Nov 09 '17 at 14:56
  • Look at this example; it uses different data structures that look more current: https://github.com/servo/html5ever/blob/master/html5ever/examples/print-rcdom.rs – kirhgoff May 29 '18 at 16:29
  • Updated example link (I think): https://github.com/servo/html5ever/blob/master/rcdom/examples/print-rcdom.rs – thomasa88 Jul 28 '21 at 08:32