Parsing HTML page content in a stream with hyper and html5ever

Question

I'm trying to parse the HTML response of an HTTP request. I'm using hyper for the requests and html5ever for the parsing. The HTML will be pretty large and I don't need to fully parse it -- I just need to identify some data from tags so I would prefer to stream it. Conceptually I want to do something like:

# bash
curl url | read_dom

/* javascript */
http.get(url).pipe(parser);
parser.on("tag", /* check tag name, attributes, and act */)

What I have come up with so far is:

extern crate hyper;
extern crate html5ever;

use std::default::Default;
use hyper::Client;
use html5ever::parse_document;
use html5ever::rcdom::{RcDom};

fn main() {
    let client = Client::new();

    let res = client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()
        .unwrap();

    res.read_to_end(parse_document(RcDom::default(),
      Default::default().from_utf8().unwrap()));
}

It seems like read_to_end is the method I want to call on the response to read the bytes, but it is unclear to me how to pipe this to the HTML document reader ... if this is even possible.

The documentation for parse_document says to use from_utf8 or from_bytes if the input is in bytes (which it is).

It seems that I need to create a sink from the response, but this is where I am stuck. It's also unclear to me how I can create events to listen for tag starting which is what I am interested in.

I've looked at this example of html5ever which seems to do what I want and walks the DOM, but I can't get this example itself to run -- either it's outdated or tendril/html5ever is too new. This also seems to parse the HTML as a whole rather than as a stream, but I'm not sure.

Is it possible to do what I want to do with the current implementation of these libraries?

score 8 · Accepted Answer · answered Feb 26 '16 at 20:02

Sorry for the lack of tutorial-like documentation for html5ever and tendril…

Unless you’re 100% sure your content is in UTF-8, use from_bytes rather than from_utf8. They return something that implements TendrilSink which allows you to provide the input incrementally (or not).
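
To illustrate the risk of assuming UTF-8, here is a minimal std-only sketch. `String::from_utf8_lossy` stands in for a decoder that assumes UTF-8; that is an assumption made for illustration and not html5ever's actual code path, and the function names are hypothetical.

```rust
/// Returns true when the bytes are valid UTF-8 as-is.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Decodes bytes as if they were UTF-8, replacing each invalid
/// sequence with U+FFFD (the replacement character).
fn decode_assuming_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

A Latin-1 encoded "café" ends in the single byte 0xE9, which is not valid UTF-8 on its own, so the accented letter is lost when UTF-8 is assumed.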

The std::io::Read::read_to_end method takes a &mut Vec<u8>, so it doesn’t work with TendrilSink.

At the lowest level, you can call the TendrilSink::process method once per &[u8] chunk, and then call TendrilSink::finish.

To avoid doing that manually, there’s also the TendrilSink::read_from method that takes &mut R where R: std::io::Read. Since hyper::client::Response implements Read, you can use:

parse_document(RcDom::default(), Default::default()).from_bytes().read_from(&mut res)
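
The relationship between the per-chunk calls and `read_from` can be sketched with the standard library alone. The `ChunkSink` trait and `TagCounter` type below are hypothetical stand-ins written for this illustration; they are not tendril's real `TendrilSink` definition, and the 8-byte buffer is deliberately tiny so that the input is split across many chunks.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a sink that accepts input chunk by chunk.
trait ChunkSink {
    type Output;
    /// Called once per chunk of input.
    fn process(&mut self, chunk: &[u8]);
    /// Called once after the last chunk.
    fn finish(self) -> Self::Output;
}

/// Toy sink: counts `<` bytes and total bytes across all chunks.
#[derive(Default)]
struct TagCounter {
    open_angle: usize,
    bytes: usize,
}

impl ChunkSink for TagCounter {
    type Output = (usize, usize);

    fn process(&mut self, chunk: &[u8]) {
        self.bytes += chunk.len();
        self.open_angle += chunk.iter().filter(|&&b| b == b'<').count();
    }

    fn finish(self) -> (usize, usize) {
        (self.open_angle, self.bytes)
    }
}

/// Reads `reader` in small chunks, feeding each one to `sink`,
/// then finishes the sink once the reader reports end of input.
fn read_from<S: ChunkSink, R: Read>(mut sink: S, reader: &mut R) -> io::Result<S::Output> {
    let mut buf = [0u8; 8];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(sink.finish()),
            Ok(n) => sink.process(&buf[..n]),
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Any `Read` implementor can be passed as `reader`, for example a byte slice in a test or an HTTP response body in real code; the whole input never has to sit in memory at once.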

To go beyond your question, RcDom is very minimal and mostly exists in order to test html5ever. I recommend using Kuchiki instead. It has more features (for tree traversal, CSS Selector matching, …) including optional Hyper support.

In your Cargo.toml:

[dependencies]
kuchiki = {version = "0.3.1", features = ["hyper"]}

In your code:

let document = kuchiki::parse_html().from_http(res).unwrap();

answered Feb 26 '16 at 20:02

Simon Sapin

Can you link me more information about Kuchiki like how to implement tree traversal and specifically how to use things like "open tag" events to inspect tag/text contents? This is what I need to do. – Explosion Pills Feb 26 '16 at 20:12
It looks like the [documentation](https://simonsapin.github.io/kuchiki/kuchiki/struct.Node.html) is buggy; there are more methods that don’t show up there. For example, nodes have methods like `.descendants()` and `.inclusive_descendants()` that return iterators of nodes. I’m not sure what you mean by "open tag". Kuchiki is not event-based: you get a tree data structure once parsing is done. – Simon Sapin Feb 26 '16 at 20:17
Thanks. Too bad if it has to parse the entire document at once. I want something like [htmlparser2 for node](https://github.com/fb55/htmlparser2) where I can pipe a stream of html to the parser and respond to `onstarttag`, etc. – Explosion Pills Feb 26 '16 at 20:20
In order to be compatible with legacy web content, a conforming HTML parser needs to do all kinds of complex tree manipulation like the ["adoption agency algorithm"](https://html.spec.whatwg.org/multipage/syntax.html#adoption-agency-algorithm). The only way to do that and have an event-based API is to [buffer the entire document](https://github.com/servo/html5ever/issues/149#issuecomment-120991146), which defeats the point. I suppose you could sacrifice standards compliance, but you risk being incompatible (parsing pages differently) with other parsers like those in web browsers. – Simon Sapin Feb 26 '16 at 22:12
For example code, you can search for [extern crate kuchiki](https://github.com/search?q=extern+crate+kuchiki&type=Code&utf8=%E2%9C%93) on GitHub. – Feb 06 '17 at 03:19

score 1 · Answer 2 · answered Dec 19 '20 at 20:46

Unless I'm misunderstanding something, processing the HTML tokens is quite involved (and the names of the atom constants are unfortunately very far from perfect). This code demonstrates how to use html5ever version 0.25.1 to process the tokens.

First, we want a String with the HTML body:

let body = {
    let mut body = String::new();
    let client = Client::new();

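    // `read_to_string` comes from `std::io::Read`, which must be in scope.
    // Propagate read errors with `?` instead of silently dropping the result.
    client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()?
        .read_to_string(&mut body)?;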

    body
};

Second, we need to define our own Sink, which contains the "callbacks" and lets you hold any state you need. For this example, I will detect <a> tags and print them back as HTML (this requires us to detect start tags, end tags, and text, and to find an attribute; hopefully a complete-enough example):

use html5ever::tendril::StrTendril;
use html5ever::tokenizer::{
    BufferQueue, Tag, TagKind, Token, TokenSink, TokenSinkResult, Tokenizer,
};
use html5ever::{ATOM_LOCALNAME__61 as TAG_A, ATOM_LOCALNAME__68_72_65_66 as ATTR_HREF};

// Define your own `TokenSink`. This is where you keep state and where your "callbacks" run.
struct Sink {
    text: Option<String>,
}

impl TokenSink for Sink {
    type Handle = ();

    fn process_token(&mut self, token: Token, _line_number: u64) -> TokenSinkResult<()> {
        match token {
            Token::TagToken(Tag {
                kind: TagKind::StartTag,
                name,
                self_closing: _,
                attrs,
            }) => match name {
                // Check tag name, attributes, and act.
                TAG_A => {
                    let url = attrs
                        .into_iter()
                        .find(|a| a.name.local == ATTR_HREF)
                        .map(|a| a.value.to_string())
                        .unwrap_or_else(|| "".to_string());

                    print!("<a href=\"{}\">", url);
                    self.text = Some(String::new());
                }
                _ => {}
            },
            Token::TagToken(Tag {
                kind: TagKind::EndTag,
                name,
                self_closing: _,
                attrs: _,
            }) => match name {
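                TAG_A => {
                    // A stray `</a>` with no matching start tag leaves `text` as `None`,
                    // so fall back to an empty string instead of panicking.
                    println!(
                        "{}</a>",
                        self.text.take().unwrap_or_default()
                    );
                }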
                _ => {}
            },
            Token::CharacterTokens(string) => {
                if let Some(text) = self.text.as_mut() {
                    text.push_str(&string);
                }
            }
            _ => {}
        }
        TokenSinkResult::Continue
    }
}


let sink = {
    let sink = Sink {
        text: None,
    };

    // Now, feed the HTML `body` string to the tokenizer.
    // This requires a bit of setup (buffer queue, tendrils, etc.).
    let mut input = BufferQueue::new();
    input.push_back(StrTendril::from_slice(&body).try_reinterpret().unwrap());
    let mut tok = Tokenizer::new(sink, Default::default());
    let _ = tok.feed(&mut input);
    tok.end();
    tok.sink
};

// `sink` is your `Sink` after all processing was done.
assert!(sink.text.is_none());

score -3 · Answer 3 · answered Feb 26 '16 at 20:03

Try adding this:

let mut result: Vec<u8> = Vec::new();

res.read_to_end(&mut result);

let parse_result = parse_document(RcDom::default(), Default::default())
    . //read parameters
    .unwrap();

Pass the parameters according to the crate documentation...

answered Feb 26 '16 at 20:03

Ivan Temchenko

Where do you actually use the `result` with the parser? – Explosion Pills Feb 26 '16 at 20:07
I guess as a parameter, like `.read_from(&mut result.lock())`, as described in the documentation you linked... – Ivan Temchenko Feb 26 '16 at 20:12
