I'm trying to parse the HTML response of an HTTP request. I'm using hyper for the requests and html5ever for the parsing. The HTML will be pretty large, and I don't need to fully parse it -- I just need to identify some data from the tags -- so I would prefer to stream it. Conceptually, I want to do something like:
# bash
curl url | read_dom
/* javascript */
http.get(url).pipe(parser);
parser.on("tag", /* check tag name, attributes, and act */)
What I have come up with so far is:
extern crate hyper;
extern crate html5ever;

use std::default::Default;
use std::io::Read;

use hyper::Client;
use hyper::header::ContentType;

use html5ever::parse_document;
use html5ever::rcdom::RcDom;

fn main() {
    let client = Client::new();
    // WEBPAGE and BODY are constants defined elsewhere.
    let res = client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()
        .unwrap();
    // This is where I'm stuck -- this doesn't compile:
    res.read_to_end(parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .unwrap());
}
It seems like read_to_end is the method I want to call on the response to read the bytes, but it is unclear to me how to pipe this into the HTML parser, if that is even possible. The documentation for parse_document says to use from_utf8 or from_bytes if the input is in bytes (which it is).
It seems that I need to create a sink from the response, but this is where I am stuck. It's also unclear to me how to get an event for each opening tag, which is what I am interested in.
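To make concrete what I mean by listening for tag events, here is a toy sketch of the push-based sink pattern I'm after. It does no real HTML parsing -- TagSink, PrintSink, and feed are names I made up for this sketch, and the naive scanner assumes no tag spans a chunk boundary -- but it shows the shape of the API I'm hoping these libraries expose:

```rust
// Toy illustration only: a hand-rolled scanner standing in for a real
// streaming HTML tokenizer. `TagSink` and `feed` are invented names.

trait TagSink {
    // Called once per opening tag found in the stream.
    fn on_tag(&mut self, name: &str);
}

struct PrintSink {
    seen: Vec<String>,
}

impl TagSink for PrintSink {
    fn on_tag(&mut self, name: &str) {
        self.seen.push(name.to_string());
    }
}

// Feed one chunk of the response body to the sink. A real parser would
// keep state across chunks; this one naively assumes each tag is whole.
fn feed<S: TagSink>(chunk: &str, sink: &mut S) {
    let mut rest = chunk;
    while let Some(start) = rest.find('<') {
        rest = &rest[start + 1..];
        if rest.starts_with('/') || rest.starts_with('!') {
            continue; // skip closing tags, comments, and doctype
        }
        // The tag name ends at '>', whitespace, or '/'.
        let end = rest
            .find(|c: char| c == '>' || c == ' ' || c == '/')
            .unwrap_or(rest.len());
        if end > 0 {
            sink.on_tag(&rest[..end]);
        }
    }
}

fn main() {
    let mut sink = PrintSink { seen: Vec::new() };
    // Pretend these are chunks read incrementally from the response body.
    for chunk in ["<html><body>", "<a href=\"x\">hi</a>", "</body></html>"] {
        feed(chunk, &mut sink);
    }
    println!("{:?}", sink.seen); // ["html", "body", "a"]
}
```

The point is the split: something drives bytes in chunk by chunk, and my code only implements the sink that reacts to tag events. That is the arrangement I am trying to reproduce with hyper's response and html5ever's parser.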
I've looked at this example of html5ever, which seems to do what I want and walks the DOM, but I can't get the example itself to run -- either it's outdated, or my version of tendril/html5ever is too new. It also seems to parse the HTML as a whole rather than as a stream, but I'm not sure.
Is it possible to do what I want to do with the current implementation of these libraries?