
I would like to iterate over some gzip-compressed TSV files (tsv.gz) served over the network, e.g. those at https://datasets.imdbws.com/, parse them row by row, and write only some of the columns to files, ideally decompressing with flate2. I can't find an idiom for iterating over a file fetched from a URI. Should I use an external crate like hyper and iterate over the Body? If so, how can I convert a Body into something that implements Read? Here is some base code:

use flate2::read::GzDecoder;
use hyper::Client;
use std::io::{BufRead, BufReader}; // BufRead must be in scope for .lines()

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let client = Client::new();
    let uri = "http://datasets.imdbws.com/title.basics.tsv.gz".parse()?;
    let body = client.get(uri).await?.into_body();
    let d = GzDecoder::new(body); // hyper::Body doesn't implement Read
    for line in BufReader::new(d).lines() {
        // do something with lines
    }
    Ok(())
}
Dan Jenson
  • What exactly are you trying to iterate over? Can you clarify your question with an example of how you would use such a function? – Coder-256 Aug 14 '20 at 02:39
  • @Coder-256 I've updated the question; I only need to handle tsv.gz files. – Dan Jenson Aug 14 '20 at 03:58
  • I would edit your question to make it clear that the problem you are trying to solve is: how to download and iterate over a gzip-compressed, tab-separated values file – Coder-256 Aug 14 '20 at 07:04
  • I think you have the right idea with `flate2` and `hyper` (although maybe `reqwest` would be more suited to this use case). What exactly is your question, just how to iterate over lines of a file? That has already been asked and answered, [see here](https://stackoverflow.com/a/45882510/3398839). – Coder-256 Aug 14 '20 at 07:10
  • The `reqwest::Response` object implements `Read`, so you can wrap it in a `BufReader` and treat it like any other file. The fact that the file is streamed is transparent. – Sven Marnach Aug 14 '20 at 08:05
  • @SvenMarnach, it doesn't look like `reqwest::Response` implements `Read`: https://docs.rs/reqwest/0.10.7/reqwest/struct.Response.html – Dan Jenson Aug 14 '20 at 14:22
  • @DanJenson Looks like the blocking version of the response has been moved to the `blocking` module: https://docs.rs/reqwest/0.10.7/reqwest/blocking/struct.Response.html#impl-Read The async version indeed doesn't implement `Read`, since it wouldn't make sense for it to do so. – Sven Marnach Aug 14 '20 at 18:46
  • Until version 0.9, `reqwest::Response` was the blocking version: https://docs.rs/reqwest/0.9.24/reqwest/struct.Response.html – Sven Marnach Aug 14 '20 at 18:52
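
To make the suggestion in the comments concrete, here is a minimal sketch of the column-extraction step, written against any `Read` so that it does not depend on where the bytes come from. The function name `write_columns` and the sample column layout are hypothetical, not taken from the question. The wiring shown in the comment assumes reqwest's `blocking` feature and flate2's `GzDecoder`, and has not been run against the IMDb files.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};

/// Streams tab-separated lines from `input` and writes only the selected
/// zero-based `columns` to `output`, joined by tabs.
/// A column index that is missing on a given line is written as an empty field.
///
/// Hypothetical wiring for a remote tsv.gz (assumes reqwest with the
/// `blocking` feature, plus flate2):
///   let resp = reqwest::blocking::get(url)?;           // implements Read
///   let decoder = flate2::read::GzDecoder::new(resp);  // also implements Read
///   write_columns(decoder, out_file, &[0, 2])?;
pub fn write_columns<R: Read, W: Write>(
    input: R,
    mut output: W,
    columns: &[usize],
) -> io::Result<()> {
    // BufReader lets us pull one line at a time instead of loading the whole file.
    for line in BufReader::new(input).lines() {
        let line = line?;
        let fields: Vec<&str> = line.split('\t').collect();
        let selected: Vec<&str> = columns
            .iter()
            .map(|&i| fields.get(i).copied().unwrap_or(""))
            .collect();
        writeln!(output, "{}", selected.join("\t"))?;
    }
    Ok(())
}
```

Because the function only requires `Read`, the same code can be exercised with an in-memory byte slice, a local file, or a decompressing reader wrapped around a blocking HTTP response; the streaming is transparent to the loop.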

0 Answers