
I am exploring Rust and trying to make a simple HTTP request (using the hyper crate) and print the response body to the console. The response implements std::io::Read. After reading various documentation sources and basic tutorials, I have arrived at the following code, which I compile and execute using RUST_BACKTRACE=1 cargo run:

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            let mut body = String::new();

            match res.read_to_string(&mut body) {
                Ok(body) => println!("{:?}", body),
                Err(why) => panic!("String conversion failure: {:?}", why)
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}

Expected:

A nice, human-readable HTML content of the body, as delivered by the HTTP server, is printed to the console.

Actual:

200 OK
thread '<main>' panicked at 'String conversion failure: Error { repr: Custom(Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }) }', src/printer.rs:16
stack backtrace:
   1:        0x109e1faeb - std::sys::backtrace::tracing::imp::write::h3800f45f421043b8
   2:        0x109e21565 - std::panicking::default_hook::_$u7b$$u7b$closure$u7d$$u7d$::h0ef6c8db532f55dc
   3:        0x109e2119e - std::panicking::default_hook::hf3839060ccbb8764
   4:        0x109e177f7 - std::panicking::rust_panic_with_hook::h5dd7da6bb3d06020
   5:        0x109e21b26 - std::panicking::begin_panic::h9bf160aee246b9f6
   6:        0x109e18248 - std::panicking::begin_panic_fmt::haf08a9a70a097ee1
   7:        0x109d54378 - libplayground::printer::print_html::hff00c339aa28fde4
   8:        0x109d53d76 - playground::main::h0b7387c23270ba52
   9:        0x109e20d8d - std::panicking::try::call::hbbf4746cba890ca7
  10:        0x109e23fcb - __rust_try
  11:        0x109e23f65 - __rust_maybe_catch_panic
  12:        0x109e20bb1 - std::rt::lang_start::hbcefdc316c2fbd45
  13:        0x109d53da9 - main
error: Process didn't exit successfully: `target/debug/playground` (exit code: 101)

Thoughts

Since I received 200 OK, I believe the server sent a valid response (I can also confirm this empirically by making the same request in a more familiar programming language). Therefore, the error must be caused by me incorrectly converting the byte sequence into a UTF-8 string.

Alternatives

I also attempted the following solution, which gets me to a point where I can print the bytes to the console as a series of hex strings, but I know that this is fundamentally wrong because a UTF-8 character can occupy 1 to 4 bytes. Therefore, attempting to convert individual bytes into UTF-8 characters in this example will work only for a very limited subset of UTF-8 characters (the 128 single-byte ASCII characters, to be exact).

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(res) => {
            println!("{}", res.status);

            for byte in res.bytes() {
                print!("{:x}", byte.unwrap());
            }
        },
        Err(why) => panic!("{:?}", why)
    }
}
Shepmaster
Robert Rossmann

2 Answers


We can confirm with the iconv command that the data returned from http://www.google.com is not valid UTF-8:

$ wget http://google.com -O page.html
$ iconv -f utf-8 page.html > /dev/null
iconv: illegal input sequence at position 5591

For some other URLs (like http://www.reddit.com) the code works fine.

If we assume that most of the data is valid UTF-8, we can use String::from_utf8_lossy to work around the problem:

use hyper::client::Client;
use std::io::Read;

pub fn print_html(url: &str) {
    let client = Client::new();
    let req = client.get(url).send();

    match req {
        Ok(mut res) => {
            println!("{}", res.status);

            // Read raw bytes; unlike read_to_string, this does not require valid UTF-8.
            let mut body = Vec::new();

            match res.read_to_end(&mut body) {
                // Invalid byte sequences are replaced with U+FFFD.
                Ok(_) => println!("{}", String::from_utf8_lossy(&body)),
                Err(why) => panic!("Read failure: {:?}", why),
            }
        }
        Err(why) => panic!("{:?}", why),
    }
}
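
To see what the lossy conversion does without making a network request, here is a minimal sketch using only the standard library. The helper name decode_lossy is made up for illustration. In ISO-8859-1, "é" is the single byte 0xE9, which is not a valid UTF-8 sequence on its own, so it is replaced with U+FFFD while the surrounding ASCII survives:

```rust
/// Hypothetical helper: decodes bytes as UTF-8, replacing every invalid
/// sequence with U+FFFD (the Unicode replacement character).
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, decode_lossy(b"caf\xe9!") keeps "caf" and "!" and substitutes the replacement character for the 0xE9 byte. The original character is lost, which is why this is only a workaround.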

Note that Read::read_to_string and Read::read_to_end return Ok with the number of bytes read on success, not the data itself. The data ends up in the buffer you pass in.
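
A small sketch of that distinction, assuming nothing beyond the standard library (the function name read_text is made up for illustration). A byte slice implements Read, so it can stand in for the HTTP response:

```rust
use std::io::Read;

/// Hypothetical helper: reads everything from `reader` and returns both the
/// byte count reported by read_to_string and the text it stored in the buffer.
fn read_text<R: Read>(mut reader: R) -> std::io::Result<(usize, String)> {
    let mut text = String::new();
    // The Ok value is the number of bytes read; the data goes into `text`.
    let n = reader.read_to_string(&mut text)?;
    Ok((n, text))
}
```

Calling read_text(&b"<html>"[..]) yields the count 6 together with the string "<html>". Printing the Ok value, as the question's code does, shows only the count.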

malbarbo
  • Bad Google! And I thought it must be the codez. Next time I'll remember to try multiple sites, it indeed works for others. Thanks! And also thanks for the note about return value. – Robert Rossmann Jul 22 '16 at 19:27

If you actually look at the headers that Google returns:

HTTP/1.1 200 OK
Date: Fri, 22 Jul 2016 20:45:54 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=82=YwAD4Rj09u6gUA8OtQH73BUz6UlNdeRc9Z_iGjyaDqFdRGMdslypu1zsSDWQ4xRJFyEn9-UtR7U6G7HKehoyxvy9HItnDlg8iLsxzlhNcg01luW3_-HWs3l9S3dmHIVh; expires=Sat, 21-Jan-2017 20:45:54 GMT; path=/; domain=.google.ca; HttpOnly
Alternate-Protocol: 443:quic
Alt-Svc: quic=":443"; ma=2592000; v="36,35,34,33,32,31,30,29,28,27,26,25"
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

You can see that the body is declared as ISO-8859-1, not UTF-8:

Content-Type: text/html; charset=ISO-8859-1

Additionally, regarding this statement from the question:

Therefore, the error must be caused by me incorrectly converting the byte sequence into a UTF-8 string.

There is no conversion to UTF-8 happening. read_to_string simply ensures that the data is UTF-8.
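
This can be checked with the standard library alone. The sketch below (the function name is made up for illustration) feeds bytes to read_to_string and reports whether they were rejected with ErrorKind::InvalidData, the same error kind shown in the question's panic message:

```rust
use std::io::{ErrorKind, Read};

/// Hypothetical helper: returns true if read_to_string rejects the reader's
/// bytes because they are not valid UTF-8.
fn is_rejected_as_non_utf8<R: Read>(mut reader: R) -> bool {
    let mut text = String::new();
    match reader.read_to_string(&mut text) {
        Err(e) => e.kind() == ErrorKind::InvalidData,
        Ok(_) => false,
    }
}
```

The ISO-8859-1 bytes b"caf\xe9" are rejected, while "café".as_bytes() (already UTF-8) is accepted unchanged. No transcoding is attempted in either case.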

Simply put, assuming that an arbitrary HTML page is encoded in UTF-8 is completely incorrect. At best, you have to parse the headers to find the encoding and then convert the data. This is complicated because there's no real definition for what encoding the headers are in.
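
As a rough sketch of the header-parsing step, the hypothetical function below pulls the charset parameter out of a Content-Type value such as the one Google returned. It is deliberately naive: it splits on ';' and so would mishandle a quoted parameter value containing a semicolon, and it says nothing about pages that omit the parameter.

```rust
/// Hypothetical helper: extracts the `charset` parameter from a Content-Type
/// header value, lowercased. Returns None if no charset is declared.
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // the first segment is the media type, e.g. "text/html"
        .filter_map(|param| {
            let mut parts = param.splitn(2, '=');
            let name = parts.next()?.trim();
            let val = parts.next()?.trim().trim_matches('"');
            if name.eq_ignore_ascii_case("charset") && !val.is_empty() {
                Some(val.to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}
```

For "text/html; charset=ISO-8859-1" this returns Some("iso-8859-1"); for a bare "text/html" it returns None, in which case the encoding has to be determined some other way.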

Once you have found the correct encoding, you can use a crate such as encoding to properly transform the result into UTF-8, if the result is even text! Remember that HTTP can return binary files such as images.
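
For the specific case of ISO-8859-1, the conversion is simple enough to sketch without any crate, because each ISO-8859-1 byte has the same value as its Unicode code point (U+0000 to U+00FF). The function name below is made up for illustration, and this is not a substitute for a real encoding library: it handles only this one encoding, and browsers commonly treat the ISO-8859-1 label as windows-1252, which differs in the 0x80 to 0x9F range.

```rust
/// Hypothetical helper: decodes strict ISO-8859-1 bytes into a Rust String
/// (which is UTF-8) by mapping each byte to the code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

With this, latin1_to_string(b"caf\xe9") produces "café", whereas the same bytes make read_to_string fail.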

Community
Shepmaster