I am trying to implement streaming of UTF-8 characters from a file. This is what I've got so far, please excuse the ugly code for now.
use std::fs::File;
use std::io;
use std::io::BufRead;
use std::str;
fn main() -> io::Result<()> {
let mut reader = io::BufReader::with_capacity(100, File::open("utf8test.txt")?);
loop {
let mut consumed = 0;
{
let buf = reader.fill_buf()?;
println!("buf len: {}", buf.len());
match str::from_utf8(&buf) {
Ok(s) => {
println!("====\n{}", s);
consumed = s.len();
}
Err(err) => {
if err.valid_up_to() == 0 {
println!("1. utf8 decoding failed!");
} else {
match str::from_utf8(&buf[..err.valid_up_to()]) {
Ok(s) => {
println!("====\n{}", s);
consumed = s.len();
}
_ => println!("2. utf8 decoding failed!"),
}
}
}
}
}
if consumed == 0 {
break;
}
reader.consume(consumed);
println!("consumed {} bytes", consumed);
}
Ok(())
}
I have a test file with a multibyte character at offset 98 which fails to decode as it does not fit completely into my (arbitrarily-sized) 100 byte buffer. That's fine, I just ignore it and decode what is valid up to the start of that character.
The problem is that after calling consume(98)
on the BufReader
, the next call to fill_buf()
only returns 2 bytes... it seems to have not bothered to read any more bytes into the buffer. I don't understand why. Maybe I have misinterpreted the documentation.
Here is the sample output:
buf len: 100
====
UTF-8 encoded sample plain-text file
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
consumed 98 bytes
buf len: 2
1. utf8 decoding failed!
It would be nice if from_utf8()
would return the partially decoded string and the position of the decoding error so I don't have to call it twice whenever this happens, but there doesn't seem to be such a function in the standard library (that I am aware of).