I am trying to implement streaming of UTF-8 characters from a file. This is what I have so far; please excuse the ugly code for now.

use std::fs::File;
use std::io;
use std::io::BufRead;
use std::str;

fn main() -> io::Result<()> {
    let mut reader = io::BufReader::with_capacity(100, File::open("utf8test.txt")?);
    loop {
        let mut consumed = 0;
        {
            let buf = reader.fill_buf()?;
            println!("buf len: {}", buf.len());
            match str::from_utf8(&buf) {
                Ok(s) => {
                    println!("====\n{}", s);
                    consumed = s.len();
                }
                Err(err) => {
                    if err.valid_up_to() == 0 {
                        println!("1. utf8 decoding failed!");
                    } else {
                        match str::from_utf8(&buf[..err.valid_up_to()]) {
                            Ok(s) => {
                                println!("====\n{}", s);
                                consumed = s.len();
                            }
                            _ => println!("2. utf8 decoding failed!"),
                        }
                    }
                }
            }
        }
        if consumed == 0 {
            break;
        }
        reader.consume(consumed);
        println!("consumed {} bytes", consumed);
    }
    Ok(())
}

I have a test file with a multibyte character at offset 98 which fails to decode as it does not fit completely into my (arbitrarily-sized) 100 byte buffer. That's fine, I just ignore it and decode what is valid up to the start of that character.

The problem is that after calling consume(98) on the BufReader, the next call to fill_buf() returns only 2 bytes. It does not seem to read any more bytes into the buffer, and I don't understand why. Maybe I have misinterpreted the documentation.

Here is the sample output:

buf len: 100
====
UTF-8 encoded sample plain-text file
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
consumed 98 bytes
buf len: 2
1. utf8 decoding failed!

It would be nice if from_utf8() returned both the partially decoded string and the position of the decoding error, so that I don't have to call it twice whenever this happens, but I am not aware of such a function in the standard library.
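On newer toolchains there appears to be a safe way to get both pieces in one pass. The following is a minimal sketch, assuming Rust 1.79 or later, where `<[u8]>::utf8_chunks` is stable. `split_valid_prefix` is a hypothetical helper name chosen for illustration, not a standard library function:

```rust
/// Split `bytes` into its longest valid UTF-8 prefix and the undecoded rest.
/// Hypothetical helper; assumes Rust 1.79+ for `utf8_chunks`.
fn split_valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match bytes.utf8_chunks().next() {
        Some(chunk) => {
            // The first chunk's valid part is the prefix that decoded cleanly.
            let valid = chunk.valid();
            (valid, &bytes[valid.len()..])
        }
        // An empty input yields no chunks at all.
        None => ("", bytes),
    }
}

fn main() {
    // "ab" followed by the first two bytes of a three-byte character.
    println!("{:?}", split_valid_prefix(b"ab\xE2\x82")); // ("ab", [226, 130])
}
```

The length of the returned string plays the role of `valid_up_to()`, and no `unsafe` block or second validation pass is needed.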

Shepmaster
Luke McCarthy
  • *`from_utf8()` would return the partially decoded string and the position of the decoding error* — [It does](https://doc.rust-lang.org/std/str/struct.Utf8Error.html#method.valid_up_to). – Shepmaster Oct 04 '18 at 15:16
  • Yes it does return the position in `Utf8Error`, but not the partially decoded (valid) string. – Luke McCarthy Oct 04 '18 at 15:20
  • Then just slice the input and convert it with `from_utf8_unchecked`. – Shepmaster Oct 04 '18 at 15:21
  • Ah yes I could do that. But unfortunately that requires an unsafe block, when it would be possible without an unsafe block if it was returned by the error. – Luke McCarthy Oct 04 '18 at 15:25
  • There's an interesting blog post related to this topic: https://www.fpcomplete.com/blog/2018/07/streaming-utf8-haskell-rust – SirDarius Oct 04 '18 at 15:37

1 Answer

I encourage you to learn how to produce a Minimal, Complete, and Verifiable example. This is a valuable skill that professional programmers use to better understand problems and focus attention on the important aspects of a problem. For example, you didn't provide the actual input file, so it's very difficult for anyone to reproduce your behavior using the code you provided.

After some trial and error, I was able to reduce your problem to this code:

use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let mut reader = io::BufReader::with_capacity(100, io::repeat(b'a'));

    let a = reader.fill_buf()?.len();
    reader.consume(98);
    let b = reader.fill_buf()?.len();

    println!("{}, {}", a, b); // 100, 2

    Ok(())
}

Unfortunately for your case, this behavior is allowed by the contract of BufRead and is in fact almost required. The point of a buffered reader is to call the underlying reader as rarely as possible, so fill_buf only reads more data once the internal buffer is empty; while unconsumed bytes remain, it simply returns them. The trait cannot know how many bytes you need, so it cannot know that 2 bytes are not enough and that it should perform another read. Flipping it the other way, pretend you had consumed only 1 byte out of 100: would you want the remaining 99 bytes to be copied in memory, followed by another underlying read? That would be slower than not using a BufRead in the first place!
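To make the refill rule concrete, here is a small sketch that varies how much is consumed between two fill_buf calls. `lens_around_consume` is a hypothetical helper written for this illustration; it uses the same 100-byte BufReader over io::repeat as the reduced example:

```rust
use std::io::{self, BufRead};

/// Returns the buffer length seen before and after consuming `n` bytes.
/// Hypothetical helper for illustration only.
fn lens_around_consume(n: usize) -> io::Result<(usize, usize)> {
    let mut reader = io::BufReader::with_capacity(100, io::repeat(b'a'));
    let before = reader.fill_buf()?.len();
    reader.consume(n);
    let after = reader.fill_buf()?.len();
    Ok((before, after))
}

fn main() -> io::Result<()> {
    println!("{:?}", lens_around_consume(1)?); // (100, 99)
    println!("{:?}", lens_around_consume(98)?); // (100, 2)
    println!("{:?}", lens_around_consume(100)?); // (100, 100)
    Ok(())
}
```

Only the last call, which drains the buffer completely, causes a fresh read from the underlying reader.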

The trait also has no provision for moving the remaining bytes in the buffer to the beginning and then filling the rest of the buffer. This seems like something that could be added to the concrete BufReader, so you may wish to submit a pull request to add it.
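As a rough idea of what such a "compact, then refill" operation could look like, here is a hand-rolled sketch. `CompactingReader` and all of its methods are hypothetical names invented for this illustration; this is not how BufReader is implemented and not an API of the standard library:

```rust
use std::io::{self, Read};

/// Hypothetical buffered reader that can shift unread bytes to the front.
struct CompactingReader<R> {
    inner: R,
    buf: Vec<u8>,
    start: usize, // first unread byte
    end: usize,   // one past the last buffered byte
}

impl<R: Read> CompactingReader<R> {
    fn with_capacity(cap: usize, inner: R) -> Self {
        Self { inner, buf: vec![0; cap], start: 0, end: 0 }
    }

    /// The bytes that are buffered but not yet consumed.
    fn buffer(&self) -> &[u8] {
        &self.buf[self.start..self.end]
    }

    fn consume(&mut self, n: usize) {
        self.start = (self.start + n).min(self.end);
    }

    /// Move unread bytes to the front, then do one read into the free tail.
    /// Returns the number of new bytes; 0 means EOF or an already full buffer.
    fn refill(&mut self) -> io::Result<usize> {
        self.buf.copy_within(self.start..self.end, 0);
        self.end -= self.start;
        self.start = 0;
        let n = self.inner.read(&mut self.buf[self.end..])?;
        self.end += n;
        Ok(n)
    }
}

/// Buffer lengths after the first fill, after consuming 98, and after a refill.
fn demo() -> io::Result<(usize, usize, usize)> {
    let mut reader = CompactingReader::with_capacity(100, io::repeat(b'a'));
    reader.refill()?;
    let a = reader.buffer().len();
    reader.consume(98);
    let b = reader.buffer().len();
    reader.refill()?;
    let c = reader.buffer().len();
    Ok((a, b, c))
}

fn main() -> io::Result<()> {
    println!("{:?}", demo()?); // (100, 2, 100)
    Ok(())
}
```

The copy in `refill` is exactly the cost described above, which is why the caller, not the reader, decides when it is worth paying.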

For now, I'd recommend using Read::read_exact to read the character that straddles the end of the buffer:

use std::io::{self, BufRead, Read};

fn main() -> io::Result<()> {
    let mut reader = io::BufReader::with_capacity(100, io::repeat(b'a'));

    let a = reader.fill_buf()?.len();
    reader.consume(98);

    let mut leftover = [0u8; 4]; // a single UTF-8 character is at most 4 bytes
    // Assume we know we need 3 bytes based on domain knowledge
    reader.read_exact(&mut leftover[..3])?;

    let b = reader.fill_buf()?.len();

    println!("{}, {}", a, b); // 100, 99

    Ok(())
}
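Applied to the original UTF-8 problem, the same idea might look like the sketch below. It is one possible approach under stated assumptions, not a definitive implementation: `decode_stream`, `decode_with_capacity`, `utf8_len`, and `invalid_data` are hypothetical helpers, the decoded text is collected into a String only to keep the example short, and the "domain knowledge" about how many bytes are needed comes from the first byte of the incomplete character. Utf8Error::error_len returning None is used to tell "input ended mid-character" apart from genuinely invalid bytes.

```rust
use std::io::{self, BufRead, Read};
use std::str;

/// Length of a UTF-8 sequence judged from its first byte (hypothetical helper).
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

fn invalid_data() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "stream is not valid UTF-8")
}

/// Decode a whole stream, reading characters split across refills separately.
fn decode_stream<R: BufRead>(mut reader: R) -> io::Result<String> {
    let mut out = String::new();
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            return Ok(out); // end of input
        }
        let err = match str::from_utf8(buf) {
            Ok(s) => {
                out.push_str(s);
                let n = s.len();
                reader.consume(n);
                continue;
            }
            Err(err) => err,
        };
        let valid = err.valid_up_to();
        if valid > 0 {
            // Emit the valid prefix; re-checking it avoids an unsafe block.
            let s = str::from_utf8(&buf[..valid]).map_err(|_| invalid_data())?;
            out.push_str(s);
            reader.consume(valid);
            continue;
        }
        if err.error_len().is_some() {
            return Err(invalid_data()); // genuinely invalid bytes
        }
        // The buffer starts with an incomplete character: read exactly that
        // character, which lets the reader pull in more data as needed.
        let need = utf8_len(buf[0]).ok_or_else(invalid_data)?;
        let mut ch = [0u8; 4];
        reader.read_exact(&mut ch[..need])?;
        out.push_str(str::from_utf8(&ch[..need]).map_err(|_| invalid_data())?);
    }
}

/// Convenience wrapper: decode `bytes` through a BufReader of the given size.
fn decode_with_capacity(cap: usize, bytes: &[u8]) -> io::Result<String> {
    decode_stream(io::BufReader::with_capacity(cap, bytes))
}

fn main() -> io::Result<()> {
    // A 4-byte buffer forces several characters to straddle a refill.
    println!("{}", decode_with_capacity(4, "aé€😀z".as_bytes())?); // aé€😀z
    Ok(())
}
```

If the input ends in the middle of a character, read_exact reports the problem as an UnexpectedEof error rather than silently dropping the trailing bytes.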

Shepmaster