
Is there an idiomatic way to process a file one character at a time in Rust?

This seems to be roughly what I'm after:

let mut f = io::BufReader::new(try!(fs::File::open("input.txt")));

for c in f.chars() {
    println!("Character: {}", c.unwrap());
}

But Read::chars is still unstable as of Rust v1.6.0.

I considered using Read::read_to_string, but the file may be large and I don't want to read it all into memory.

Lukas Kalbertodt
Tim McLean
  • For some types of text files: `f.lines().flat_map(|l| l.chars())` ... but this is not really a good solution. – Lukas Kalbertodt Feb 13 '16 at 22:07
  • Have you considered just copying the implementation in the meantime? It's only ~100 lines and means your code will be trivial to upgrade if `chars` stabilizes as-is. – Veedrac Feb 14 '16 at 04:38

3 Answers

Let's compare 4 approaches.

1. Read::chars

You could copy the implementation of Read::chars, but it is marked unstable with the note

the semantics of a partial read/write of where errors happen is currently unclear and may change

so some care must be taken. Even so, this seems to be the best approach.
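
To give an idea of what such a copy involves, here is a minimal sketch of a char iterator over any reader. It is not the standard library's implementation: the names Chars, chars and invalid are made up for illustration, and the behaviour after an error (iteration simply resumes at the next byte) is an assumption of this sketch, which is precisely the part the unstable note calls unclear.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not the std implementation): yields one `char` at a
/// time from any reader, decoding UTF-8 by hand.
pub struct Chars<R> {
    inner: R,
}

/// Hypothetical constructor for the iterator above.
pub fn chars<R: Read>(inner: R) -> Chars<R> {
    Chars { inner }
}

fn invalid() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "stream is not valid UTF-8")
}

impl<R: Read> Iterator for Chars<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        // Read the leading byte; a clean end of input ends the iteration.
        match self.inner.read_exact(&mut buf[..1]) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return None,
            Err(e) => return Some(Err(e)),
        }
        // The leading byte tells us how many bytes the sequence occupies.
        let len = match buf[0] {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF7 => 4,
            _ => return Some(Err(invalid())),
        };
        // Read the continuation bytes, if any (a truncated sequence is an error).
        if let Err(e) = self.inner.read_exact(&mut buf[1..len]) {
            return Some(Err(e));
        }
        // Let std do the strict validation (overlong forms, surrogates, ...).
        match std::str::from_utf8(&buf[..len]) {
            Ok(s) => Some(Ok(s.chars().next().unwrap())),
            Err(_) => Some(Err(invalid())),
        }
    }
}
```

Because this sketch issues very small reads, it is best wrapped around a BufReader, for example chars(BufReader::new(file)), so that each call does not turn into a system call.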

2. flat_map

The flat_map alternative does not compile:

use std::io::{BufRead, BufReader};
use std::fs::File;

pub fn main() {
    let mut f = BufReader::new(File::open("input.txt").expect("open failed"));

    for c in f.lines().flat_map(|l| l.expect("lines failed").chars()) {
        println!("Character: {}", c);
    }
}

The problem is that chars borrows from the string, but the String returned by l.expect("lines failed") lives only inside the closure, so the compiler rejects the code with the error borrowed value does not live long enough.
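
One way to make the flat_map form compile, sketched here as an assumption rather than a recommendation, is to collect each line's chars into an owned Vec<char> before returning from the closure. The function name for_each_char is hypothetical. Note that this costs an extra allocation per line and, because lines() strips line terminators, the newline characters are never yielded.

```rust
use std::io::BufRead;

/// Hypothetical helper: applies `f` to every char of `reader`, line by line.
pub fn for_each_char<R: BufRead>(reader: R, f: impl FnMut(char)) {
    reader
        .lines()
        // Collect each line's chars into an owned Vec so that nothing
        // borrows from the String that is dropped when the closure returns.
        .flat_map(|l| l.expect("lines failed").chars().collect::<Vec<char>>())
        .for_each(f);
}
```

With a file, this could be called as for_each_char(BufReader::new(file), |c| println!("Character: {}", c)).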

3. Nested for

This code

use std::io::{BufRead, BufReader};
use std::fs::File;

pub fn main() {
    // `lines()` takes the reader by value, so no `mut` is needed here
    let f = BufReader::new(File::open("input.txt").expect("open failed"));

    for line in f.lines() {
        for c in line.expect("lines failed").chars() {
            println!("Character: {}", c);
        }
    }
}

works, but it allocates a new String for each line. Besides, if the input file contains no line breaks, the whole file is loaded into memory.

4. BufRead::read_until

A more memory-efficient alternative to approach 3 is to use BufRead::read_until with a single buffer that is reused for every line:

use std::io::{BufRead, BufReader};
use std::fs::File;

pub fn main() {
    let mut f = BufReader::new(File::open("input.txt").expect("open failed"));

    let mut buf = Vec::<u8>::new();
    while f.read_until(b'\n', &mut buf).expect("read_until failed") != 0 {
        // this moves the ownership of the read data to s
        // there is no allocation
        let s = String::from_utf8(buf).expect("from_utf8 failed");
        for c in s.chars() {
            println!("Character: {}", c);
        }
        // this returns the ownership of the read data to buf
        // there is no allocation
        buf = s.into_bytes();
        buf.clear();
    }
}
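
A similar effect can probably be had with BufRead::read_line, which appends to a String directly and reports invalid UTF-8 as an error, so the round trip through Vec<u8> is not needed. This is a sketch under that assumption, with a hypothetical function name. Like the read_until version, it still buffers one whole line at a time, so it does not help when the file has no line breaks.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: calls `f` for every char, reusing a single String.
pub fn for_each_char_reusing<R: BufRead>(
    mut reader: R,
    mut f: impl FnMut(char),
) -> io::Result<()> {
    let mut line = String::new();
    // read_line appends to `line` (newline included) and returns Ok(0)
    // once the end of the input is reached.
    while reader.read_line(&mut line)? != 0 {
        for c in line.chars() {
            f(c);
        }
        // Drop the contents but keep the allocation for the next line.
        line.clear();
    }
    Ok(())
}
```

Unlike lines(), read_line keeps the line terminator, so the callback also sees the '\n' characters.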
malbarbo

I cannot use lines() because my file could be a single line that is gigabytes in size. This is an improvement on @malbarbo's recommendation of copying Read::chars from an old version of Rust: the utf8-chars crate already adds .chars() to BufRead for you.

Inspecting their repository, it doesn't look like they load more than 4 bytes at a time.

Your code will look the same as it did before Rust removed Read::chars:

use std::io::stdin;
use utf8_chars::BufReadCharsExt;

fn main() {
    for c in stdin().lock().chars().map(|x| x.unwrap()) {
        println!("{}", c);
    }
}

Add the following to your Cargo.toml:

[dependencies]
utf8-chars = "1.0.0"
Shepmaster
joseph
  • If there is a common letter, you could read until that character. For example, if you have a line which is 2 GB of A, B, C and D. You could read until an "A" – Mark S. Aug 05 '22 at 15:13

There are two solutions that make sense here.

First, you could copy the implementation of Read::chars() and use it; that would make it completely trivial to move your code over to the standard library implementation if/when it stabilizes.

On the other hand, you could simply iterate line by line (using f.lines()) and then use line.chars() on each line to get the chars. This is a little more hacky, but it will definitely work.

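If you only wanted one loop, you could use flat_map(), although a closure like |line| line.chars() will not compile as written: chars() borrows from a line that is dropped when the closure returns, so each line's chars would first have to be collected into an owned collection.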

Leonora Tindall
  • As a heads-up, it's generally OK to move useful comments to answers, but if you aren't really adding to the content, it's probably [more acceptable to mark the answer as community wiki](http://meta.stackoverflow.com/q/269913/155423). – Shepmaster May 12 '16 at 13:02
  • Sorry! I'll make sure to do that next time. – Leonora Tindall May 13 '16 at 20:44