I'm trying to read a file one line at a time in Rust, and started by following the advice in this question:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        match line {
            Ok(line) => println!("Ok: {}", line),
            Err(error) => println!("Err: {}", error),
        }
    }
    Ok(())
}

However, I have non-UTF-8 files. Python's chardet library (chardet.universaldetector) tells me this one is ISO-8859-1:

Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

Out of the box, Rust is unable to interpret the lines with non-UTF8 characters:

$ ./target/release/main1 
Ok: Cuba
Err: stream did not contain valid UTF-8
Ok: Cyprus
Ok: Czech Republic
Err: stream did not contain valid UTF-8
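
The failing lines are the ones containing bytes above 0x7F: `BufRead::lines` validates each line as UTF-8, and in ISO-8859-1 a character such as 'ç' is the single byte 0xE7, which is not a complete UTF-8 sequence on its own. A minimal in-memory sketch of the same behaviour, assuming that byte value for 'ç' (the helper name `utf8_line_results` is hypothetical and only used for illustration):

```rust
use std::io::{BufRead, Cursor};

/// Reports, for each line in `bytes`, whether `BufRead::lines`
/// managed to decode it as UTF-8.
fn utf8_line_results(bytes: &[u8]) -> Vec<bool> {
    Cursor::new(bytes).lines().map(|line| line.is_ok()).collect()
}

fn main() {
    // ISO-8859-1 bytes: the 'ç' in "Curaçao" is the single byte 0xE7.
    let data = b"Cuba\nCura\xE7ao\nCyprus\n";
    // Only the second line contains a byte that is invalid as UTF-8.
    println!("{:?}", utf8_line_results(data));
}
```

As in the output above, the iterator keeps going after a failed line, so only the affected lines are lost.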

So I tried the encoding_rs_io library. I'm using Windows 1252 here instead of ISO-8859-1, but it seems to work with this data:

use std::error::Error;
use std::fs::File;
use std::io::Read;

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    let mut reader = DecodeReaderBytesBuilder::new().encoding(Some(WINDOWS_1252)).build(file);
    let mut buffer = vec![];
    reader.read_to_end(&mut buffer)?;
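    // The decoder emits UTF-8, so this conversion is not expected to fail;
    // `?` propagates the error instead of panicking if it ever does.
    println!("{}", String::from_utf8(buffer)?);
    Ok(())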
}

This successfully decodes the non-ASCII characters:

$ ./target/release/main2 
Cuba
Curaçao
Cyprus
Czech Republic
Côte d'Ivoire

However, the resulting reader does not have a lines() method, so I cannot read one line at a time. I note that the ripgrep project uses this library to decode non-UTF-8 files, and I've stepped into its source code in a debugger. As far as I can tell, it does its own hand-rolled CR/LF detection.

Surely the task of reading non-UTF-8 files one line at a time in Rust has been solved already. Do I really need to reinvent the wheel on this? Any help would be gratefully appreciated!

Huw Walters
  • `DecodeReaderBytes` [implements `io::Read`](https://docs.rs/encoding_rs_io/0.1.7/encoding_rs_io/struct.DecodeReaderBytes.html#implementations), so you should be able to wrap it in a [`std::io::BufReader`](https://doc.rust-lang.org/std/io/struct.BufReader.html) and use [its `lines` method](https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines). – Jmb Sep 24 '20 at 06:51
  • Awesome, that works, thanks. If you would like to put that in an answer, I'll upvote it. – Huw Walters Sep 24 '20 at 07:24

1 Answer

DecodeReaderBytes implements io::Read, so you should be able to wrap it in a std::io::BufReader and use its lines method:

use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};

use encoding_rs::WINDOWS_1252;
use encoding_rs_io::DecodeReaderBytesBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let file = File::open("countries.txt")?;
    // Transcode Windows-1252 bytes to UTF-8 on the fly, then buffer the decoded stream.
    let reader = BufReader::new(
        DecodeReaderBytesBuilder::new()
            .encoding(Some(WINDOWS_1252))
            .build(file),
    );
    for line in reader.lines() {
        // `lines()` yields `io::Result<String>`, so propagate any I/O error.
        println!("{}", line?);
    }
    Ok(())
}
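
If adding a crate is not an option, strict ISO-8859-1 can also be handled with the standard library alone, because every ISO-8859-1 byte maps to the Unicode code point with the same value. The following is only a sketch under that assumption: it does not cover Windows-1252, which assigns different characters to bytes 0x80 to 0x9F, and the helper names `decode_latin1` and `latin1_lines` are hypothetical.

```rust
use std::io::{self, BufRead, Cursor};

/// Decodes ISO-8859-1 bytes: each byte becomes the Unicode
/// code point with the same numeric value (U+0000 to U+00FF).
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Splits `reader` on b'\n' and decodes each chunk as ISO-8859-1.
fn latin1_lines<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader.split(b'\n').map(|chunk| {
        chunk.map(|mut bytes| {
            // Drop a trailing '\r' so CRLF files behave like `lines()`.
            if bytes.last() == Some(&b'\r') {
                bytes.pop();
            }
            decode_latin1(&bytes)
        })
    })
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the file; a `BufReader<File>` can be passed the same way.
    let data = Cursor::new(b"Cuba\nCura\xE7ao\r\nC\xF4te d'Ivoire\n".to_vec());
    for line in latin1_lines(data) {
        println!("{}", line?);
    }
    Ok(())
}
```

Splitting on the raw byte is safe here because ISO-8859-1 is a single-byte encoding, so 0x0A can only ever be a line feed. For multi-byte or less regular encodings, the `encoding_rs_io` approach above remains the more general choice.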
vallentin
Jmb