Is there a canonical way to read a unicode file one 'char' at a time, respecting BOM

Question

I'm wondering if there is a canonical way to read Unicode files in Rust. Most examples read a line at a time, because if the file is well-formed UTF-8, a line should consist of whole 'characters' (Unicode Scalar Values).

Here's a simple example of reading a file as UTF-8. It only works if 'one byte' == 'one character', which isn't guaranteed.

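use std::fs::File;
use std::io::{BufReader, Read};

let f = File::open(filename).expect("File not found");
let mut rdr = BufReader::new(f);
loop {
    let mut x: [u8; 1] = [0];
    // read() returns the number of bytes read; 0 means end of file
    let n = rdr.read(&mut x).expect("Read failed");
    if n == 0 { break; }
    let bytes = utf8_char_width(x[0]); // unstable feature
    // Only correct for ASCII: a single byte is treated as a whole character
    let chr = x[0] as char;
    ...
}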

I'm new to Rust, but the only thing I could find that would help me read a full character was the utf8_char_width, which is marked unstable.
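For reference, the length of a UTF-8 sequence can be derived from its first byte on stable Rust with a plain `match`. The following is a minimal sketch, not the standard library's implementation; `utf8_len_from_lead` is a hypothetical name chosen for illustration.

```rust
/// Hypothetical helper (not a std API): number of bytes in the UTF-8
/// sequence that starts with `b`, or None if `b` cannot start a sequence.
fn utf8_len_from_lead(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // 0xC0 and 0xC1 would be overlong encodings
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4), // above 0xF4 would exceed U+10FFFF
        _ => None,              // continuation byte or invalid lead byte
    }
}

fn main() {
    for s in ["a", "é", "€", "😀"] {
        let lead = s.as_bytes()[0];
        println!("{} -> {:?}", s, utf8_len_from_lead(lead));
    }
}
```

This only classifies the lead byte. The continuation bytes still need to be validated, for example with `std::str::from_utf8`.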

Does Rust have a facility such that I can open a file as (Unicode) 'text' and it will read/respect the BOM (if available) and allow me to iterate over the contents of that file returning a Rust char type for each 'character' (Unicode Scalar Value) found?

Am I making something easy hard? Again, I'm new to Rust so everything is hard to me currently :-D

Update (in response to comments)

A "Unicode file" is a file containing only Unicode-encoded data. I'd like to be able to read Unicode-encoded files without worrying about the various details of character size or endianness. Since Rust uses a four-byte (u32) 'char', I'd like to be able to read the file one character at a time, not worrying about line length (or its allocation).
While UTF-8 is byte oriented, the Unicode standard does define a BOM for it, as well as saying that the default (no BOM) is UTF-8.

It is somewhat counter-intuitive (I'm new to Rust) that the char type is UTF-32 while a string is (effectively) a vector of u8. However, I can see the reasoning behind forcing the developer to be explicit regarding 'byte' or 'char', as I've seen a lot of bugs caused by people assuming that those are the same size. Clearly, there is an iterator to return chars from a string, so the code to handle UTF-8 -> UTF-32 is in place; it just needs to take its input from a file stream rather than a memory vector. Perhaps as I learn more a solution will present itself.
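To make the idea of decoding from a stream concrete, here is a rough sketch of an adapter that pulls bytes from any `Read` and yields one `char` at a time, skipping a leading BOM. `Utf8Chars` and `read_char` are hypothetical names, not standard library types; the sketch assumes UTF-8 input only and leans on `std::str::from_utf8` for validation, fetching one more byte whenever the sequence is still incomplete.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical adapter (not a std type): decodes chars from a UTF-8 byte stream.
struct Utf8Chars<R: Read> {
    reader: R,
    at_start: bool,
}

impl<R: Read> Utf8Chars<R> {
    fn new(reader: R) -> Self {
        Utf8Chars { reader, at_start: true }
    }

    /// Reads one scalar value; Ok(None) means a clean end of input.
    fn read_char(&mut self) -> io::Result<Option<char>> {
        let mut buf = [0u8; 4];
        let mut len = 0;
        loop {
            match self.reader.read_exact(&mut buf[len..len + 1]) {
                Ok(()) => len += 1,
                // EOF before the first byte of a sequence is a normal end.
                Err(e) if len == 0 && e.kind() == io::ErrorKind::UnexpectedEof => {
                    return Ok(None)
                }
                // Any other error, including EOF mid-sequence, is reported.
                Err(e) => return Err(e),
            }
            match std::str::from_utf8(&buf[..len]) {
                Ok(s) => return Ok(s.chars().next()),
                // error_len() == None means "incomplete so far": get another byte.
                Err(e) if e.error_len().is_none() && len < 4 => continue,
                Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
            }
        }
    }
}

impl<R: Read> Iterator for Utf8Chars<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        let first = std::mem::replace(&mut self.at_start, false);
        match self.read_char() {
            // Skip U+FEFF only when it is the very first character.
            Ok(Some('\u{FEFF}')) if first => self.read_char().transpose(),
            other => other.transpose(),
        }
    }
}

fn main() -> io::Result<()> {
    // In-memory stand-in for a file; a File wrapped in BufReader works the same way.
    let data = "\u{FEFF}hé€😀".as_bytes();
    for c in Utf8Chars::new(BufReader::new(data)) {
        println!("{:?}", c?);
    }
    Ok(())
}
```

Wrapping the source in a `BufReader` matters here because the sketch issues one small read per byte. It does not detect or decode UTF-16 or UTF-32; those would need a separate decoding step.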

Also see https://stackoverflow.com/questions/35385703/read-file-character-by-character-in-rust, though it doesn't mention BOM handling. — SirDarius, Jan 12 '22 at 23:28
What is a "Unicode file," precisely? Also, what problem are you trying to solve at a high level? — BurntSushi5, Jan 12 '22 at 23:34
The only reasonable thing to do with a BOM in UTF-8 is skip it: endianness doesn't matter for UTF-8. (It does for UTF-16 and UTF-32, but it's unclear if you are using those.) If you want to iterate over each code point (which is _not the same as a grapheme_), use `str.chars()`. — Colonel Thirty Two, Jan 13 '22 at 00:11
The Rust Book contains some guides on String processing [Here](https://doc.rust-lang.org/book/ch08-02-strings.html). In case you are working on Graphemes (see guide), look into using [this crate](https://crates.io/crates/unicode-segmentation) — Achyut-BK, Jan 13 '22 at 01:43
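
The suggestion in the comments (skip a UTF-8 BOM, then iterate with `str::chars`) might look like the sketch below when the file is small enough to read into memory. `read_chars` and the temporary file name are hypothetical, introduced only for illustration, and the sketch assumes the file is UTF-8.

```rust
use std::path::Path;
use std::{env, fs, io};

/// Hypothetical helper: reads a UTF-8 file into memory and returns its
/// chars (Unicode Scalar Values), dropping a leading BOM if present.
fn read_chars(path: &Path) -> io::Result<Vec<char>> {
    // read_to_string fails with InvalidData if the file is not valid UTF-8.
    let text = fs::read_to_string(path)?;
    // In UTF-8 the BOM is the three bytes EF BB BF, which decode to U+FEFF.
    let body = text.strip_prefix('\u{FEFF}').unwrap_or(&text);
    Ok(body.chars().collect())
}

fn main() -> io::Result<()> {
    // Write a small demo file that starts with a BOM.
    let path = env::temp_dir().join("bom_demo.txt");
    fs::write(&path, "\u{FEFF}héllo")?;

    for c in read_chars(&path)? {
        println!("{:?}", c);
    }

    fs::remove_file(&path)
}
```

This trades memory for simplicity: the whole file is loaded before iteration starts, so it does not address the allocation concern raised in the question.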


0 Answers