I'm wondering if there is a canonical way to read Unicode files in Rust. Most examples read a line at a time, because in a well-formed UTF-8 file a line consists of whole 'characters' (Unicode Scalar Values).
Here's a simple example of reading a file as UTF-8, but it only works if 'one byte' == 'one character', which isn't guaranteed.
    let mut chr: char;
    let f = File::open(filename).expect("File not found");
    let mut rdr = BufReader::new(f);
    loop {
        let mut x: [u8; 1] = [0];
        let n = rdr.read(&mut x);
        let bytes = utf8_char_width(x[0]); // unstable feature
        chr = x[0] as char; // only correct for single-byte (ASCII) characters
        ...
I'm new to Rust, but the only thing I could find that would help me read a full character was utf8_char_width, which is marked unstable.
Does Rust have a facility such that I can open a file as (Unicode) 'text' and it will read/respect the BOM (if available) and allow me to iterate over the contents of that file, returning a Rust char type for each 'character' (Unicode Scalar Value) found?
Am I making something easy hard? Again, I'm new to Rust so everything is hard to me currently :-D
Update (in response to comments)
A "Unicode file" is a file containing only Unicode-encoded data. I'd like to be able to read Unicode-encoded files without worrying about the various details of character size or endianness. Since Rust uses a four-byte (u32) 'char', I'd like to be able to read the file one character at a time, without worrying about line length (or its allocation).
While UTF-8 is byte oriented, the Unicode standard does define a BOM for it, as well as saying that the default (no BOM) is UTF-8.
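To make the BOM part concrete, here is a minimal sketch of how a leading UTF-8 BOM (the bytes EF BB BF) might be skipped on stable Rust. The helper name `skip_utf8_bom` is made up for illustration, and the sketch assumes the first call to `fill_buf` returns at least three bytes when the input has that many, which is typical for a `BufReader` over a file but not guaranteed for every reader. It does not handle UTF-16 or UTF-32 BOMs.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: consume a leading UTF-8 BOM (EF BB BF) if present.
/// Returns true when a BOM was found and skipped.
fn skip_utf8_bom<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    const BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];
    // fill_buf exposes the buffered bytes without consuming them.
    // Assumption: the buffer holds at least 3 bytes if the input does.
    let has_bom = reader.fill_buf()?.starts_with(&BOM);
    if has_bom {
        // Advance past the BOM so later reads start at the first real character.
        reader.consume(BOM.len());
    }
    Ok(has_bom)
}
```

Calling this once, right after wrapping the file in a `BufReader`, would leave the reader positioned at the first character whether or not a BOM was present.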
It is somewhat counter-intuitive (I'm new to Rust) that the char type is UTF-32 while a string is (effectively) a vector of u8. However, I can see the reasoning behind forcing the developer to be explicit regarding 'byte' or 'char', as I've seen a lot of bugs caused by people assuming that those are the same size. Clearly, there is an iterator to return chars from a string, so the code to handle the UTF-8 -> UTF-32 conversion is in place; it just needs to take its input from a file stream rather than a memory vector. Perhaps as I learn more a solution will present itself.
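As a rough illustration of that idea, the following sketch decodes one char at a time from any reader using only stable standard-library calls. The names `Chars` and `utf8_len` are invented for this example and are not part of std; `utf8_len` is a hand-written stand-in for the unstable `utf8_char_width`. Validation of the full sequence is delegated to `std::str::from_utf8`. One known limitation: after an invalid sequence the reader may already have consumed bytes belonging to the next character, so this sketch reports the error rather than trying to resynchronize.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: yields one `char` at a time from a UTF-8 byte stream.
struct Chars<R: Read> {
    inner: R,
}

/// Sequence length implied by a UTF-8 lead byte, or None if the byte
/// cannot start a sequence (a continuation byte or an invalid value).
fn utf8_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

impl<R: Read> Iterator for Chars<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        // Read the lead byte; a clean end of input here ends the iteration.
        loop {
            match self.inner.read(&mut buf[..1]) {
                Ok(0) => return None,
                Ok(_) => break,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }
        let len = match utf8_len(buf[0]) {
            Some(n) => n,
            None => {
                let msg = "invalid UTF-8 lead byte";
                return Some(Err(io::Error::new(io::ErrorKind::InvalidData, msg)));
            }
        };
        // Read the remaining bytes; a truncated sequence surfaces as UnexpectedEof.
        if let Err(e) = self.inner.read_exact(&mut buf[1..len]) {
            return Some(Err(e));
        }
        // from_utf8 checks continuation bytes, overlong forms and surrogates.
        match std::str::from_utf8(&buf[..len]) {
            Ok(s) => s.chars().next().map(Ok),
            Err(_) => {
                let msg = "invalid UTF-8 sequence";
                Some(Err(io::Error::new(io::ErrorKind::InvalidData, msg)))
            }
        }
    }
}
```

Something like `Chars { inner: BufReader::new(f) }` could then be used in a `for` loop, with each item being an `io::Result<char>`. Wrapping the file in a `BufReader` matters here, since the adapter issues many small reads.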