
I'm wondering if there is a canonical way to read Unicode files in Rust. Most examples read a line at a time, because if the file is well-formed UTF-8, a line should consist of whole 'characters' (Unicode Scalar Values).

Here's a simple example of reading a file as UTF-8, but it only works if 'one byte' == 'one character', which isn't guaranteed.

use std::fs::File;
use std::io::{BufReader, Read};

let mut chr: char;
let f = File::open(filename).expect("File not found");
let mut rdr = BufReader::new(f);
loop {
    let mut x: [u8; 1] = [0];
    let n = rdr.read(&mut x); // reads a single byte
    let bytes = utf8_char_width(x[0]); // unstable feature
    chr = x[0] as char; // only right when the byte is ASCII
    ...

I'm new to Rust, but the only thing I could find that would help me read a full character was the utf8_char_width, which is marked unstable.
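To make the problem concrete, here is a minimal sketch on stable Rust. The helper `utf8_width` is a hypothetical hand-written stand-in for the unstable `utf8_char_width` (it is not a std function); it derives the sequence length from the lead byte. The `main` function shows why casting each byte to `char` goes wrong for anything outside ASCII.

```rust
/// Hypothetical stand-in for the unstable `utf8_char_width`.
/// Returns the length in bytes of the UTF-8 sequence that starts with
/// `first`, or 0 if `first` cannot start a sequence.
fn utf8_width(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1, // ASCII
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => 0, // continuation byte or invalid lead byte
    }
}

fn main() {
    let bytes = "é".as_bytes(); // two bytes: 0xC3 0xA9

    // Casting byte by byte treats each byte as its own code point.
    let wrong: String = bytes.iter().map(|&b| b as char).collect();
    println!("{}", wrong); // two characters, neither of them 'é'

    // Using the lead byte to size the sequence recovers the real character.
    let n = utf8_width(bytes[0]);
    let right = std::str::from_utf8(&bytes[..n]).expect("valid UTF-8");
    println!("{}", right);
}
```

The width table only sizes the sequence; `std::str::from_utf8` still does the real validation, so overlong or truncated sequences are rejected there.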

Does Rust have a facility such that I can open a file as (Unicode) 'text' and it will read/respect the BOM (if available) and allow me to iterate over the contents of that file returning a Rust char type for each 'character' (Unicode Scalar Value) found?

Am I making something easy hard? Again, I'm new to Rust so everything is hard to me currently :-D
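For files that fit comfortably in memory, one simple approach is to let the standard library do the decoding. The sketch below assumes the file is UTF-8; `strip_bom` and `read_chars` are hypothetical helper names, and the temporary file name is made up for the demo. `std::fs::read_to_string` validates the bytes as UTF-8, and `str::chars` then yields one `char` per Unicode Scalar Value.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Drops a leading U+FEFF (the BOM as it appears once decoded), if any.
fn strip_bom(text: &str) -> &str {
    text.strip_prefix('\u{FEFF}').unwrap_or(text)
}

/// Reads a whole UTF-8 file and returns its scalar values.
/// Fails with an `InvalidData` error if the file is not valid UTF-8.
fn read_chars(path: &Path) -> io::Result<Vec<char>> {
    let text = fs::read_to_string(path)?;
    Ok(strip_bom(&text).chars().collect())
}

fn main() -> io::Result<()> {
    // Write a small demo file that starts with a BOM.
    let path = std::env::temp_dir().join("unicode_demo.txt");
    fs::write(&path, "\u{FEFF}héllo")?;

    for c in read_chars(&path)? {
        println!("{} U+{:04X}", c, c as u32);
    }
    fs::remove_file(&path)
}
```

This trades memory for simplicity: the whole file is held as a `String`, so it does not address the "don't allocate a whole line" concern for very large inputs.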

Update (in response to comments)

  • A "Unicode file" is a file containing only Unicode encoded data. I'd like to be able to read Unicode encoded files, without worrying about the various details of character size or endianness. Since Rust uses a four byte (u32) 'char' I'd like to be able to read the file one character at a time, not worrying about line length (or its allocation).

  • While UTF8 is byte oriented, the Unicode standard does define a BOM for it as well as saying that the default (no BOM) is UTF8.
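As a rough illustration of the BOM handling described in the points above, the following sketch inspects the first bytes of a buffer. The `Encoding` enum and the functions `sniff_bom` and `decode_utf16le` are hypothetical names, not a std API. Falling back to UTF-8 when no BOM is present follows the assumption stated in this question. `char::decode_utf16` is the standard-library function used for the UTF-16 case.

```rust
#[derive(Debug, PartialEq)]
enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

/// Returns the encoding suggested by a BOM and the BOM length in bytes.
/// With no BOM, assumes UTF-8 (an assumption taken from the question).
fn sniff_bom(head: &[u8]) -> (Encoding, usize) {
    // UTF-32 LE is tested before UTF-16 LE because its BOM starts with
    // the same two bytes (FF FE). A UTF-16 LE file whose first character
    // is NUL would be misdetected here; that ambiguity is inherent.
    if head.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        (Encoding::Utf32Le, 4)
    } else if head.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        (Encoding::Utf32Be, 4)
    } else if head.starts_with(&[0xEF, 0xBB, 0xBF]) {
        (Encoding::Utf8, 3)
    } else if head.starts_with(&[0xFF, 0xFE]) {
        (Encoding::Utf16Le, 2)
    } else if head.starts_with(&[0xFE, 0xFF]) {
        (Encoding::Utf16Be, 2)
    } else {
        (Encoding::Utf8, 0)
    }
}

/// Decodes little-endian UTF-16 bytes, replacing unpaired surrogates
/// with U+FFFD. A trailing odd byte is ignored.
fn decode_utf16le(body: &[u8]) -> String {
    let units = body
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // "hi" encoded as UTF-16 LE, preceded by its BOM.
    let data = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];
    let (encoding, bom_len) = sniff_bom(&data);
    println!("{:?}, BOM length {}", encoding, bom_len);
    if encoding == Encoding::Utf16Le {
        println!("{}", decode_utf16le(&data[bom_len..]));
    }
}
```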

It is somewhat counter-intuitive (I'm new to Rust) that the char type is UTF32 while a string is (effectively) a vector of u8. However, I can see the reasoning behind forcing the developer to be explicit regarding 'byte' or 'char', as I've seen a lot of bugs caused by people assuming that those are the same size. Clearly, there is an iterator to return chars from a string, so the code to handle the UTF8 -> UTF32 conversion is in place; it just needs to take its input from a file stream rather than a memory vector. Perhaps as I learn more a solution will present itself.
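One way the "chars from a stream" idea could look is sketched below. `Utf8Chars` is a hypothetical adapter, not something std provides: it reads the lead byte, works out the sequence length, reads the remaining bytes, and lets `std::str::from_utf8` validate the result. The demo reads from an in-memory byte slice for brevity; a `BufReader<File>` would work the same way, since the adapter only requires `Read`.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical adapter: yields one `char` per UTF-8 sequence in `inner`.
struct Utf8Chars<R: Read> {
    inner: R,
}

fn invalid(msg: &'static str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

impl<R: Read> Iterator for Utf8Chars<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = [0u8; 4];
        // Lead byte: hitting end of input here is a clean end of iteration.
        if let Err(e) = self.inner.read_exact(&mut buf[..1]) {
            return if e.kind() == io::ErrorKind::UnexpectedEof {
                None
            } else {
                Some(Err(e))
            };
        }
        let len = match buf[0] {
            0x00..=0x7F => 1,
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Some(Err(invalid("invalid UTF-8 lead byte"))),
        };
        // Continuation bytes: end of input here means a truncated sequence.
        if let Err(e) = self.inner.read_exact(&mut buf[1..len]) {
            return Some(Err(e));
        }
        match std::str::from_utf8(&buf[..len]) {
            Ok(s) => s.chars().next().map(Ok),
            Err(_) => Some(Err(invalid("invalid UTF-8 sequence"))),
        }
    }
}

fn main() -> io::Result<()> {
    let data = "\u{FEFF}aé€😀".as_bytes();
    let mut chars = Utf8Chars { inner: BufReader::new(data) }.peekable();

    // Skip a leading BOM if the stream starts with one.
    if matches!(chars.peek(), Some(Ok('\u{FEFF}'))) {
        chars.next();
    }
    for c in chars {
        print!("{} ", c?);
    }
    println!();
    Ok(())
}
```

Wrapping the source in a `BufReader` matters here, because the adapter issues many small reads. After the first error the stream position is no longer aligned to a character boundary, so a caller should stop iterating rather than continue.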

Dweeberly
  • Also see https://stackoverflow.com/questions/35385703/read-file-character-by-character-in-rust, though it doesn't mention BOM handling. – SirDarius Jan 12 '22 at 23:28
  • What is a "Unicode file," precisely? Also, what problem are you trying to solve at a high level? – BurntSushi5 Jan 12 '22 at 23:34
  • The only reasonable thing to do with a BOM in UTF-8 is skip it - endianness doesn't matter for UTF-8 (it does for UTF-16 and 32, but it's unclear if you are using those). If you want to iterate on each code point (which is _not the same as a grapheme_) use `str.chars()`. – Colonel Thirty Two Jan 13 '22 at 00:11
  • The Rust Book contains some guides on String processing [Here](https://doc.rust-lang.org/book/ch08-02-strings.html). In case you are working on Graphemes (see guide), look into using [this crate](https://crates.io/crates/unicode-segmentation) – Achyut-BK Jan 13 '22 at 01:43
