10

I scanned the Rust documentation for some way to convert between character encodings but did not find anything. Did I miss something?

Is it supported (directly or indirectly) by the Rust language and its standard libraries or even planned to be in the near future?

As one of the answers suggests, there is an easy solution because `u8` can be cast to (Unicode) `char`. Since Unicode is a superset of the ISO-8859-1 code points, that's a 1:1 mapping, which then encodes to multiple bytes in UTF-8, the internal encoding of Strings in Rust.

fn main() {
    println!("{}", 196u8 as char);
    println!("{}", (196u8 as char) as u8);
    println!("{}", 'Ä' as u8);
    println!("{:?}", 'Ä'.to_string().as_bytes());
    println!("{:?}", "Ä".as_bytes());
    println!("{}", 'Ä' == 196u8 as char);
}

gives:

Ä
196
196
[195, 132]
[195, 132]
true

I had not even considered that this would work!

Peter Hall
OderWat
  • Well with Rust it is a bit hard to tell what is a "standard library" and what is not as this may change on a daily basis :) – OderWat Jan 27 '15 at 12:14
  • True enough, in this case however I could see the people concerned by binary size cringing at the idea of embedding a conversion algorithm to and fro every single known character encoding. – Matthieu M. Jan 27 '15 at 12:25

2 Answers

17

Strings in Rust are Unicode (UTF-8), and Unicode code points are a superset of ISO-8859-1 characters. This specific conversion is actually trivial.

fn latin1_to_string(s: &[u8]) -> String {
    // Each ISO-8859-1 byte is numerically equal to its Unicode code point,
    // so casting to `char` is lossless.
    s.iter().map(|&c| c as char).collect()
}

We interpret each byte as a unicode codepoint and then build a String from these codepoints.
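As a quick check (a minimal sketch; 0xC4 is 'Ä' in ISO-8859-1, which UTF-8 encodes as two bytes):

```rust
fn latin1_to_string(s: &[u8]) -> String {
    s.iter().map(|&c| c as char).collect()
}

fn main() {
    let latin1 = [0x41, 0xC4, 0x42]; // "AÄB" in ISO-8859-1
    let s = latin1_to_string(&latin1);
    assert_eq!(s, "AÄB");
    // 3 chars, but 4 bytes: 'Ä' becomes the two-byte UTF-8 sequence [0xC3, 0x84].
    assert_eq!(s.chars().count(), 3);
    assert_eq!(s.len(), 4);
    println!("{}", s);
}
```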

Timmmm
barjak
    Since I got tripped up by this, be aware that "only codepoints 0 - 127 are encoded identically; code points 128 - 255 differ by becoming 2-byte sequence with UTF-8 whereas they are single bytes with Latin-1" ([source](http://stackoverflow.com/a/7048780/155423)). This means you can't simply reinterpret a slice of `u8` in ISO-8859-1 as UTF-8. – Shepmaster Jan 27 '15 at 18:55
  • 2
    Yes, "encoded into UTF-8" but the codepoints themselves are identical. This is what makes his answer the perfect solution for encoding ISO-8859-1 to UTF-8. It is just as simple as converting every ISO-8859-1 byte to char using "as char". My special case deals with ISO-8859-15 which just means that we have to convert some few chars differently. – OderWat Jan 31 '15 at 15:12
  • minor nitpick: if `s` may end prematurely with a zero byte, you need to throw a `.take_while(|c| **c != 0)` into the mix – Benni Jul 03 '23 at 21:09
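The pitfall from the comments can be shown with the standard library alone: reinterpreting raw ISO-8859-1 bytes as UTF-8 fails for code points 128–255, while mapping each byte through `char` works.

```rust
fn main() {
    // 0xC4 is 'Ä' in ISO-8859-1, but [0xC4] alone is not valid UTF-8:
    // in UTF-8, 0xC4 is a leading byte that requires a continuation byte.
    let latin1: &[u8] = &[0xC4];
    assert!(String::from_utf8(latin1.to_vec()).is_err());

    // Going through `char` converts the code point correctly.
    let s: String = latin1.iter().map(|&b| b as char).collect();
    assert_eq!(s, "Ä");
    assert_eq!(s.as_bytes(), [0xC3, 0x84]); // the UTF-8 encoding of U+00C4
}
```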
9

The standard library does not have any API for dealing with encodings. Encodings, like dates and times, are difficult to do right and need a lot of work, so they are not present in std.

The crate to deal with encodings as of now is rust-encoding. You will almost certainly find everything you need there.

Vladimir Matveev
  • Yeah... This is what we use already. I just wanted to double-check that we hadn't overlooked anything in the current standard library. I also know that there is something going on with the IO overhaul, but as far as I've read, that discussion does not involve other encodings besides Unicode representations. – OderWat Jan 27 '15 at 13:31
  • No, I don't think that encodings are a part of the I/O reimplementation. Moreover, AFAIK it is kinda hard to obtain streaming decoders/encoders like Java's `InputStreamReader`/`OutputStreamWriter` with rust-encoding, so there is definitely room for improvement. – Vladimir Matveev Jan 27 '15 at 13:34
  • "Standard library does not have any API to deal with encodings." What? It has a ton of APIs to deal with UTF-8, and some for UTF-16. – Timmmm Jan 22 '22 at 16:01
  • @Timmmm By "deal with encodings" I meant that there is no comprehensive API for working with various encodings, like e.g. Java with its `java.nio.charset.Charset` machinery. Sure, there are many methods for working with UTF-8 strings, and some methods for UTF-16, but this can hardly be called an "API to deal with encodings" on the level of rust-encoding or encoding_rs. The question's author was lucky in that Latin-1 is indeed a subset of Unicode, so a simple code point conversion worked, but for any other encoding the standard library wouldn't have helped. – Vladimir Matveev Feb 02 '22 at 23:50