19

In Rust it's possible to get UTF-8 from bytes by doing this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

This either succeeds or fails outright, whereas Python lets you choose how decoding errors are handled, e.g.:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape')

In this example, the surrogateescape argument escapes invalid UTF-8 sequences: instead of ignoring or replacing bytes that can't be decoded, each one is mapped to a lone surrogate code point (U+DC80 to U+DCFF), so the original bytes can be recovered when encoding again. See the Python docs for details.

Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?

ideasman42

2 Answers

17

Yes, via String::from_utf8_lossy:

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he�lo
}
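
As a side note, from_utf8_lossy returns a Cow<str> rather than a String, so valid input is borrowed and only input that needed replacement is copied. A minimal sketch of that behavior (the helper name lossy_allocated is made up for illustration):

```rust
use std::borrow::Cow;

// Hypothetical helper: reports whether lossy decoding had to build a new
// String (i.e. at least one invalid sequence was replaced with U+FFFD).
fn lossy_allocated(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Owned(_))
}

fn main() {
    // Valid UTF-8 is returned as a borrowed &str, with no copy made.
    println!("{}", lossy_allocated(b"hello"));

    // Invalid input produces an owned String containing U+FFFD.
    println!("{}", lossy_allocated(&[104, 101, 0xFF, 108, 111]));
}
```

Call .into_owned() on the result if a String is required either way.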

If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.

A quickly hacked-up example:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}
let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo
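
For completeness, here is one way the elided parts of that sketch could be filled in. It assumes the omitted lines are the same as in the first example function, with only the recovery step replaced by a call to handler; treat it as an illustration rather than the only way to structure it:

```rust
use std::fmt::Write;
use std::str;

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // Everything that was left is valid UTF-8, so we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    // SAFETY: `str::from_utf8` already validated every byte
                    // before `valid_up_to()`.
                    output.push_str(unsafe { str::from_utf8_unchecked(good) });
                }

                if bad.is_empty() {
                    return output;
                }

                // Let the caller decide what to write; `bad` starts at the
                // first invalid byte and runs to the end of the input.
                handler(&mut output, bad);

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
        write!(output, "\\U{{{}}}", bytes[0]).unwrap()
    });
    println!("{}", r); // he\U{255}lo
}
```

Note that the handler receives the whole remaining slice, but the loop only skips one byte per error, so a handler should normally look at bytes[0] only.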

Shepmaster
  • Note that `from_utf8_lossy` doesn't provide different ways of handling errors as Python does. Instead of escaping, invalid UTF-8 sequences are replaced with `U+FFFD` (matching Python's `replace` behavior). So I think the short answer to this question is **no**, though it's worth mentioning `from_utf8_lossy` still. – ideasman42 Jan 04 '17 at 04:08
  • The short answer to either of the posed questions ("Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?" or "Does Rust have a way to get a UTF-8 string from bytes which handles errors without failing entirely?") is **no**? I'm pretty sure that this code does exactly that. – Shepmaster Jan 04 '17 at 04:14
  • The docs for `from_utf8_lossy` state: *"During this conversion, from_utf8_lossy() will replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �"*. So this is a replacement, not an escape sequence. The first part of this answer shows how converting with an escape sequence could be written: http://stackoverflow.com/a/41450295/432509 – ideasman42 Jan 04 '17 at 04:22
  • @ideasman42 what do you mean by escape sequence in this case? What's an example? – Shepmaster Jan 04 '17 at 04:26
  • Instead of replacing the character, the escape sequence shows the character using some identifier `\N{...}` for example, so instead of being *lossy*, it includes the characters in the string (typically as a number). See: https://docs.python.org/3/library/codecs.html#error-handlers for some examples. As noted in the OP, Python can use `surrogateescape` for this. Will clarify the question since anyone not familiar with Python won't find it so helpful. – ideasman42 Jan 04 '17 at 04:36
  • For the example given in the answer, `bytes([104, 101, 0xFF, 108, 111]).decode('utf-8', 'surrogateescape')` would evaluate to `'he\udcfflo'`, with `U+DCFF` being the "escape" character (a code point normally [not valid in Unicode](http://www.fileformat.info/info/unicode/char/dcff/index.htm)) used to represent the invalid 0xff byte. Replacing 0xff with 0xfe produces `\udcfe`, and so on. – user4815162342 Jan 04 '17 at 12:35
  • @user4815162342 But `surrogateescape` would be completely pointless in Rust; it seems like an alternate implementation of `OsStr`. – Shepmaster Jan 04 '17 at 16:24
  • Also, it wouldn't actually work in today's Rust, whose strings and chars reject surrogate code points - for example, `"he\u{dcff}lo"` is a compile-time error, and `::std::char::from_u32(0xdcff)` returns `None`. – user4815162342 Jan 04 '17 at 16:36
  • The question only asks about escaping the string, not how to perform `surrogateescape` in Rust; that's just an example of a common escaping method used in Python. – ideasman42 Jan 04 '17 at 16:39
2

You can either:

  1. Construct it yourself using strict UTF-8 decoding, which returns an error indicating the position where decoding failed; you can then escape the offending bytes. However, that's inefficient, since you will decode each failed attempt twice.

  2. Try third-party crates that provide more customizable charset decoders.
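
As a rough illustration of the first option using only the standard library, the sketch below relies on Utf8Error::valid_up_to and Utf8Error::error_len to find each invalid sequence and writes its bytes as \xNN. The function name decode_escaping_invalid and the \xNN escape format are assumptions made for this example, not an established convention. The valid prefix is converted with from_utf8_unchecked, so in this particular sketch it is not validated a second time:

```rust
use std::fmt::Write;
use std::str;

// Hypothetical helper: decodes `bytes`, writing every undecodable byte
// as a `\xNN` escape instead of failing or substituting U+FFFD.
fn decode_escaping_invalid(mut bytes: &[u8]) -> String {
    let mut output = String::with_capacity(bytes.len());

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, rest) = bytes.split_at(e.valid_up_to());
                // SAFETY: the failed `from_utf8` call already validated
                // every byte before `valid_up_to()`.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });

                // `error_len()` is `None` when the input ends in the middle
                // of a multi-byte sequence; everything left is then invalid.
                let bad_len = e.error_len().unwrap_or(rest.len());
                let (bad, tail) = rest.split_at(bad_len);
                for b in bad {
                    write!(output, "\\x{:02X}", b).unwrap();
                }

                // Resume decoding right after the invalid sequence.
                bytes = tail;
            }
        }
    }
}

fn main() {
    let s = decode_escaping_invalid(&[104, 101, 0xFF, 108, 111]);
    println!("{}", s); // he\xFFlo
}
```

This escape is not reversible as written, because a literal backslash in valid input is passed through unchanged; escaping backslashes as well would be needed for a round trip.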

Shepmaster
the8472
  • *decode each failed attempt twice* — could you expand a bit more on that? I'm not seeing the double decode attempt. – Shepmaster Jan 04 '17 at 16:39
  • Re: *"But that's inefficient since you will decode each failed attempt twice."* seems like there should be a better way that can be done in a small function, similar to this answer, but supporting valid utf8: http://stackoverflow.com/a/41450295/432509 – ideasman42 Jan 04 '17 at 16:46
  • @Shepmaster, where do you see it being possible with a single pass in the presence of errors? – the8472 Jan 04 '17 at 16:49
  • @ideasman42 that better way is the 2nd option I have suggested. – the8472 Jan 04 '17 at 16:50
  • Starting at the beginning, you parse until you hit an error, skip the error / add whatever marker you need, then continue parsing after the error. You only read each byte once, making a single pass over all the data. Thus why I'm asking what I'm missing. – Shepmaster Jan 04 '17 at 18:09
  • @Shepmaster To find the error position you need to call `from_utf8`, then you need to call it again on the prefix to get a valid partial result, so that's 2 passes over the input. In the stdlib there's nothing that takes `u8`s incrementally (a charset decoder, basically), which seems to be what OP is looking for. – the8472 Jan 04 '17 at 18:33