Handling Out of Range Hex/Unicode

Question

I'm working with a Rust cdylib crate that I'm referencing and using in C++.

use std::ffi::{c_char, CStr};

#[no_mangle]
pub extern "C" fn some_function(name: *const c_char, text: *const c_char) {
    unsafe {
        let name = CStr::from_ptr(name).to_str().unwrap();
        let text = CStr::from_ptr(text).to_str().unwrap();

        // the rest
    }
}

When this function receives the character ±, it panics when attempting to get the text from the pointer. I'm passing this character in as a c_str() in C++ from a std::string:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src\lib.rs:102:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is there any way that I can properly handle this character in Rust? I don't need to manipulate it in any way; realistically, this library is simply acting as a middleman and just needs to pass it along.

When I use this to view the bytes I'm receiving:

let raw = CStr::from_ptr(text);

println!("Bytes: {:?}", raw.to_bytes_with_nul());

I get:

Bytes: [177, 0]

Make sure the C++ code passes strings in UTF-8? The ± character in UTF-8 is 0xC2 0xB1. Apparently you have some other bytes that are not 0xC2 0xB1. — user253751, Mar 31 '23 at 13:06
If you don't need to manipulate it in any way why can't it stay a `CStr` or a raw pointer? — cafce25, Mar 31 '23 at 13:10
It depends why you're passing it along, why convert it to a `&str`? — SeedyROM, Mar 31 '23 at 13:10
I'm passing it over gRPC, and the proto that I have compiled wants a String. — Daedalus, Mar 31 '23 at 13:18
@user253751 Converting it to a byte array in Rust shows it as 177, which would equal 0xB1, which is a part of this that's confusing me... — Daedalus, Mar 31 '23 at 13:19
Please include the bytes that your Rust code receives in the question. — Chayim Friedman, Mar 31 '23 at 13:21
Looks like ISO-8859-1 to me, see: [What are the options to convert ISO-8859-1 / Latin-1 to a String (UTF-8)?](https://stackoverflow.com/questions/28169745/what-are-the-options-to-convert-iso-8859-1-latin-1-to-a-string-utf-8) — cafce25, Mar 31 '23 at 13:44

Finomnis · Accepted Answer · 2023-03-31T14:52:54.583

Here is how I reproduced your problem:

use std::ffi::CStr;

fn main() {
    let raw_data: &[u8] = &[177, 0];
    let raw = unsafe { CStr::from_ptr(raw_data.as_ptr().cast()) };

    println!("Bytes: {:?}", raw.to_bytes_with_nul());

    let string = raw.to_str().unwrap();
    println!("{}", string);
}

Bytes: [177, 0]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src\main.rs:9:31

The problem here is that to_str() expects a valid UTF-8 string. [177] is not valid UTF-8: 177 (0xB1) is a continuation byte, so it cannot stand alone or start a sequence. The valid UTF-8 version would be:

println!("{:?}", "±".as_bytes());

[194, 177]
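If you would rather report the problem than panic, you can inspect the error instead of calling unwrap(). The following is only a minimal sketch using the standard library; `describe_utf8` is a hypothetical helper name, not something from your code:

```rust
use std::str;

/// Hypothetical helper: returns the decoded text, or a description of
/// where UTF-8 validation failed.
fn describe_utf8(bytes: &[u8]) -> String {
    match str::from_utf8(bytes) {
        Ok(s) => format!("valid UTF-8: {:?}", s),
        Err(e) => format!(
            "invalid UTF-8: {} valid byte(s), then a bad sequence of length {:?}",
            e.valid_up_to(), // number of leading bytes that were valid
            e.error_len()    // length of the invalid sequence, if known
        ),
    }
}

fn main() {
    // The single byte received from C++.
    println!("{}", describe_utf8(&[177]));
    // The UTF-8 encoding of the same character.
    println!("{}", describe_utf8(&[194, 177]));
}
```

The `valid_up_to()` and `error_len()` values are the same two fields that appear in the panic message above.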

Yours seems to be encoded differently, for example as Windows-1252. I will simply assume so, because without knowing more about your code, there is no way of telling for sure. It is very likely, though, as this is the default ANSI code page on Western European and US Windows installations.

The easiest way to convert between encodings is via the crate encoding_rs. Rust's standard library has no decoders for legacy code pages such as Windows-1252, so you need an external crate for this, and encoding_rs is the most established one.

use std::ffi::{c_char, CStr};

use encoding_rs::WINDOWS_1252;

fn main() {
    let raw_data: *const c_char = (&[177u8, 0u8]).as_ptr().cast();

    let raw = unsafe { CStr::from_ptr(raw_data) };

    println!("Bytes: {:?}", raw.to_bytes_with_nul());

    let (string, actual_encoding, errors) = WINDOWS_1252.decode(raw.to_bytes());

    println!("String: {:?}", string);
    println!("Actual encoding: {:?}", actual_encoding);
    println!("Errors: {}", errors);
}

Bytes: [177, 0]
String: "±"
Actual encoding: Encoding { windows-1252 }
Errors: false
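For an FFI entry point like yours, one possible shape is to try UTF-8 first and only fall back to the legacy decoding when that fails. The sketch below is an assumption-laden illustration, not a drop-in fix: `decode_c_str` is a hypothetical helper, and to keep the example free of external crates the fallback is a plain ISO-8859-1 mapping (each byte becomes the code point with the same value) rather than the `WINDOWS_1252.decode` call shown above. The two encodings agree for bytes 0xA0 to 0xFF, which covers ±, but differ for 0x80 to 0x9F (for example, 0x80 is € in Windows-1252), so in real code you would probably swap the fallback line for the encoding_rs call.

```rust
use std::ffi::CStr;

/// Hypothetical helper: turn a C string into an owned `String`.
/// Tries UTF-8 first, then falls back to ISO-8859-1.
fn decode_c_str(raw: &CStr) -> String {
    match raw.to_str() {
        // Already valid UTF-8: copy it as-is.
        Ok(s) => s.to_owned(),
        // Assumed fallback: ISO-8859-1, where byte value == code point.
        Err(_) => raw.to_bytes().iter().map(|&b| char::from(b)).collect(),
    }
}

fn main() {
    // The bytes from the question, and the UTF-8 form of the same character.
    let legacy = CStr::from_bytes_with_nul(&[177, 0]).unwrap();
    let utf8 = CStr::from_bytes_with_nul(&[194, 177, 0]).unwrap();

    println!("{:?}", decode_c_str(legacy));
    println!("{:?}", decode_c_str(utf8));
}
```

Keep in mind that this is a heuristic: some legacy-encoded byte sequences also happen to be valid UTF-8 and would be taken as such. If you control the C++ side, agreeing on a single encoding there is the more robust option.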

This is the exact path we just went down, thanks for the detailed answer! — Daedalus, Mar 31 '23 at 14:47
