3

Does Rust provide a way to decode a single character (unicode-scalar-value to be exact) from a &[u8], which may be multiple bytes, returning a single USV?

Something like GLib's g_utf8_get_char & g_utf8_next_char:

// Example of what glib's functions might look like once ported to Rust.
let mut i = 0;
while i < slice.len() {
    let unicode_char = g_utf8_get_char(&slice[i..]);

    // do something with the unicode character
    function(unicode_char);

    // move onto the next.
    i += g_utf8_next_char(&slice[i..]);
}

Short of porting parts of the GLib API to Rust, does Rust provide a way to do this, besides some trial & error calls to from_utf8 which stop once the second character is reached?

See GLib's code.

ideasman42
  • As I have already said in a [different question](http://stackoverflow.com/a/41468380/1362755), third-party crates provide such low-level functionality. – the8472 Jan 04 '17 at 17:30
  • 2
    Stating *"there is a crate for that"* with a link to some search results isn't answering the question. In this case I'd rather use the stdlib if that's possible. If not, then the answer to this question is simply "no". – ideasman42 Jan 07 '17 at 15:28

2 Answers

1

No, there is no such functionality publicly exposed in the Rust standard library as of Rust 1.14.


And neither should there be. Rust doesn't believe in a gigantic standard library. Crates are trivial to use and prevent people from rewriting code. Many people have an incorrect opinion (yeah, that's right: an opinion is incorrect) that using dependencies makes their program weaker.

Anything put in the standard library has to be maintained forever. There are zero plans for a Rust 2.0 that would break backwards compatibility. Python is the normal example here, with a multitude of "get data from a URL" parts of the standard library that are all redundant and deprecated now. The Python maintainers have to waste time keeping those working, instead of advancing the language.

Third-party crates allow things to be created, evolve, and die without burdening the entire language.

Shepmaster
  • 4
    counterexample: node's leftpad. i know cargo strives to avoid that particular case. but it's not like there are *no* downsides to relying on additional dependencies. At least you have to vet the reliability (version management, API design, use of unsafe features) of possibly many options when choosing dependencies. – the8472 Jan 04 '17 at 18:35
  • 1
    @the8472 leftpad is, unfortunately, pure FUD in the context of Rust / Cargo. Yanked versions are still downloadable from the repo, they just won't be automatically selected during upgrade time. For the other points: *version management* — only if you care; you can choose to never upgrade it and just leave the version locked forever. *API design* — true, it may not be the API you want, but **in the context of this question**, the stdlib can *also* provide an API you dislike, and then you are stuck with it. *use of unsafe features* — true, but the stdlib *also* uses unsafe code under the hood. – Shepmaster Jan 04 '17 at 19:03
  • 3
    No, leftpad is not FUD, it merely served as a general example that dependencies can introduce additional headaches. I did not say that I expect cargo to have that *particular* problem, in fact I already said the opposite: *"i know cargo strives to avoid that particular case."* (note my use of the word "particular" here, implying that it was more about a more general issue). What i'm saying is that as you heap on dependencies you add sources of problems but you also add utility. So there are tradeoffs to be made. Ideally you want to extract the maximum utility with the minimum of dependencies. – the8472 Jan 04 '17 at 19:28
  • @the8472 but I'm trying to point out that the standard library *is a dependency*, written by the same humans that make crates. It's not like one is magically going to be better than the other. The standard library is a facade of many small crates, in fact. If I took 10 crates from crates.io and made my own facade, would that be any more palatable? Adding more stuff to the stdlib would have the same effect. – Shepmaster Jan 04 '17 at 19:32
  • 5
    Anyway, I'm not saying that dependencies should be avoided altogether. But they are a tradeoff, so it makes sense to investigate whether the use of any particular one can be avoided. E.g. several times I already had to use multiple crates doing pretty much the same thing at the same time because each only offered a subset of what I actually needed. It certainly does not feel optimal. – the8472 Jan 04 '17 at 19:32
  • *multiple crates doing pretty much the same thing* — that's unfortunate :-( Any idea if those crate authors would be amenable to teaming up? You could always produce your own combination of the three and publish that ;-). – Shepmaster Jan 04 '17 at 19:34
  • 1
    The standard library is written by the language maintainers, no? So there would be some expectation that there are synergies from intimate knowledge of what the compiler does and being developed in tandem with it. Optimal implementations and all that. That provides some confidence in its quality. Confidence which does not come naturally for 3rd party crates. – the8472 Jan 04 '17 at 19:35
  • @the8472 *Confidence which does not come naturally for 3rd party crates* — I meant "written by the same humans" quite literally. Futures is a prime example, as are libraries like Itertools or Regex. All written by core team members (IIRC), but they might not have been core team members when they wrote the libraries. Anywho, I've got the "extended discussions" warning, so feel free to [join our chat](http://chat.stackoverflow.com/rooms/62927/rust) if you'd like to continue disabusing me of my notions. – Shepmaster Jan 04 '17 at 19:38
  • Re: *"Third-party crates allow things to be created, evolve, and die without burdening the entire language."* - This doesn't solve the problem, it just moves it around: either way, if I want my software to be built on a stable base, it must rely on a large body of code (made up of crates), which may evolve and die. It's unavoidable of course. AFAICS it's more a solution from the Rust maintainers' perspective, and a reason to continue to be careful taking on new dependencies. – ideasman42 Jan 07 '17 at 15:21
  • Re: *"And neither **should** there be."* - While I agree that *including-the-kitchen-sink* is to be avoided, I don't see any compelling reason that decoding a single character is a step too far down this path (something to check with the Rust devs I suppose). – ideasman42 Jan 07 '17 at 15:23
  • Rust already has utf8 decoding in the standard library. Exposing that existing functionality some more would be a good idea and prevent needless duplication of effort. – Eloff Oct 18 '21 at 11:36
0

You can convert a byte slice (&[u8]) into a string slice (&str) by using str::from_utf8 (note that this validates that the whole byte slice is valid UTF-8). You can then use the chars() iterator on the string slice to iterate on each character (char) in the string.
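As a minimal sketch of that approach: the helper name `chars_with_offsets` is made up for illustration, while `str::from_utf8` and `char_indices()` are the standard library calls. It validates the whole slice first and then yields each character with the byte offset where it starts, which roughly corresponds to the `i` index in the question's loop.

```rust
use std::str;

/// Validate the whole slice as UTF-8, then collect each scalar value
/// together with the byte offset at which it starts.
/// (`chars_with_offsets` is a hypothetical helper name, not a std function.)
fn chars_with_offsets(bytes: &[u8]) -> Result<Vec<(usize, char)>, str::Utf8Error> {
    // Fails if any part of the slice is not valid UTF-8.
    let text = str::from_utf8(bytes)?;
    // `char_indices` yields (byte offset, char) pairs in order.
    Ok(text.char_indices().collect())
}
```

If you only need the characters, `text.chars()` is enough; `ch.len_utf8()` gives the number of bytes a character occupies if you want to advance an index yourself.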

Francis Gagné
  • The limitation with this is that you can't control the behavior when an invalid UTF-8 byte sequence is encountered (if you wanted to implement something like https://docs.python.org/3.6/library/codecs.html#surrogateescape). – ideasman42 Mar 01 '17 at 04:43
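
One possible way around that limitation using only the standard library is to validate a short prefix and inspect the returned `Utf8Error`. The sketch below is not an official API: `Step` and `decode_first` are hypothetical names, and it assumes a toolchain where `Utf8Error::error_len` is available (it was stabilized after the Rust 1.14 mentioned above, in 1.20 as far as I know). The caller decides what to do with an invalid sequence.

```rust
use std::str;

/// Outcome of decoding one step from the front of a byte slice.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum Step {
    /// A valid scalar value and the number of bytes it occupied.
    Char(char, usize),
    /// An invalid or truncated sequence and the number of bytes it spans.
    Invalid(usize),
}

/// Decode the first scalar value of `bytes`, or report how many bytes
/// of invalid input to skip. Returns `None` for an empty slice.
fn decode_first(bytes: &[u8]) -> Option<Step> {
    if bytes.is_empty() {
        return None;
    }
    // A UTF-8 encoded scalar value is at most 4 bytes long,
    // so only a short prefix needs to be validated.
    let prefix = &bytes[..bytes.len().min(4)];
    let valid = match str::from_utf8(prefix) {
        Ok(s) => s,
        // The prefix starts with at least one valid character.
        Err(e) if e.valid_up_to() > 0 => str::from_utf8(&prefix[..e.valid_up_to()])
            .expect("bytes before valid_up_to() are valid UTF-8"),
        // The very first sequence is bad. `error_len()` is `None` when the
        // input ends in the middle of a sequence; skip the rest in that case.
        Err(e) => return Some(Step::Invalid(e.error_len().unwrap_or(prefix.len()))),
    };
    let ch = valid.chars().next()?;
    Some(Step::Char(ch, ch.len_utf8()))
}
```

A loop like the one in the question can then advance its index by the length carried in either variant, substituting, escaping, or rejecting the bytes reported as `Invalid`.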