
I am learning Rust and was surprised to find that it only distinguishes individual Unicode code points (each "char" is one code point, encoded as a UTF-8 byte sequence), not actual grapheme clusters. A diacritic, for example, is treated as a separate "char".

So for example, Rust can turn input text to a vector like this (with the help of "नमस्ते".chars()):

['न', 'म', 'स', '्', 'त', 'े'] // items 4 and 6 are diacritics and shouldn't be separate items

But how do I get a vector like this?

["न", "म", "स्", "ते"]
Nurbol Alpysbayev
  • Possible duplicate of [How do I count unique grapheme clusters in a string in Rust?](https://stackoverflow.com/questions/51818497/how-do-i-count-unique-grapheme-clusters-in-a-string-in-rust) – Denys Séguret Nov 08 '19 at 17:53

1 Answer


You want to use the unicode-segmentation crate:

use unicode_segmentation::UnicodeSegmentation; // 1.5.0

fn main() {
    for g in "नमस्ते्".graphemes(true) {
        println!("- {}", g);
    }
}

(Playground. Note: the playground editor can't handle the string properly, so the cursor position is wrong on this one line.)

This prints:

- न
- म
- स्
- ते्

Passing true as the argument means that we iterate over the extended grapheme clusters (as opposed to the legacy ones). See the graphemes documentation for more information.


Segmentation into Unicode grapheme clusters was supported by the standard library at some point, but unfortunately it was deprecated and then removed due to the size of the required Unicode tables. The de-facto solution is to use this crate instead. But yes, I think it's really unfortunate that the standard library's default way of iterating over a string uses code points, which rarely make sense semantically (counting them or splitting a string at them is generally not meaningful).
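To illustrate what the standard library does offer on its own, here is a small sketch using only std. The helper name byte_and_char_counts is invented for this example. The string is the one from the question, written as escapes so the individual code points are visible.

```rust
/// Hypothetical helper: returns (UTF-8 byte count, code point count) for `text`.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    // "नमस्ते" spelled out as its six code points.
    let namaste = "\u{0928}\u{092E}\u{0938}\u{094D}\u{0924}\u{0947}";
    let (bytes, chars) = byte_and_char_counts(namaste);
    println!("{} bytes, {} chars", bytes, chars);

    // Slicing at a char boundary compiles and runs fine, but it can cut a
    // cluster apart: the first 9 bytes are three chars, so the virama
    // (U+094D) that belongs to the third consonant is left behind.
    let prefix = &namaste[..9];
    println!("{}", prefix);
}
```

This prints "18 bytes, 6 chars" on the first line: each of these code points takes three bytes in UTF-8, and neither number matches what a reader would count as characters.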

Lukas Kalbertodt
  • @Nurbol Alpysbayev Let me know if I correctly copied the string and if the result is correct. I unfortunately do not understand this script and cannot really compare it to your expected result. – Lukas Kalbertodt Nov 08 '19 at 16:47
  • Ahh, so using the crate is the de-facto standard way of doing that? Frankly, I had searched for such a crate without success. Thank you so much! This is exactly what I need! – Nurbol Alpysbayev Nov 08 '19 at 16:48
  • @NurbolAlpysbayev I added some explanation regarding it being the de-facto solution. – Lukas Kalbertodt Nov 08 '19 at 16:51