How to split Devanagari bi-tri and tetra conjunct consonants as a whole from a string?

Question

I am new to Rust and am trying to split Devanagari text into vowels and whole bi-, tri-, and tetra-conjunct consonants, keeping the vowel signs and virama attached, so that I can later map them to other Indic scripts. I first tried Rust's chars(), which didn't work. Then I came across grapheme clusters. I have been googling and searching SO about Unicode and UTF-8, grapheme clusters, and complex scripts.

I have used grapheme clusters in my current code, but it does not give me the desired output. I understand that this method may not work for complex scripts like Devanagari or other Indic scripts.

How can I achieve the desired output? I also have another piece of code in which I tried to build a simple cluster splitter based on a Stack Overflow answer, converting it from Python to Rust, but I have not had any luck yet. I have been stuck on this problem for 2 weeks.

Here are the Wikipedia pages on the Devanagari script and its conjuncts:

Devanagari Script: https://en.wikipedia.org/wiki/Devanagari
Devanagari Conjuncts: https://en.wikipedia.org/wiki/Devanagari_conjuncts

Here's what I wrote to split:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    // Split into extended grapheme clusters (`true` selects extended clusters).
    let hsi = hs.graphemes(true).collect::<Vec<&str>>();
    for i in hsi {
        print!("{}  ", i); // two spaces between clusters for readability
    }
}

Current output:
हि न् दी मु ख् य मं त् री हि मं त

Desired output:
हि न्दी मु ख्य मं त्री हि मं त
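
One possible workaround, sketched here under assumptions rather than as a confirmed fix: the clusters in the current output differ from the desired ones only where a cluster ends in a virama (U+094D) and the next one starts with a consonant. The hypothetical helper `merge_virama_clusters` below (not part of `unicode_segmentation`) glues such pairs back together. The consonant ranges are hand-picked from the Devanagari block and are an assumption, and the input array is written out by hand to stand in for the result of `graphemes(true)`.

```rust
/// Devanagari virama (halant), U+094D.
const VIRAMA: char = '\u{094D}';

/// Assumed consonant ranges: KA..HA plus the precomposed nukta letters.
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}' | '\u{0958}'..='\u{095F}')
}

/// Hypothetical helper: joins a cluster that ends in a virama onto the
/// following cluster when that cluster starts with a consonant.
fn merge_virama_clusters(clusters: &[&str]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    let mut pending = String::new();
    for cluster in clusters {
        // The conjunct continues only across "virama + consonant".
        let continues = pending.ends_with(VIRAMA)
            && cluster.chars().next().map_or(false, is_consonant);
        if !pending.is_empty() && !continues {
            merged.push(std::mem::take(&mut pending));
        }
        pending.push_str(cluster);
    }
    // Flush the last cluster (for example a word-final consonant + virama).
    if !pending.is_empty() {
        merged.push(pending);
    }
    merged
}

fn main() {
    // Hand-written stand-in for `hs.graphemes(true)` on the string above.
    let clusters = [
        "हि", "न्", "दी", " ", "मु", "ख्", "य", "मं", "त्", "री", " ", "हि", "मं", "त",
    ];
    let merged = merge_virama_clusters(&clusters);
    // Drop whitespace-only clusters before printing.
    let shown: Vec<&str> = merged
        .iter()
        .map(String::as_str)
        .filter(|s| !s.trim().is_empty())
        .collect();
    println!("{}", shown.join("  "));
}
```

This is intended to reproduce the grouping listed under the desired output. It does not handle ZWJ/ZWNJ after a virama or scripts other than Devanagari, so it is a starting point rather than a general solution.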

My other attempt:

I also tried to build simple grapheme clusters myself, following this SO answer: https://stackoverflow.com/a/6806203/2724286

fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result = vec![];
    let mut temp = String::new();

    for c in text.chars() {
        if (c as u32) >= 0x0300 && (c as u32) <= 0x036F {
            temp.push(c);
        } else {
            temp.push(c);
            if !temp.is_empty() {
                result.push(temp.clone());
                temp.clear();
            }
        }
    }
    if !temp.is_empty() {
        result.push(temp);
    }
    result
}

fn main() {
    let text = "संस्कृतम्";
    let split_tokens = split_conjuncts(text);
    println!("{:?}", split_tokens);

}

Output:
["स", "\u{902}", "स", "\u{94d}", "क", "\u{943}", "त", "म", "\u{94d}"]
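
A likely reason for this output: U+0300..U+036F is the Combining Diacritical Marks block, while the Devanagari signs sit in U+0900..U+097F, so no character in the text ever matches the range test and every character becomes its own token. Below is a minimal sketch of the same loop with Devanagari-specific checks. The helper names and the code point ranges are assumptions chosen by hand from the Devanagari block; this is not an implementation of the Unicode segmentation rules, and it ignores ZWJ/ZWNJ.

```rust
/// Devanagari virama (halant), U+094D.
const VIRAMA: char = '\u{094D}';

/// Assumed consonant ranges: KA..HA plus the precomposed nukta letters.
fn is_consonant(c: char) -> bool {
    matches!(c, '\u{0915}'..='\u{0939}' | '\u{0958}'..='\u{095F}')
}

/// Assumed set of signs that attach to the preceding letter.
fn is_dependent_sign(c: char) -> bool {
    matches!(
        c,
        '\u{0900}'..='\u{0903}'       // candrabindu, anusvara, visarga
            | '\u{093A}'..='\u{093C}' // vowel signs OE/OOE, nukta
            | '\u{093E}'..='\u{094F}' // vowel signs and virama
            | '\u{0951}'..='\u{0957}' // accent marks, further vowel signs
            | '\u{0962}'..='\u{0963}' // vowel signs vocalic L/LL
    )
}

/// Groups characters so that dependent signs stay with their base letter and
/// a consonant directly after a virama stays in the same token.
fn split_conjuncts(text: &str) -> Vec<String> {
    let mut result: Vec<String> = Vec::new();
    let mut prev: Option<char> = None;
    for c in text.chars() {
        let joins = match prev {
            Some(p) => is_dependent_sign(c) || (p == VIRAMA && is_consonant(c)),
            None => false,
        };
        if joins {
            // `prev` is Some here, so `result` already has a last token.
            if let Some(last) = result.last_mut() {
                last.push(c);
            }
        } else {
            result.push(c.to_string());
        }
        prev = Some(c);
    }
    result
}

fn main() {
    let tokens = split_conjuncts("संस्कृतम्");
    println!("{}", tokens.join("  "));
}
```

For "संस्कृतम्" this groups the text as सं, स्कृ, त, म्. Spaces come out as separate tokens, so they would need to be filtered if only the syllables are wanted.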

So, how can I get the desired output?

Desired output:
हि न्दी मु ख्य मं त्री हि मं त

I also checked other SO answers (listed below) dealing with Unicode, graphemes, and UTF-8 issues, but no luck yet.

Combined diacritics do not normalize with unicodedata.normalize (PYTHON)

what-is-the-difference-between-combining-characters-and-grapheme-extenders

extended-grapheme-clusters-stop-combining

I know nothing about that script, but maybe ICU boundary analysis can help: https://unicode-org.github.io/icu/userguide/boundaryanalysis/ If that doesn't help, then I'm afraid you've got to rely on mechanisms outside of the Unicode-specified dataset (i.e. something specific to Devanagari). — Joachim Sauer, Jan 23 '23 at 14:01
@JoachimSauer Thanks. As far as I understand, the default grapheme cluster boundaries might not work in this specific scenario. But I am not sure, I might be wrong. — InsParbo, Jan 23 '23 at 14:11
@InsParbo you may want to try pinging Manish Goregaokar (Manishearth) directly, he is very involved in rust i18n, and has specific extensive knowledge of indic scripts [1](https://gist.github.com/Manishearth/97900bf1de47f1389e409cc030d84f2c) [2](https://twitter.com/manishearth/status/963473961810931712) [3](https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/). — Masklinn, Jan 23 '23 at 15:40
Incidentally the JavaScript [`Intl.Segmenter`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) API yields what you expected, at least on Safari on macOS. But it could be a difference of interpretation / exact spec, or could be a bug in uniseg, I don't have the knowledge to say. — Masklinn, Jan 23 '23 at 15:48
@Masklinn, Thanks. The JavaScript Intl.Segmenter API you provided works. Is there any equivalent in Rust? I am also checking. If that fails, I will try to ping Manish Goregaokar. — InsParbo, Jan 23 '23 at 18:06
@InsParbo I would expect the `unicode_segmentation` crate to be the equivalent, although there's also an icu4x project under the unicode consortium itself, to which manishearth is one of the main contributors, which [provides support for segmentation](https://icu4x.unicode.org/doc/icu/segmenter/struct.GraphemeClusterSegmenter.html#method.segment_str). It is flagged experimental though, so YMMV. — Masklinn, Jan 24 '23 at 09:42
The unicode_segmentation project might also appreciate getting an error report if it's not already a known issue (to see on their bug tracker). — Masklinn, Jan 24 '23 at 09:43
@Masklinn Good points. I have tested icu4x. It shows the error "cannot find function `get_provider` in crate `icu_testdata`" when following this link: https://lib.rs/crates/icu_segmenter. Also, thanks for the suggestions. I will be submitting an issue on the Unicode Segmentation project's GitHub repository shortly. — InsParbo, Jan 24 '23 at 21:16
@InsParbo you need to enable the appropriate features to get those APIs. (ICU4X has a discussions tab on GitHub, you can ask questions there, though I recommend going through the tutorials first) — Manishearth, Jan 24 '23 at 22:31
