
I want to split a UTF-8 string into chunks of equal size. I came up with a solution that does exactly that. Now I want to simplify it by removing the first collect call, if possible. Is there a way to do it?

fn main() {
    let strings = "ĄĆĘŁŃÓŚĆŹŻ"
        .chars()
        .collect::<Vec<char>>()
        .chunks(3)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}


Grzegorz Żur
  • 1
    Seems like, in order to get chunks, you need to collect into vectors. See here: https://stackoverflow.com/questions/42134874/are-there-equivalents-to-slicechunks-windows-for-iterators-to-loop-over-pairs – cadolphs Jun 24 '21 at 20:49
  • 3
    As always with Unicode strings, you need to be careful about exactly what you mean by "equal-sized chunks". You may want to consider graphemes rather than characters, since splitting on characters will break up combining characters and combining emoji. – Michael Anderson Jun 25 '21 at 00:18
  • 1
    Here's an example of the issues raised by @MichaelAnderson: [playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=0d361c546c6b69fe68bf644e76114b13) – Jmb Jun 25 '21 at 07:07
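
To make the concern in these comments concrete, here is a minimal standard-library sketch. The helper name `char_chunks` is hypothetical; it simply wraps the approach from the question. It shows that a letter followed by a combining accent is two chars, so char-based chunking can detach the accent from its base letter.

```rust
/// Splits `s` into chunks of `size` chars (Unicode scalar values), not graphemes.
/// Hypothetical helper wrapping the approach from the question.
fn char_chunks(s: &str, size: usize) -> Vec<String> {
    assert!(size > 0);
    s.chars()
        .collect::<Vec<char>>()
        .chunks(size)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}

fn main() {
    // "e" followed by U+0301 (combining acute accent) renders as a single "é",
    // but it is made of two chars.
    let s = "e\u{301}a";
    assert_eq!(s.chars().count(), 3);

    // With a chunk size of 1 the accent lands in a chunk of its own,
    // detached from its base letter.
    let parts = char_chunks(s, 1);
    assert_eq!(parts, vec!["e", "\u{301}", "a"]);
}
```

Whether this matters depends on the input: for precomposed letters such as the ones in the question, chars and graphemes coincide.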

3 Answers

You can use chunks() from Itertools.

use itertools::Itertools; // 0.10.1

fn main() {
    let strings = "ĄĆĘŁŃÓŚĆŹŻ"
        .chars()
        .chunks(3)
        .into_iter()
        .map(|chunk| chunk.collect::<String>())
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}
Francis Gagné
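
If adding a dependency is not an option, the same one-chunk-at-a-time idea can be sketched with the standard library alone. This is not the Itertools implementation, only a rough equivalent under the assumption that owned `String` chunks are acceptable; the function name `chunk_chars` is made up for illustration.

```rust
/// Collects `s` into `String` chunks of `size` chars each, without first
/// building a `Vec<char>`. Hypothetical helper, not part of any library.
fn chunk_chars(s: &str, size: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    let mut chars = s.chars().peekable();
    let mut out = Vec::new();
    // Keep going while there is at least one char left to start a chunk.
    while chars.peek().is_some() {
        // `by_ref` borrows the iterator, so `take` consumes at most `size`
        // chars and leaves the rest for the next round.
        out.push(chars.by_ref().take(size).collect::<String>());
    }
    out
}

fn main() {
    let strings = chunk_chars("ĄĆĘŁŃÓŚĆŹŻ", 3);
    assert_eq!(strings, vec!["ĄĆĘ", "ŁŃÓ", "ŚĆŹ", "Ż"]);
    println!("{:?}", strings);
}
```

The last chunk is shorter when the number of chars is not a multiple of the chunk size, which matches the behaviour of the code in the question.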

This doesn't require Itertools as a dependency and also does not allocate, as it iterates over slices of the original string:

fn chunks(s: &str, length: usize) -> impl Iterator<Item = &str> {
    assert!(length > 0);
    // Byte offsets at which each char starts.
    let mut indices = s.char_indices().map(|(idx, _)| idx).peekable();

    std::iter::from_fn(move || {
        // Start of this chunk; `?` ends the iteration once the input is exhausted.
        let start_idx = indices.next()?;
        // Skip the remaining `length - 1` chars of this chunk.
        for _ in 0..length - 1 {
            indices.next();
        }
        // The chunk ends where the next char starts, or at the end of the string.
        let end_idx = indices.peek().copied().unwrap_or(s.len());
        Some(&s[start_idx..end_idx])
    })
}


fn main() {
    let strings = chunks("ĄĆĘŁŃÓŚĆŹŻ", 3).collect::<Vec<&str>>();
    println!("{:?}", strings);
}
user2722968
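
The slicing in the answer above works because the offsets come from `char_indices`, which only ever reports positions where a char begins. The following sketch illustrates why arbitrary byte offsets would not be safe; `char_offsets` is a hypothetical helper used only for this demonstration.

```rust
/// Byte offset at which each char of `s` starts (hypothetical helper).
fn char_offsets(s: &str) -> Vec<usize> {
    s.char_indices().map(|(idx, _)| idx).collect()
}

fn main() {
    let s = "ĄĆĘŁ";
    // Four chars but eight bytes: each of these letters takes two bytes in UTF-8.
    assert_eq!(s.chars().count(), 4);
    assert_eq!(s.len(), 8);
    assert_eq!(char_offsets(s), vec![0, 2, 4, 6]);

    // Offset 1 falls inside "Ą"; `&s[0..1]` would panic, and `get` returns None.
    assert!(!s.is_char_boundary(1));
    assert!(s.get(0..1).is_none());

    // Offsets reported by `char_indices` are always valid slice bounds.
    assert_eq!(&s[0..6], "ĄĆĘ");
}
```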

After considering the grapheme problem raised in the comments, I ended up with the following solution, which uses the unicode-segmentation crate.

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let length = 3; // chunk size, counted in grapheme clusters
    let strings = "ĄĆĘŁŃÓŚĆŹŻèèèèè"
        .graphemes(true)
        .collect::<Vec<&str>>()
        .chunks(length)
        .map(|chunk| chunk.concat())
        .collect::<Vec<String>>();
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}

I hope some simplifications can still be made.

Grzegorz Żur
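
One possible simplification, offered as a sketch rather than a tested drop-in: the slicing approach from the other answer can be generalised to take any iterator of boundary offsets. The function `chunks_at` below is hypothetical. It is shown here with `char_indices` so that it runs with the standard library alone; with unicode-segmentation in scope, the offsets could presumably come from `grapheme_indices(true)` instead, which would avoid both the intermediate `Vec<&str>` and the `concat` allocations.

```rust
/// Splits `s` into slices that each span `length` consecutive boundaries.
/// `starts` is assumed to yield ascending byte offsets on char boundaries,
/// beginning at 0 for a non-empty string. Hypothetical helper.
fn chunks_at<'a>(
    s: &'a str,
    starts: impl Iterator<Item = usize> + 'a,
    length: usize,
) -> impl Iterator<Item = &'a str> + 'a {
    assert!(length > 0);
    // Keep only the offsets at which a chunk begins.
    let mut bounds = starts.step_by(length).peekable();
    std::iter::from_fn(move || {
        let start = bounds.next()?;
        // A chunk ends where the next one starts, or at the end of the string.
        let end = bounds.peek().copied().unwrap_or(s.len());
        Some(&s[start..end])
    })
}

fn main() {
    let s = "ĄĆĘŁŃÓŚĆŹŻ";
    let by_char: Vec<&str> = chunks_at(s, s.char_indices().map(|(i, _)| i), 3).collect();
    assert_eq!(by_char, vec!["ĄĆĘ", "ŁŃÓ", "ŚĆŹ", "Ż"]);
    // Assumption: with `use unicode_segmentation::UnicodeSegmentation;`,
    // `s.grapheme_indices(true).map(|(i, _)| i)` could be passed as `starts`.
}
```

The result borrows from the original string, so the chunks are `&str` rather than `String`.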