Based on the Rust book, the String::len method returns the number of bytes composing the string, which may not correspond to the length in characters.

For example, for the following Japanese string, len() returns 30, which is the number of bytes, not the number of characters (10):

let s = String::from("ラウトは難しいです！");
s.len() // returns 30

The only way I have found to get the number of characters is using the following function:

s.chars().count()

which returns 10, and is the correct number of characters.
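The difference between the two counts can be checked side by side. This is a minimal sketch; `byte_and_char_len` is a hypothetical helper written for illustration, not a standard library method, and the final full-width exclamation mark is written as the escape `\u{ff01}` so that it survives copy and paste:

```rust
/// Hypothetical helper: returns (length in bytes, number of `char`s).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Every character in this string takes 3 bytes in UTF-8.
    let s = String::from("ラウトは難しいです\u{ff01}");
    let (bytes, chars) = byte_and_char_len(&s);
    println!("{} bytes, {} chars", bytes, chars); // 30 bytes, 10 chars

    // For pure ASCII text the two counts coincide.
    let (bytes, chars) = byte_and_char_len("hello");
    println!("{} bytes, {} chars", bytes, chars); // 5 bytes, 5 chars
}
```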

Is there any method on String that returns the character count, aside from the one I am using above?

nbro
Salvatore Cosentino
  • Note that given Unicode idiosyncrasies, "number of characters" probably doesn't mean what you think it does. For example, this string: "é" (written as e followed by U+0301, a combining acute accent) has _two_ characters as evidenced in the playground: https://play.rust-lang.org/?gist=143ea763c0b16bd4ee12e628fb7ff4ca&version=stable, although this string: "é" (the single precomposed code point U+00E9) only has one character: https://play.rust-lang.org/?gist=af950651bb6394e7bc2a966147e1b035&version=stable – Jmb Sep 19 '17 at 07:35
  • See also https://crates.io/crates/unicode-segmentation – user25064 Sep 19 '17 at 13:39
  • See also this example with byte representation of a list of characters: [Rust Playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=14eb9b3b2408e5c3024bccfda4918a0c) – Claudio Fsr May 10 '23 at 12:23

1 Answer

Is there any method on String that returns the character count, aside from the one I am using above?

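No. Using s.chars().count() is correct. Note that this is an O(N) operation: UTF-8 is a variable-width encoding (each char takes 1 to 4 bytes), so the whole string has to be scanned to count them. Getting the number of bytes, by contrast, is an O(1) operation because the length is stored alongside the string.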
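To illustrate why counting is linear, here is a sketch that counts chars by hand. It relies on the fact that in valid UTF-8 every continuation byte has the bit pattern 10xxxxxx, so each byte that is not a continuation byte starts a new char. `count_chars_by_scanning` is a hypothetical name used for illustration, and this is not claimed to be how the standard library implements chars().count():

```rust
/// Hypothetical helper: counts Unicode scalar values by visiting every byte
/// and skipping UTF-8 continuation bytes (those of the form 0b10xx_xxxx).
fn count_chars_by_scanning(s: &str) -> usize {
    s.bytes()
        .filter(|&b| (b & 0b1100_0000) != 0b1000_0000)
        .count()
}

fn main() {
    let s = "ラウトは難しいです\u{ff01}";
    // Both counts require a pass over all 30 bytes.
    println!("{}", count_chars_by_scanning(s)); // 10
    println!("{}", s.chars().count()); // 10
    // The byte length is read directly, without scanning.
    println!("{}", s.len()); // 30
}
```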

You can see all the methods on str for yourself.

As pointed out in the comments, a char is a specific concept:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.

One such example is with precomposed characters:

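fn main() {
    // The two strings render identically, so they are written with escapes here.
    println!("{}", "e\u{301}".chars().count()); // 2: 'e' + U+0301 combining acute accent
    println!("{}", "\u{e9}".chars().count()); // 1: precomposed U+00E9
}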
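Because the two spellings of "é" look the same on screen, it can help to print the code points each string actually contains. The sketch below uses only the standard library; `code_points` is a hypothetical helper name:

```rust
/// Hypothetical helper: lists the Unicode scalar values in `s` as "U+XXXX".
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Decomposed form: base letter followed by a combining accent.
    println!("{:?}", code_points("e\u{301}")); // ["U+0065", "U+0301"]
    // Precomposed form: a single code point.
    println!("{:?}", code_points("\u{e9}")); // ["U+00E9"]
}
```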

You may prefer to use graphemes from the unicode-segmentation crate instead:

use unicode_segmentation::UnicodeSegmentation; // 1.6.0

fn main() {
    // Both spellings of "é" form a single grapheme cluster.
    println!("{}", "e\u{301}".graphemes(true).count()); // 1 (decomposed)
    println!("{}", "\u{e9}".graphemes(true).count()); // 1 (precomposed)
}
Shepmaster
  • BTW `s.chars().count()` is the number of unicode codepoints, you can use [unicode-segmentation](https://crates.io/crates/unicode-segmentation) to split on graphemes. – Grégory OBANOS Sep 19 '17 at 05:52
  • @Shepmaster, thank you for your answer. I knew that chars and String are different, as you can guess from my question. I was simply wondering if there was a more efficient and intuitive way to do it. – Salvatore Cosentino Sep 20 '17 at 08:11
  • @GrégoryOBANOS thank you for your comment, but I am not planning to install anything for something that should be simple. – Salvatore Cosentino Sep 20 '17 at 08:11
  • @SalvatoreCosentino Bluntly put, counting the characters in a string is **not simple** (see also [Why is capitalizing the first letter of a string so convoluted in Rust?](https://stackoverflow.com/q/38406793/155423)) and you will be greatly disserviced if you avoid using Rust crates. Many programmers are under the wrong impression that dealing with natural language should be "easy", allowing many programs to simply get it wrong. Rust is trying very hard to avoid that fate. – Shepmaster Sep 20 '17 at 12:37
  • Why do .len() as well as the two methods mentioned by you add 1 to the result when they deal with strings composed of latin characters? E.g. ```let mut ih = String::new(); io::stdin().read_line(&mut ih).expect("Failed to read line"); println!("The length of the hash that you gave me, according to .graphemes(true).count(), is {}.", ih.graphemes(true).count());``` – John Smith May 02 '21 at 08:06
  • @JerzyBrzóska I think you are experiencing [How to ignore the line break while printing a string read from stdin?](https://stackoverflow.com/q/43567092/155423) – Shepmaster May 03 '21 at 14:02
  • no method named `graphemes` found for reference `&str` in the current scope method not found in `&str` – Jim Mar 13 '23 at 18:35
  • @Jim please read the full answer: *You may prefer to use [`graphemes`](https://docs.rs/unicode-segmentation/1.6.0/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.graphemes) **from the [unicode-segmentation](https://crates.io/crates/unicode-segmentation) crate** instead* – Shepmaster Apr 14 '23 at 20:53
  • I'm not sure why the following is showing two different results for this command in the above comment, but it will always print 2. `println!("{}", "é".chars().count()); // 2` – Amit L. Jun 02 '23 at 14:48
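
Regarding the read_line comment above: read_line keeps the trailing line break in the buffer, which is why every count comes out one higher than expected. A minimal sketch of stripping it before counting follows; `char_count_without_newline` is a hypothetical helper, and the input is hard-coded here instead of read from stdin so the example runs on its own:

```rust
/// Hypothetical helper: counts chars after removing any trailing "\n" or "\r\n".
fn char_count_without_newline(line: &str) -> usize {
    line.trim_end_matches(|c: char| c == '\n' || c == '\r')
        .chars()
        .count()
}

fn main() {
    // Simulates what `io::stdin().read_line(&mut buf)` would leave in `buf`.
    let buf = String::from("abc\n");
    println!("{}", buf.chars().count()); // 4, the line break is included
    println!("{}", char_count_without_newline(&buf)); // 3
}
```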