How can I case fold a string in Rust?

Question

I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.

From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.

What exactly is insufficient about those methods? It's hard to answer your question without knowing the problem you're trying to solve. There are also methods defined on char: https://doc.rust-lang.org/std/primitive.char.html#method.to_lowercase — BurntSushi5, Oct 25 '16 at 23:27
@BurntSushi5 I've added some context to the question -- hope it helps. — Lambda Fairy, Oct 25 '16 at 23:45
Your best bet is probably https://docs.rs/caseless/0.1.1/caseless/ — BurntSushi5, Oct 26 '16 at 20:54

score 4 · Answer 1 · answered Nov 01 '16 at 06:51

For my use case, I've found the caseless crate to be most useful.

As far as I know, this is the only library which supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.

Here's some example code that matches a string case-insensitively:

extern crate caseless;
use caseless::Caseless;

let a = "100 ㎒";
let b = "100 mhz";

// These strings don't match with just case folding,
// but do match after compatibility (NFKD) normalization
assert!(!caseless::default_caseless_match_str(a, b));
assert!(caseless::compatibility_caseless_match_str(a, b));

To get the case folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

Caseless doesn't expose a corresponding function that normalizes as well, but you can write one using the unicode-normalization crate:

extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

let a = "100 ㎒";
assert_eq!(compatibility_case_fold(a), "100 mhz");

Note that multiple rounds of normalization and case folding are needed for a correct result.

(Thanks to BurntSushi5 for pointing me to this library.)

This answer is five years old. Would you do it differently today? — ccleve, Nov 03 '21 at 18:58
@ccleve See this other answer: https://stackoverflow.com/a/75526819/59081 — Tanveer Badar, Apr 18 '23 at 05:53

lucatrv · Accepted Answer · 2023-03-09T22:46:43.897

As of today (2023) the caseless crate looks unmaintained, while the ICU4X project seems the way to go. To apply case folding see the icu_casemapping crate. To compare strings according to language-dependent conventions, see the icu_collator crate. For a good introduction on how to correctly sort words in Rust, see here.

For documentation on Unicode theory and algorithms, see the Unicode Standard. In particular:

Case conversion and case folding: sections 3.13 and 5.18.
Collation algorithm

For documentation on the ICU4X project, see here.

To use ICU4X, you can either add the main crate icu to Cargo.toml and access the single modules (for instance icu::collator, icu::datetime, etc), or otherwise add just the single crates that you actually need (for instance icu_collator, icu_datetime, etc).
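
As a rough illustration, the two alternatives could look like this in Cargo.toml. The version numbers are an assumption (the 1.1 series), so check crates.io for the current release before copying them.

```toml
# Option 1: the umbrella crate; modules are reached as icu::collator, icu::datetime, ...
[dependencies]
icu = "1.1"            # assumed version

# Option 2: only the component crates you actually need
# [dependencies]
# icu_collator = "1.1" # assumed version
# icu_datetime = "1.1" # assumed version
```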

To check whether two words are equal regardless of case, you can apply full case folding to both strings and then compare the results for binary equality. For this you need the icu_casemapping::full_fold method and a data provider such as icu_testdata::unstable. Note that the data for icu_casemapping is currently hidden behind the feature icu_testdata/icu_casemapping, so you need to enable it explicitly in your Cargo.toml file:

[dependencies]
icu_casemapping = "0.7.1"
icu_testdata = { version = "1.1.2", features = ["icu_casemapping"] }

In the future, the icu_testdata/icu_casemapping feature may be added to icu_testdata's default features once icu_casemapping is stabilized.

Here is a simple example using the icu_casemapping::full_fold method:

use icu_casemapping::CaseMapping;

fn main() {
    let str1 = "Hello";
    let str2 = "hello";
    assert_ne!(str1, str2);
    let case_mapping = CaseMapping::try_new(&icu_testdata::unstable()).unwrap();
    assert_eq!(case_mapping.full_fold(str1), case_mapping.full_fold(str2));
}

Note that the icu_casemapping crate does not currently include normalization. This may be added in the future; see the discussion here.

Otherwise, to compare strings according to language-dependent conventions, you can use the icu_collator crate, which lets you customize several options such as strength and locale. You can find several examples here.

score 2 · Answer 3 · answered Dec 25 '20 at 21:54

For anyone who wants to stick to the standard library, I wanted some actual data on this. I pulled the full list of two-byte character pairs that fail to match with to_lowercase or to_uppercase, then ran this test:

fn lowercase(left: char, right: char) -> bool {
   for c in left.to_lowercase() {
      for d in right.to_lowercase() {
         if c == d { return true }
      }
   }
   false
}

fn uppercase(left: char, right: char) -> bool {
   for c in left.to_uppercase() {
      for d in right.to_uppercase() {
         if c == d { return true }
      }
   }
   false
}

fn main() {
   let pairs = &[
      &['\u{00E5}','\u{212B}'],&['\u{00C5}','\u{212B}'],&['\u{0399}','\u{1FBE}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{03B2}','\u{03D0}'],&['\u{03B5}','\u{03F5}'],
      &['\u{03B8}','\u{03D1}'],&['\u{03B8}','\u{03F4}'],&['\u{03D1}','\u{03F4}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{0345}','\u{03B9}'],&['\u{0345}','\u{1FBE}'],
      &['\u{03BA}','\u{03F0}'],&['\u{00B5}','\u{03BC}'],&['\u{03C0}','\u{03D6}'],
      &['\u{03C1}','\u{03F1}'],&['\u{03C2}','\u{03C3}'],&['\u{03C6}','\u{03D5}'],
      &['\u{03C9}','\u{2126}'],&['\u{0392}','\u{03D0}'],&['\u{0395}','\u{03F5}'],
      &['\u{03D1}','\u{03F4}'],&['\u{0398}','\u{03D1}'],&['\u{0398}','\u{03F4}'],
      &['\u{0345}','\u{1FBE}'],&['\u{0345}','\u{0399}'],&['\u{0399}','\u{1FBE}'],
      &['\u{039A}','\u{03F0}'],&['\u{00B5}','\u{039C}'],&['\u{03A0}','\u{03D6}'],
      &['\u{03A1}','\u{03F1}'],&['\u{03A3}','\u{03C2}'],&['\u{03A6}','\u{03D5}'],
      &['\u{03A9}','\u{2126}'],&['\u{0398}','\u{03F4}'],&['\u{03B8}','\u{03F4}'],
      &['\u{03B8}','\u{03D1}'],&['\u{0398}','\u{03D1}'],&['\u{0432}','\u{1C80}'],
      &['\u{0434}','\u{1C81}'],&['\u{043E}','\u{1C82}'],&['\u{0441}','\u{1C83}'],
      &['\u{0442}','\u{1C84}'],&['\u{0442}','\u{1C85}'],&['\u{1C84}','\u{1C85}'],
      &['\u{044A}','\u{1C86}'],&['\u{0412}','\u{1C80}'],&['\u{0414}','\u{1C81}'],
      &['\u{041E}','\u{1C82}'],&['\u{0421}','\u{1C83}'],&['\u{1C84}','\u{1C85}'],
      &['\u{0422}','\u{1C84}'],&['\u{0422}','\u{1C85}'],&['\u{042A}','\u{1C86}'],
      &['\u{0463}','\u{1C87}'],&['\u{0462}','\u{1C87}']
   ];

   let (mut upper, mut lower) = (0, 0);

   for pair in pairs.iter() {
      print!("U+{:04X} ", pair[0] as u32);
      print!("U+{:04X} pass: ", pair[1] as u32);
      if uppercase(pair[0], pair[1]) {
         print!("to_uppercase ");
         upper += 1;
      } else {
         print!("             ");
      }
      if lowercase(pair[0], pair[1]) {
         print!("to_lowercase");
         lower += 1;
      }
      println!();
   }

   println!("upper pass: {}, lower pass: {}", upper, lower);
}

The result is below. Interestingly, one pair (U+03D1 and U+03F4, which is listed twice) fails with both methods. Based on the counts, 46 passes against 8, to_uppercase is the better of the two options.

U+00E5 U+212B pass:              to_lowercase
U+00C5 U+212B pass:              to_lowercase
U+0399 U+1FBE pass: to_uppercase
U+03B9 U+1FBE pass: to_uppercase
U+03B2 U+03D0 pass: to_uppercase
U+03B5 U+03F5 pass: to_uppercase
U+03B8 U+03D1 pass: to_uppercase
U+03B8 U+03F4 pass:              to_lowercase
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: to_uppercase
U+0345 U+03B9 pass: to_uppercase
U+0345 U+1FBE pass: to_uppercase
U+03BA U+03F0 pass: to_uppercase
U+00B5 U+03BC pass: to_uppercase
U+03C0 U+03D6 pass: to_uppercase
U+03C1 U+03F1 pass: to_uppercase
U+03C2 U+03C3 pass: to_uppercase
U+03C6 U+03D5 pass: to_uppercase
U+03C9 U+2126 pass:              to_lowercase
U+0392 U+03D0 pass: to_uppercase
U+0395 U+03F5 pass: to_uppercase
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: to_uppercase
U+0398 U+03F4 pass:              to_lowercase
U+0345 U+1FBE pass: to_uppercase
U+0345 U+0399 pass: to_uppercase
U+0399 U+1FBE pass: to_uppercase
U+039A U+03F0 pass: to_uppercase
U+00B5 U+039C pass: to_uppercase
U+03A0 U+03D6 pass: to_uppercase
U+03A1 U+03F1 pass: to_uppercase
U+03A3 U+03C2 pass: to_uppercase
U+03A6 U+03D5 pass: to_uppercase
U+03A9 U+2126 pass:              to_lowercase
U+0398 U+03F4 pass:              to_lowercase
U+03B8 U+03F4 pass:              to_lowercase
U+03B8 U+03D1 pass: to_uppercase
U+0398 U+03D1 pass: to_uppercase
U+0432 U+1C80 pass: to_uppercase
U+0434 U+1C81 pass: to_uppercase
U+043E U+1C82 pass: to_uppercase
U+0441 U+1C83 pass: to_uppercase
U+0442 U+1C84 pass: to_uppercase
U+0442 U+1C85 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+044A U+1C86 pass: to_uppercase
U+0412 U+1C80 pass: to_uppercase
U+0414 U+1C81 pass: to_uppercase
U+041E U+1C82 pass: to_uppercase
U+0421 U+1C83 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+0422 U+1C84 pass: to_uppercase
U+0422 U+1C85 pass: to_uppercase
U+042A U+1C86 pass: to_uppercase
U+0463 U+1C87 pass: to_uppercase
U+0462 U+1C87 pass: to_uppercase
upper pass: 46, lower pass: 8
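
To show what these numbers mean for whole strings, here is a minimal std-only sketch. The helper names (eq_upper, eq_lower, eq_upper_then_lower) are hypothetical and not part of any library. Chaining to_uppercase and to_lowercase is only an approximation assumed for illustration: it is not Unicode case folding and it does no normalization. The asserted pairs are taken from the table above.

```rust
// Hypothetical helpers (not part of std or any crate): approximate
// caseless comparison using only the standard library.

/// Compare after upper-casing both sides.
fn eq_upper(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}

/// Compare after lower-casing both sides.
fn eq_lower(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Upper-case first, then lower-case the result. This is an assumption made
/// for illustration; it is not real case folding.
fn eq_upper_then_lower(a: &str, b: &str) -> bool {
    a.to_uppercase().to_lowercase() == b.to_uppercase().to_lowercase()
}

fn main() {
    // U+03C2 / U+03C3 (final and non-final sigma): only to_uppercase matches.
    assert!(eq_upper("\u{03C2}", "\u{03C3}"));
    assert!(!eq_lower("\u{03C2}", "\u{03C3}"));

    // U+00E5 / U+212B (a with ring, ANGSTROM SIGN): only to_lowercase matches.
    assert!(!eq_upper("\u{00E5}", "\u{212B}"));
    assert!(eq_lower("\u{00E5}", "\u{212B}"));

    // U+03D1 / U+03F4 (theta symbols): neither single method matches,
    // but the chained version does.
    assert!(!eq_upper("\u{03D1}", "\u{03F4}"));
    assert!(!eq_lower("\u{03D1}", "\u{03F4}"));
    assert!(eq_upper_then_lower("\u{03D1}", "\u{03F4}"));
}
```

The chained version matches all three of these pairs, including the one that each single method misses. I have not checked it against the full case folding data, so a real case folding library is still the safer choice when correctness matters.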

score 1 · Answer 4 · answered Oct 26 '16 at 00:20

The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).
