
I'm writing a simple full text search library, and need case folding to check if two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.
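To make the gap concrete, here is a minimal sketch using only the standard library. The helper names `eq_lower` and `eq_upper` are made up for this illustration. Each pair below is one that Unicode case folding treats as equal, yet one of the two std methods keeps them apart.

```rust
// Hypothetical helpers: compare two strings after mapping both sides
// to lowercase or to uppercase with the standard library.
fn eq_lower(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn eq_upper(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}

fn main() {
    // "ß" is already lowercase, so lowercasing never turns it into "ss".
    assert!(!eq_lower("ß", "SS"));
    assert!(eq_upper("ß", "SS"));

    // U+03C2 (final sigma) and U+03C3 (sigma) are both already lowercase.
    assert!(!eq_lower("ς", "σ"));
    assert!(eq_upper("ς", "σ"));

    // U+212A KELVIN SIGN is already uppercase, so uppercasing leaves it
    // distinct from "K", while lowercasing maps it to "k".
    assert!(!eq_upper("\u{212A}", "k"));
    assert!(eq_lower("\u{212A}", "k"));
}
```

Neither direction on its own covers all three pairs, which is the problem case folding is meant to solve.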

From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.

Lambda Fairy
  • What exactly is insufficient about those methods? It's hard to answer your question without knowing the problem you're trying to solve. There are also methods defined on char: https://doc.rust-lang.org/std/primitive.char.html#method.to_lowercase – BurntSushi5 Oct 25 '16 at 23:27
  • @BurntSushi5 I've added some context to the question -- hope it helps. – Lambda Fairy Oct 25 '16 at 23:45
  • Your best bet is probably https://docs.rs/caseless/0.1.1/caseless/ – BurntSushi5 Oct 26 '16 at 20:54

4 Answers


For my use case, I've found the caseless crate to be most useful.

As far as I know, this is the only library which supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.

Here's some example code that matches a string case-insensitively:

extern crate caseless;

fn main() {
    let a = "100 ㎒";
    let b = "100 mhz";

    // These strings don't match with just case folding,
    // but do match after compatibility (NFKD) normalization
    assert!(!caseless::default_caseless_match_str(a, b));
    assert!(caseless::compatibility_caseless_match_str(a, b));
}

To get the case folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

Caseless doesn't expose a corresponding function that normalizes as well, but you can write one using the unicode-normalization crate:

extern crate caseless;
extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

// NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), read from left to right
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

fn main() {
    let a = "100 ㎒";
    assert_eq!(compatibility_case_fold(a), "100 mhz");
}

Note that multiple rounds of normalization and case folding are needed for a correct result.

(Thanks to BurntSushi5 for pointing me to this library.)

Lambda Fairy

As of today (2023), the caseless crate looks unmaintained, while the ICU4X project seems to be the way to go. To apply case folding, see the icu_casemapping crate. To compare strings according to language-dependent conventions, see the icu_collator crate. For a good introduction on how to correctly sort words in Rust, see here.

For documentation on Unicode theory and algorithms, see the Unicode Standard. In particular:

For documentation on the ICU4X project, see here.

To use ICU4X, you can either add the main crate icu to Cargo.toml and access the individual modules (for instance icu::collator, icu::datetime, etc.), or add only the individual crates that you actually need (for instance icu_collator, icu_datetime, etc.).

To check whether two words are equal regardless of case, you can apply full case folding to both strings and then compare them for binary equality. For this you need the icu_casemapping::full_fold method and a data provider such as icu_testdata::unstable. Note that the data for icu_casemapping is currently hidden behind the icu_testdata/icu_casemapping feature, so you need to enable it explicitly in your Cargo.toml file:

[dependencies]
icu_casemapping = "0.7.1"
icu_testdata = { version = "1.1.2", features = ["icu_casemapping"] }

In the future, the icu_testdata/icu_casemapping feature may be added to icu_testdata's default features once icu_casemapping is stabilized.

Here is a simple example using the icu_casemapping::full_fold method:

use icu_casemapping::CaseMapping;

fn main() {
    let str1 = "Hello";
    let str2 = "hello";
    assert_ne!(str1, str2);
    let case_mapping = CaseMapping::try_new(&icu_testdata::unstable()).unwrap();
    assert_eq!(case_mapping.full_fold(str1), case_mapping.full_fold(str2));
}

Note that the icu_casemapping crate does not currently include normalization; this may be added in the future (see the discussion here).

Otherwise, to compare strings according to language-dependent conventions, you can use the icu_collator crate, which lets you customize several options such as strength and locale. You can find several examples here.

lucatrv

For anyone who wants to stick to the standard library, I wanted some actual data on this. I pulled the full list of two-byte character pairs that fail to match with either to_lowercase or to_uppercase, then ran this test:

fn lowercase(left: char, right: char) -> bool {
   for c in left.to_lowercase() {
      for d in right.to_lowercase() {
         if c == d { return true }
      }
   }
   false
}

fn uppercase(left: char, right: char) -> bool {
   for c in left.to_uppercase() {
      for d in right.to_uppercase() {
         if c == d { return true }
      }
   }
   false
}

fn main() {
   let pairs = &[
      &['\u{00E5}','\u{212B}'],&['\u{00C5}','\u{212B}'],&['\u{0399}','\u{1FBE}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{03B2}','\u{03D0}'],&['\u{03B5}','\u{03F5}'],
      &['\u{03B8}','\u{03D1}'],&['\u{03B8}','\u{03F4}'],&['\u{03D1}','\u{03F4}'],
      &['\u{03B9}','\u{1FBE}'],&['\u{0345}','\u{03B9}'],&['\u{0345}','\u{1FBE}'],
      &['\u{03BA}','\u{03F0}'],&['\u{00B5}','\u{03BC}'],&['\u{03C0}','\u{03D6}'],
      &['\u{03C1}','\u{03F1}'],&['\u{03C2}','\u{03C3}'],&['\u{03C6}','\u{03D5}'],
      &['\u{03C9}','\u{2126}'],&['\u{0392}','\u{03D0}'],&['\u{0395}','\u{03F5}'],
      &['\u{03D1}','\u{03F4}'],&['\u{0398}','\u{03D1}'],&['\u{0398}','\u{03F4}'],
      &['\u{0345}','\u{1FBE}'],&['\u{0345}','\u{0399}'],&['\u{0399}','\u{1FBE}'],
      &['\u{039A}','\u{03F0}'],&['\u{00B5}','\u{039C}'],&['\u{03A0}','\u{03D6}'],
      &['\u{03A1}','\u{03F1}'],&['\u{03A3}','\u{03C2}'],&['\u{03A6}','\u{03D5}'],
      &['\u{03A9}','\u{2126}'],&['\u{0398}','\u{03F4}'],&['\u{03B8}','\u{03F4}'],
      &['\u{03B8}','\u{03D1}'],&['\u{0398}','\u{03D1}'],&['\u{0432}','\u{1C80}'],
      &['\u{0434}','\u{1C81}'],&['\u{043E}','\u{1C82}'],&['\u{0441}','\u{1C83}'],
      &['\u{0442}','\u{1C84}'],&['\u{0442}','\u{1C85}'],&['\u{1C84}','\u{1C85}'],
      &['\u{044A}','\u{1C86}'],&['\u{0412}','\u{1C80}'],&['\u{0414}','\u{1C81}'],
      &['\u{041E}','\u{1C82}'],&['\u{0421}','\u{1C83}'],&['\u{1C84}','\u{1C85}'],
      &['\u{0422}','\u{1C84}'],&['\u{0422}','\u{1C85}'],&['\u{042A}','\u{1C86}'],
      &['\u{0463}','\u{1C87}'],&['\u{0462}','\u{1C87}']
   ];

   let (mut upper, mut lower) = (0, 0);

   for pair in pairs.iter() {
      print!("U+{:04X} ", pair[0] as u32);
      print!("U+{:04X} pass: ", pair[1] as u32);
      if uppercase(pair[0], pair[1]) {
         print!("to_uppercase ");
         upper += 1;
      } else {
         print!("             ");
      }
      if lowercase(pair[0], pair[1]) {
         print!("to_lowercase");
         lower += 1;
      }
      println!();
   }

   println!("upper pass: {}, lower pass: {}", upper, lower);
}

Results are below. Interestingly, one pair (U+03D1 and U+03F4) fails with both methods. Based on these numbers, to_uppercase is the better of the two options.

U+00E5 U+212B pass:              to_lowercase
U+00C5 U+212B pass:              to_lowercase
U+0399 U+1FBE pass: to_uppercase
U+03B9 U+1FBE pass: to_uppercase
U+03B2 U+03D0 pass: to_uppercase
U+03B5 U+03F5 pass: to_uppercase
U+03B8 U+03D1 pass: to_uppercase
U+03B8 U+03F4 pass:              to_lowercase
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: to_uppercase
U+0345 U+03B9 pass: to_uppercase
U+0345 U+1FBE pass: to_uppercase
U+03BA U+03F0 pass: to_uppercase
U+00B5 U+03BC pass: to_uppercase
U+03C0 U+03D6 pass: to_uppercase
U+03C1 U+03F1 pass: to_uppercase
U+03C2 U+03C3 pass: to_uppercase
U+03C6 U+03D5 pass: to_uppercase
U+03C9 U+2126 pass:              to_lowercase
U+0392 U+03D0 pass: to_uppercase
U+0395 U+03F5 pass: to_uppercase
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: to_uppercase
U+0398 U+03F4 pass:              to_lowercase
U+0345 U+1FBE pass: to_uppercase
U+0345 U+0399 pass: to_uppercase
U+0399 U+1FBE pass: to_uppercase
U+039A U+03F0 pass: to_uppercase
U+00B5 U+039C pass: to_uppercase
U+03A0 U+03D6 pass: to_uppercase
U+03A1 U+03F1 pass: to_uppercase
U+03A3 U+03C2 pass: to_uppercase
U+03A6 U+03D5 pass: to_uppercase
U+03A9 U+2126 pass:              to_lowercase
U+0398 U+03F4 pass:              to_lowercase
U+03B8 U+03F4 pass:              to_lowercase
U+03B8 U+03D1 pass: to_uppercase
U+0398 U+03D1 pass: to_uppercase
U+0432 U+1C80 pass: to_uppercase
U+0434 U+1C81 pass: to_uppercase
U+043E U+1C82 pass: to_uppercase
U+0441 U+1C83 pass: to_uppercase
U+0442 U+1C84 pass: to_uppercase
U+0442 U+1C85 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+044A U+1C86 pass: to_uppercase
U+0412 U+1C80 pass: to_uppercase
U+0414 U+1C81 pass: to_uppercase
U+041E U+1C82 pass: to_uppercase
U+0421 U+1C83 pass: to_uppercase
U+1C84 U+1C85 pass: to_uppercase
U+0422 U+1C84 pass: to_uppercase
U+0422 U+1C85 pass: to_uppercase
U+042A U+1C86 pass: to_uppercase
U+0463 U+1C87 pass: to_uppercase
U+0462 U+1C87 pass: to_uppercase
upper pass: 46, lower pass: 8
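
The pairs that fail with one mapping can often be caught by chaining both. The following is a minimal sketch of that idea using only the standard library. `approx_fold` is a hypothetical helper name, and the result is only an approximation of Unicode case folding: it performs no normalization and is not checked here against the full case folding data.

```rust
// Hypothetical helper: uppercase every char, then lowercase the result.
// This approximates case folding; it is not the Unicode algorithm.
fn approx_fold(s: &str) -> String {
    s.chars()
        .flat_map(char::to_uppercase)
        .flat_map(char::to_lowercase)
        .collect()
}

fn main() {
    // U+03D1 / U+03F4: the pair that fails with either mapping alone.
    assert_eq!(approx_fold("\u{03D1}"), approx_fold("\u{03F4}"));
    // U+00E5 / U+212B: matched only by to_lowercase in the table above.
    assert_eq!(approx_fold("\u{00E5}"), approx_fold("\u{212B}"));
    // U+03C2 / U+03C3: matched only by to_uppercase in the table above.
    assert_eq!(approx_fold("\u{03C2}"), approx_fold("\u{03C3}"));
    // Unrelated words stay distinct.
    assert_ne!(approx_fold("Hello"), approx_fold("Help"));
}
```

Working per char rather than calling str::to_lowercase avoids the context-sensitive handling of a word-final sigma, so the output depends only on the individual characters.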

Zombo

The unicase crate doesn't expose case folding directly, but it provides a generic wrapper type that implements Eq, Ord and Hash in a case-insensitive manner. The master branch (unreleased) supports both ASCII case folding (as an optimization) and Unicode case folding (though only invariant case folding is supported).
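To illustrate the wrapper idea without depending on the crate, here is a minimal std-only sketch. `AsciiCaseless` is a hypothetical type written for this example, not unicase's API or implementation, and it only handles ASCII case differences.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical wrapper: equality and hashing ignore ASCII case only.
#[derive(Debug)]
struct AsciiCaseless<S: AsRef<str>>(S);

impl<S: AsRef<str>> PartialEq for AsciiCaseless<S> {
    fn eq(&self, other: &Self) -> bool {
        self.0.as_ref().eq_ignore_ascii_case(other.0.as_ref())
    }
}

impl<S: AsRef<str>> Eq for AsciiCaseless<S> {}

impl<S: AsRef<str>> Hash for AsciiCaseless<S> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the ASCII-lowercased bytes so that equal values hash alike.
        for b in self.0.as_ref().bytes() {
            state.write_u8(b.to_ascii_lowercase());
        }
        // Terminator byte (never valid in UTF-8) to separate adjacent fields.
        state.write_u8(0xff);
    }
}

fn main() {
    let mut seen = HashSet::new();
    assert!(seen.insert(AsciiCaseless("Content-Type")));
    // Differs only in ASCII case, so it counts as already present.
    assert!(!seen.insert(AsciiCaseless("content-type")));
    // Non-ASCII folding such as "ß" vs "SS" is out of scope for this sketch.
    assert_ne!(AsciiCaseless("Maße"), AsciiCaseless("MASSE"));
}
```

Putting the comparison in the type means a HashSet or HashMap key is case-insensitive everywhere it is used, without callers having to remember to fold strings first.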

Francis Gagné