Regex Pattern with Unicode doesn't do case folding

Question

In C# it appears that Grüsse and Grüße are considered equal in most circumstances as is explained by this nice webpage. I'm trying to find a similar behavior in Java - obviously not in java.lang.String.

I thought I was in luck with java.util.regex.Pattern in combination with Pattern.UNICODE_CASE. The Javadoc says:

UNICODE_CASE enables Unicode-aware case folding. When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard.

Yet the following code:

Pattern p = Pattern.compile(Pattern.quote("Grüsse"), 
                     Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
System.out.println(p.matcher("Grüße").matches());

yields false. Why? And is there an alternative way of reproducing the C# case folding behavior?

---- edit ----

As @VGR pointed out, String.toUpperCase will convert ß to ss, which may or may not be case folding (maybe I'm confusing concepts here). However other characters in the German locale are not "folded", for instance ü does not become UE. So to make my initial example more complete, is there a way to make Grüße and Gruesse compare equal in Java?

I was thinking the java.text.Normalizer class could be used to do just that, but it converts ü to u followed by a combining diaeresis (U+0308) rather than to ue. It also has no option to provide a Locale, which confuses me even more.
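For illustration, a minimal sketch of what Normalizer does with ü. The class name NormalizerDemo and the helper decompose are made up for this example. NFD is a canonical decomposition, so it splits ü into a plain u plus the combining mark; it is not a language-specific transliteration, which is presumably why there is no Locale parameter.

```java
import java.text.Normalizer;

public class NormalizerDemo {
    // Hypothetical helper: returns the NFD (canonical decomposition) form.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // U+00FC is the precomposed "u with diaeresis".
        String decomposed = decompose("\u00FC");
        // Print each code unit so the combining mark becomes visible.
        for (char c : decomposed.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

This prints U+0075 followed by U+0308, that is, the letter u and the combining diaeresis.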

why: the javadoc also says: "This class is in conformance with Level 1 of [Unicode Technical Standard #18: Unicode Regular Expression](http://www.unicode.org/reports/tr18/), plus RL2.1 Canonical Equivalents." - and the linked doc states "at Level 1 only simple case matches are necessary" - probably will not work with the actual java SE implementation... — user85421, Jan 13 '17 at 15:31
I’m pretty sure `s1.toUpperCase().equals(s2.toUpperCase())` will work. equalsIgnoreCase will not work, because, as its documentation says, it performs case conversions one character at a time, so only one-to-one character mappings are applied. — VGR, Jan 13 '17 at 16:00
Thanks both for your insightful comments. @VGR you're right, this works. But I've thrown in another question - please see edit. — geert3, Jan 13 '17 at 19:58
`"SS".equals("ß".toUpperCase(Locale.GERMAN))` holds, but `"ss".equals("ß")` does not. — Joop Eggen, Jan 13 '17 at 20:09
A [Collator](http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html) with its strength set to [PRIMARY](http://docs.oracle.com/javase/8/docs/api/java/text/Collator.html#PRIMARY) will accomplish your original goal. But I’m not sure if it can be made to equate “Grüße” and “Gruesse”. — VGR, Jan 13 '17 at 20:42
Why is case so difficult to get right in regular expressions? In find-and-replace tools based on regular expressions, case often matters a great deal, so why have standard regex engines not yet been able to deal with this issue universally and correctly? — FinnTheHuman, Jan 18 '17 at 03:07

score 1 · Answer 1 · answered Jan 18 '17 at 03:03


Use the ICU4J regular expressions, not the JDK ones: http://userguide.icu-project.org/strings/regexp#TOC-Case-Insensitive-Matching


Mihai Nita


score 1 · Answer 2 · answered Jun 28 '21 at 09:33

With the currently accepted answer:

foo.toUpperCase().equals(bar.toUpperCase())

The following inputs do not compare equal even though they should: Grüsse and GRÜẞE; or Grüße and GRÜẞE.

Why is that? Let's look at the uppercased strings:

"Grüsse".toUpperCase(Locale.ROOT)  -> "GRÜSSE"
"Grüße".toUpperCase(Locale.ROOT)   -> "GRÜSSE"
"GRÜẞE".toUpperCase(Locale.ROOT)   -> "GRÜẞE"

As you can see, the uppercase "sharp S" (ẞ) stays that way. To handle that correctly, do this:

foo.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT).equals(
    bar.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT))

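Note that the order is important. If you uppercase first and then lowercase, ẞ is left unchanged by the first step and only becomes ß (lowercase sharp S) in the second, so it never expands to the ss that the other spellings produce.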
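The lowercase-then-uppercase comparison above can be wrapped in a small helper. This is only a sketch: the class name SharpSCompare and the methods normalizeCase and equalsIgnoringCase are hypothetical names chosen for this example, and the string literals use \u escapes so the file compiles regardless of source encoding.

```java
import java.util.Locale;

public class SharpSCompare {
    // Lowercase first so U+1E9E (capital sharp s) becomes U+00DF,
    // then uppercase so U+00DF expands to "SS".
    static String normalizeCase(String s) {
        return s.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT);
    }

    // Hypothetical helper: compares two strings after the two-step mapping.
    static boolean equalsIgnoringCase(String a, String b) {
        return normalizeCase(a).equals(normalizeCase(b));
    }

    public static void main(String[] args) {
        String withSs = "Gr\u00FCsse";          // spelled with "ss"
        String withSharpS = "Gr\u00FC\u00DFe";  // spelled with lowercase sharp s
        String capital = "GR\u00DC\u1E9EE";     // spelled with capital sharp s

        System.out.println(equalsIgnoringCase(withSs, capital));
        System.out.println(equalsIgnoringCase(withSharpS, capital));
        // Uppercasing alone leaves U+1E9E untouched, so this comparison fails.
        System.out.println(withSs.toUpperCase(Locale.ROOT)
                .equals(capital.toUpperCase(Locale.ROOT)));
    }
}
```

The first two comparisons print true and the last one prints false, matching the table of uppercased strings above.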

score 0 · Accepted Answer · answered Jan 18 '17 at 09:21

For reference, the following facts:

Character.toUpperCase() cannot do this kind of case folding (ß to SS), as one character must map to one character.
String.toUpperCase() will do it, because it may map one character to several.
String.equalsIgnoreCase() compares character by character using Character.toUpperCase() internally, so it doesn't do it either.

Conclusion (as @VGR pointed out): if you need case insensitive matching with case folding, you need to do:

foo.toUpperCase().equals(bar.toUpperCase())

and not:

foo.equalsIgnoreCase(bar)
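A short sketch of the difference, assuming Locale.ROOT is passed explicitly (the snippets above rely on the default locale). The class name CaseCompareDemo and its two helper methods are made up for this example.

```java
import java.util.Locale;

public class CaseCompareDemo {
    // Comparison via String.toUpperCase, which may expand one char to several.
    static boolean equalViaUpperCase(String a, String b) {
        return a.toUpperCase(Locale.ROOT).equals(b.toUpperCase(Locale.ROOT));
    }

    // Comparison via String.equalsIgnoreCase, which works char by char.
    static boolean equalViaIgnoreCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String withSs = "Gr\u00FCsse";          // spelled with "ss"
        String withSharpS = "Gr\u00FC\u00DFe";  // spelled with U+00DF (sharp s)

        System.out.println(equalViaUpperCase(withSs, withSharpS));
        System.out.println(equalViaIgnoreCase(withSs, withSharpS));
        // The char-level method has no single-char uppercase for U+00DF,
        // so it returns the character unchanged.
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF');
    }
}
```

This prints true, false, true: only the String-level uppercase expands ß to SS.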

As for the ü and ue equality, I've managed to do it with a RuleBasedCollator and my own rules (one would expect Locale.GERMAN to have that built in, but alas). It looked really silly/over-engineered, and since I needed only the equality, not the sorting/collating, in the end I've settled for a simple set of String.replace calls prior to comparison. It sucks but it works and is transparent/readable.
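A minimal sketch of what such a String.replace approach could look like. The exact replacement table is an assumption (the usual German transliterations ä to ae, ö to oe, ü to ue, ß to ss), and the names GermanKey, comparisonKey and sameGerman are hypothetical. It also assumes precomposed (NFC) input; a decomposed ü (u followed by U+0308) would not be matched.

```java
import java.util.Locale;

public class GermanKey {
    // Hypothetical helper: builds a comparison key by lowercasing and then
    // expanding German umlauts and sharp s (assumed replacement table).
    static String comparisonKey(String s) {
        return s.toLowerCase(Locale.ROOT)   // also turns U+1E9E into U+00DF
                .replace("\u00E4", "ae")    // a with diaeresis
                .replace("\u00F6", "oe")    // o with diaeresis
                .replace("\u00FC", "ue")    // u with diaeresis
                .replace("\u00DF", "ss");   // sharp s
    }

    // Two strings are considered the same if their keys match.
    static boolean sameGerman(String a, String b) {
        return comparisonKey(a).equals(comparisonKey(b));
    }

    public static void main(String[] args) {
        // Umlaut and sharp s on one side, "ue" and "ss" on the other.
        System.out.println(sameGerman("Gr\u00FC\u00DFe", "Gruesse"));
        // Dropping the "e" of "ue" is not an accepted spelling here.
        System.out.println(sameGerman("Gr\u00FC\u00DFe", "Grusse"));
    }
}
```

The first comparison prints true and the second prints false. Note that this kind of key also equates words that merely happen to contain ue, ae or oe, which may or may not be acceptable.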

Note that Unicode distinguishes between *simple* case folding (where 1 character is replaced by 1 other) and *full* case folding. *Full* case folding is required to turn `ß` into `ss`. For all the mappings, see [CaseFolding.txt](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt) in Unicode. — robinst, Jun 28 '21 at 06:59
This is not true in general. For example, it will not work for comparing "ﬂüßchen" to "FLÜSSCHEN", or under regex using either of these patterns: ".*(ß).*", ".*(ss).*" — Hoobajoob, Dec 06 '21 at 16:18
