76

How do I do a case-insensitive string comparison?

From what I understand from Google and the link above, both functions, lower() and casefold(), convert the string to lowercase, but casefold() also converts letters that lower() leaves unchanged, such as the German ß, which becomes ss.

All of that is about Greek letters, but my question is more general:

  • are there any other differences?
  • which one is better to convert to lowercase?
  • which one is better to check the matching strings?

Part 2:

firstString = "der Fluß"
secondString = "der Fluss"

# ß is equivalent to ss
if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

In the example above should I use:

lower() # the result is "not equal", which makes sense to me

Or:

casefold() # ß becomes ss, so the result is that the
        # strings are equal. (Since I am a beginner, that still does not
        # make sense to me. I see different strings.)
Georgy

4 Answers

85

TL;DR

  • Converting to Lowercase -> lower()
  • Caseless String matching/comparison -> casefold()

casefold() is a text normalization function, like lower(), but it is specifically designed to remove upper- and lower-case distinctions for the purposes of comparison. It may initially appear very similar to lower() because the results are usually the same: as of Unicode 13.0.0, only ~300 of ~150,000 characters produce differing results when passed through lower() and casefold(). @dlukes' answer has the code to identify those characters.

To answer your other two questions:

  • use lower() when you specifically want to ensure a character is lowercase, like for presenting to users or persisting data
  • use casefold() when you want to compare that result to another casefold-ed value.
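As a minimal sketch of the second bullet (the helper name caseless_equal is hypothetical, not part of the standard library), a comparison folds both operands and leaves the original strings untouched:

```python
def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings while ignoring case distinctions."""
    # Fold both operands; folding only one side would miss matches.
    return left.casefold() == right.casefold()


first = "der Fluß"
second = "der Fluss"

print(caseless_equal(first, second))    # True: casefold() maps ß to ss
print(first.lower() == second.lower())  # False: lower() keeps ß as it is
```

The strings themselves are never modified, so the original spelling stays available for display or storage.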

Other Material

I suggest you take a closer look into what case folding actually is, so here's a good start: W3 Case Folding Wiki

Another source: Elastic.co Case Folding

Edit: I just recently found another very good related answer to a slightly different question here on SO (doing a case-insensitive string comparison)


Performance

Using this snippet, you can get a sense for the performance between the two:

import sys
from timeit import timeit

unicode_codepoints = tuple(map(chr, range(sys.maxunicode)))

def compute_lower():
    return tuple(codepoint.lower() for codepoint in unicode_codepoints)

def compute_casefold():
    return tuple(codepoint.casefold() for codepoint in unicode_codepoints)

timer_repeat = 1000

print(f"time to compute lower on unicode namespace: {timeit(compute_lower, number = timer_repeat) / timer_repeat} seconds")
print(f"time to compute casefold on unicode namespace: {timeit(compute_casefold, number = timer_repeat) / timer_repeat} seconds")

print(f"number of distinct characters from lower: {len(set(compute_lower()))}")
print(f"number of distinct characters from casefold: {len(set(compute_casefold()))}")

Running this shows that the two are nearly identical in both performance and the number of distinct results returned:

time to compute lower on unicode namespace: 0.137255663 seconds
time to compute casefold on unicode namespace: 0.136321374 seconds
number of distinct characters from lower: 1112719
number of distinct characters from casefold: 1112694

If you run the numbers, that means it takes about 1.2e-07 seconds (roughly 0.137 seconds divided by 1,114,111 code points) to run the computation on a single character for either function, so there isn't a performance benefit either way.

David Culbreth
  • 7
    To nitpick: It's naïve to think that English language text contains only ASCII characters, although these days it might be rather rare. All those loan words from French and other languages. – Voo Jun 16 '19 at 16:46
  • @Voo, you are correct in saying that applications dealing with the English language may encounter non-English data, however, that's why I specified `with our simple 26-letter alphabet`. Casefolding is dramatically more effective at normalizing information for internationalized text. – David Culbreth Jun 16 '19 at 22:34
  • 3
    @Dave My point is that English does not have a "simple 26-letter alphabet". Naïve with ï is a valid, if rare, English spelling. Some people still use diaereses and so on. (Raymond Chen from the oldnewthing or the New Yorker come to mind) – Voo Jun 17 '19 at 06:45
  • 3
    `.lower()` vs `.casefold()` has nothing to do with ASCII vs. Unicode, please see [my answer](https://stackoverflow.com/a/74702121/1826241) for details. – dlukes Dec 06 '22 at 11:58
  • 2
    @dlukes is totally right. Go see his answer. – David Culbreth Dec 07 '22 at 16:23
25

Both .lower() and .casefold() act on the full range of Unicode codepoints

There's some confusion in the existing answers, even the accepted one (EDIT: I was referring to this currently outdated version; the current one is fine). The distinction between .lower() and .casefold() has nothing to do with ASCII vs. Unicode: both act on the whole Unicode range of codepoints, just in slightly different ways. Both perform relatively complex mappings which they need to look up in the Unicode database, for instance:

>>> "Ť".lower()
'ť'

Both can involve single-to-multiple codepoint mappings, like we saw with "ß".casefold(). Look what happens to ß when you apply .lower()'s counterpart .upper():

>>> "ß".upper()
'SS'

And the one example I found where .lower() also does this:

>>> list("İ".lower())
['i', '̇']

So the performance claims, like "lower() will require less memory or less time because there are no lookups, and it's only dealing with 26 characters it has to transform", are simply not true.

The vast majority of the time, both operations yield the same thing, but there are a few cases (297 as of Unicode 13.0.0) where they don't. You can identify them like this:

import sys
import unicodedata as ud

print("Unicode version:", ud.unidata_version, "\n")
total = 0
for codepoint in map(chr, range(sys.maxunicode + 1)):  # + 1 so the last code point is included
    lower, casefold = codepoint.lower(), codepoint.casefold()
    if lower != casefold:
        total += 1
        for conversion, converted in zip(
            ("orig", "lower", "casefold"),
            (codepoint, lower, casefold)
        ):
            print(conversion, [ud.name(cp) for cp in converted], converted)
        print()
print("Total differences:", total)

When to use which

The Unicode standard covers lowercasing as part of Default Case Conversion in Section 3.13, and Default Case Folding is described right below that. The first paragraph says:

Case folding is related to case conversion. However, the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form.

My rule of thumb based on this:

  • Want to display a lowercased version of a string to users? Use .lower().
  • Want to do case-insensitive string comparison? Use .casefold().

(As a sidenote, I routinely break this rule of thumb and use .lower() across the board, just because it's shorter to type, the output is overwhelmingly the same, and what differences there are don't affect the languages I typically come across and work with. Don't be like me though ;) )
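To illustrate the rule of thumb, here is a small sketch in which the casefolded form is used only as a lookup key while the original spelling is kept for display. The function name build_caseless_index and the sample data are made up for this example:

```python
def build_caseless_index(names):
    """Map each casefolded key to the first original spelling seen."""
    index = {}
    for name in names:
        # The folded form is used only as a lookup key, never displayed.
        index.setdefault(name.casefold(), name)
    return index


index = build_caseless_index(["Straße", "STRASSE", "Python"])

# Lookups fold the query the same way the keys were folded.
print(index.get("strasse".casefold()))  # Straße
print(index.get("PYTHON".casefold()))   # Python
```

Keeping the first spelling seen is an arbitrary choice for this sketch; an application might prefer to keep every variant instead.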

Just to hammer home that in terms of complexity, both operations are basically the same, they just use slightly different mappings -- this is Unicode's abstract definition of lowercasing:

R2 toLowercase(X): Map each character C in X to Lowercase_Mapping(C).

And this is its abstract definition of case folding:

R4 toCasefold(X): Map each character C in X to Case_Folding(C).
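The R4 definition can be mimicked in a few lines. This is only a sketch (fold_per_character is a hypothetical helper, not how CPython is implemented internally). It uses .casefold() only, since CPython's .lower() additionally treats a word-final Greek sigma in a context-sensitive way:

```python
def fold_per_character(text: str) -> str:
    """Apply casefold() to each code point and concatenate the results."""
    return "".join(character.casefold() for character in text)


sample = "Straße"

# Folding character by character gives the same result as folding the string.
print(fold_per_character(sample) == sample.casefold())  # True

# One code point may fold to several, so the length can change.
print(len(sample), len(sample.casefold()))  # 6 7
```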

In Python's official documentation

The Python docs are quite clear that this is what the respective methods do; they even point the user to the aforementioned Section 3.13.

They describe .lower() as converting cased characters to lowercase, where cased characters are "those with general category property being one of “Lu” (Letter, uppercase), “Ll” (Letter, lowercase), or “Lt” (Letter, titlecase)". Same with .upper() and uppercase.

With .casefold(), the docs explicitly state that it's meant for "caseless matching", and that it's "similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string".

dlukes
  • How should str.casefold() be used in comparison? If you compare a string literal to a variable x, and the string literal is written in lowercase, is it OK to have it as 'lowercase string' == x.casefold()? Or is it standard to call casefold on both operands: 'lowercase string'.casefold() == x.casefold()? In general, is x.casefold() == y.casefold() the way casefolding is meant to be used for comparison, calling it on both operands? – Stephen Frost Feb 02 '23 at 20:48
  • 1
    @StephenFrost Both operands should be casefolded in a comparison. But if one of them is a literal, you can of course just make sure the literal is already casefolded, so you can avoid having to call the method on it. – dlukes Feb 03 '23 at 22:31
  • 1
    @dlukes another key difference between lower() and casefold(), sometimes removing case distinction generates uppercase characters. For one language I work with: "ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ".lower() -> 'ꮳꮃꭹ ꭶꮼꮒꭿꮝꮧ' while "ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ".casefold() -> 'ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ'. – Andj Mar 03 '23 at 06:56
0

If I may add an explanation for the "ß to ss" mapping: the "ß" cannot be uppercased, though "SS" can. This probably explains why they decided to replace it, although from a technical point of view it is unnecessary and can even lead to problems when dealing with names. For example, if you want to know whether a person "MICHAEL WEISS" exists and you run a case-insensitive search through items, .casefold() will (incorrectly) also list "Michael Weiß" when both variants exist. This makes it the undesired choice in some situations, and one always has to keep that in mind. The good thing, on the other hand, is that you get better support for people without a German keyboard, who can then actually find "Michael Weiß" by typing in "MICHAEL WEISS".

  • Actually "ß" can be uppercased. Normally, in Standard German, "ß" is uppercased to "SS", although there are cases where it may be uppercased to "SZ". In 2017, there was a change to the Standard German Orthography where "ß" could be optionally uppercased to "ẞ". "ẞ" will lowercase to "ß". Lowercasing "SS" is always a problem. Casefolding is different: both "ß" and "ẞ" will fold to "ss". This has more to do with removing case distinctions, since Standard Swiss German doesn't use "ß". – Andj Aug 16 '23 at 14:33
  • "Michael Weiß" is also "Michael Weiss", and "MICHAEL WEISS" is also "MICHAEL WEIẞ". – Andj Aug 16 '23 at 14:34
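The mappings described in these comments can be checked directly in Python. The following is a small sketch; the variable names are chosen only for illustration:

```python
small_sharp_s = "ß"         # U+00DF LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ, U+1E9E LATIN CAPITAL LETTER SHARP S

print(small_sharp_s.upper())       # 'SS': uppercasing expands to two letters
print(capital_sharp_s.lower())     # 'ß'
print(capital_sharp_s.casefold())  # 'ss': both forms fold to the same string

# Consequence for name searches: both spellings match after folding.
names_match = "Michael Weiß".casefold() == "MICHAEL WEISS".casefold()
print(names_match)  # True
```

If "Weiß" and "Weiss" must be treated as different people, an exact comparison on the stored spelling is still needed after the caseless match.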
-1
print("∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π".casefold())     #∑∂˜∂ˆ´ˆˆçμμ∂˚ss˚ø≤∑∑π
print("∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π".lower())        #∑∂˜∂ˆ´ˆˆçµµ∂˚ß˚ø≤∑∑π

I was playing around, and of the two only casefold() converted the character 'ß'. I might just stick with casefold() if it's more accurate, even slightly.

Woody1193
  • 1
    The key thing to remember is that lower() is a casing operation used to transform the text in a specific way. Casefolding is part of a caseless matching algorithm and should not be used to transform text, it is used to compare or match, so shouldn't change the data that is being compared. For cases like "ß" vs "ss", in actual data they are orthographic distinctions in Standard German and usage is contrastive with Swiss Standard German. – Andj Mar 03 '23 at 06:49