769

How can I compare strings in a case insensitive way in Python?

I would like to encapsulate the comparison of regular strings to a repository string, using simple and Pythonic code. I would also like to be able to look up values in a dict keyed by strings using regular Python strings.

Karl Knechtel
Kozyarchuk

15 Answers

790

Assuming ASCII strings:

string1 = 'Hello'
string2 = 'hello'

if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

As of Python 3.3, casefold() is a better alternative:

string1 = 'Hello'
string2 = 'hello'

if string1.casefold() == string2.casefold():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are NOT the same (case insensitive)")

If you want a more comprehensive solution that handles more complex Unicode comparisons, see the other answers.
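To see where the two methods diverge, here is a small sketch comparing lower() and casefold() on the German "ß". The helper names equal_lower and equal_casefold are illustrative only and assume Python 3.3 or later.

```python
def equal_lower(a: str, b: str) -> bool:
    """Compare using str.lower(); adequate for ASCII-only text."""
    return a.lower() == b.lower()


def equal_casefold(a: str, b: str) -> bool:
    """Compare using str.casefold(), which applies full Unicode case folding."""
    return a.casefold() == b.casefold()


# ASCII input: both approaches agree.
print(equal_lower("Hello", "hello"), equal_casefold("Hello", "hello"))          # True True

# "ß" case-folds to "ss", but lower() leaves it unchanged.
print(equal_lower("Straße", "STRASSE"), equal_casefold("Straße", "STRASSE"))    # False True
```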

kevinpo
Harley Holcombe
  • 89
    That doesn’t always work. Consider, for example, that there are two Greek sigmas, one only used at the end. The string *Σίσυφος* (“Sísyphos”, or better “Síſyphos”) has all three: uppercase at the front, lowercase final at the end, and lowercase nonfinal at the third position. If your two strings are `Σίσυφος` and `ΣΊΣΥΦΟΣ`, then your approach fails, because those are supposed to be the same case insensitively. – tchrist Jul 19 '12 at 13:42
  • 65
    @ The last two commenters: I think it's fair to assume both strings are ascii strings. If you're looking for an answer to something a bit more exciting I'm sure it's out there (or you can ask it). – Harley Holcombe Jul 20 '12 at 01:34
  • 7
    The .lower() approach will work in Python 3, for the two Greek strings mentioned above, at least. See my answer for more details. – Nathan Craike Jul 20 '12 at 05:28
  • 1
    @tchrist Are there any examples where the .lower() approach doesn't work in Python 3? The Greek example you've given seems to work fine in Python 3. Also, it'd be great if you could post a solution that does handle these edge cases correctly, even if it's using a third-party module like pyICU. – Nathan Craike Jul 24 '12 at 10:50
  • 28
    Problem: `'ß'.lower() == 'SS'.lower()` is False. – kennytm Aug 28 '13 at 14:10
  • 1
    @KennyTM, why would that be a problem? https://en.wikipedia.org/wiki/Capital_%E1%BA%9E seems to be lowered correctly. – exic Dec 05 '13 at 16:00
  • 13
    Greek letters are not the only special case! In U.S. English, the character "i" (\u0069) is the lowercase version of the character "I" (\u0049). However, the Turkish ("tr-TR") alphabet includes an "I with a dot" character "İ" (\u0130), which is the capital version of "i", and "I" is the capital version of the "i without a dot" character, "ı" (\u0131). – Gqqnbig Dec 10 '13 at 02:08
  • 6
    @exic, that Wikipedia article is pretty clear that according to most Germans, "Capital Eszett" is not a real letter. It's encoded in Unicode so that there is a representation for certain typographic curiosities, but it's irrelevant to KennyTM's point. (That is, you are arguing that German and Turkish should change their writing systems to play better with Python semantics, but it's more usual to argue the opposite: that Python should find a way to handle German and Turkish writing systems as they are used by real German and Turkish people.) – Quuxplusone Mar 31 '14 at 18:56
  • 34
    @HarleyHolcombe how is it safe (or fair) to assume the strings are ascii? The question did not specify, and if the strings are at any point entered by or shown to a user, then you should be supporting internationalization. Regardless, new programmers will be reading this and we should give them the truly correct answer. – Ethan Reesor Apr 27 '16 at 18:28
  • 1
    To commenters above, this answer is fine. It works fine. If you want to pass non-English Latin languages, Greek languages, Cyrillic languages, Armenian languages, or strange characters, then see @Veedrac 's answer. – user3932000 Jul 07 '16 at 10:54
  • 8
    @user3932000 In other words, this answer is only fine when you're dealing with text that is truly exclusively English. For most people, namely people whose native language isn't English, people who have to deal with l10n/i18n issues, and people who have to deal with Unicode input sanitation, that means this answer is **wrong**. –  Aug 19 '16 at 12:31
  • 5
    @Rhymoid yes. It doesn't work even for "exclusively English" text e.g., `"\ufb01sh".casefold() == "Fish".casefold()` (the first string starts with the "ﬁ" ligature, U+FB01) works while `.lower()` fails here. Though there may be cases where [even `.casefold()` is not enough](http://stackoverflow.com/a/40551443/4279) – jfs Nov 11 '16 at 21:24
  • 4
    @user3932000 Then the answer is largely pointless in any professional context. This is _not_ the correct way to compare strings in a case-insensitive manner. It's a workaround that doesn't break in some specific cases. – Basic Dec 02 '16 at 17:28
  • @Basic You're right that it's pointless in a professional context. No client will want an algorithm that'll break so easily with unconventional or non-English input. But for personal purposes, this algorithm is practical and perfectly fine. – user3932000 Dec 02 '16 at 19:15
  • @Quuxplusone as was to be expected, in June 2017, the ẞ became part of the official German orthography. The question’s algorithm however remains incorrect, e.g. because of the Turkish example and others. – flying sheep May 22 '18 at 11:57
  • 3
    @HarleyHolcombe Replacing `.lower()` with `.casefold()` is so easy and avoids so many problems (not all), we should teach new programmers to use the much-improved version and avoid teaching stuff which only works with a fraction of global users. – Marcel Waldvogel Nov 11 '20 at 15:26
  • 7
    "Assuming ASCII" was already anachronistic in 2008. In 2021, it is simply ignorant. Just because SO is in English doesn't mean that solutions only have to work for English. – A. Donda Jul 07 '21 at 02:33
712

Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal - that's the newer capital form. The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. [...]

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).

Then you should consider accents. If your font renderer is good, the precomposed "ê" (U+00EA) and "e" followed by a combining circumflex (U+0302) look identical, but they do not compare equal (escapes are used here so the difference is visible):

>>> "\u00ea" == "e\u0302"
False

This is because the accent on the latter is a combining character.

>>> import unicodedata
>>> [unicodedata.name(char) for char in "\u00ea"]
['LATIN SMALL LETTER E WITH CIRCUMFLEX']
>>> [unicodedata.name(char) for char in "e\u0302"]
['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does

>>> unicodedata.normalize("NFKD", "\u00ea") == unicodedata.normalize("NFKD", "e\u0302")
True

To finish up, here is this expressed as functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
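
As a quick check of how these functions behave, the sketch below repeats the two definitions so that it runs on its own and then tries a few inputs written with escapes. The last case shows that NFKD also equates compatibility characters such as "①" with "1", which may or may not be what you want.

```python
import unicodedata


def normalize_caseless(text):
    # Case-fold first, then decompose with compatibility normalization.
    return unicodedata.normalize("NFKD", text.casefold())


def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)


print(caseless_equal("BUSSE", "Bu\u00dfe"))   # "Buße": ß folds to "ss"
print(caseless_equal("BUSSE", "BU\u1e9eE"))   # "BUẞE": capital ẞ also folds to "ss"
print(caseless_equal("\u00ea", "e\u0302"))    # precomposed vs. combining circumflex
print(caseless_equal("\u2460", "1"))          # "①" vs. "1": equal under NFKD
```

All four calls print True. If the last result is too aggressive for your data, NFD can be substituted for NFKD at the cost of no longer matching compatibility forms.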
wjandrea
Veedrac
  • 16
    A better solution is to normalize all your strings on intake, then you can just do `x.casefold() == y.casefold()` for case-insensitive comparisons (and, more importantly, `x == y` for case-sensitive). – abarnert May 01 '15 at 10:44
  • 7
    @abarnert Indeed, depending on context - sometimes it's better to leave the source intact but upfront normalization can also make later code much simpler. – Veedrac May 01 '15 at 12:13
  • 5
    @Veedrac: You're right, it's not always appropriate; if you need to be able to output the original source unchanged (e.g., because you're dealing with filenames on Linux, where NFKC and NFKD are both allowed and explicitly supposed to be different), obviously you can't transform it on input… – abarnert May 01 '15 at 22:14
  • 8
    Unicode Standard section 3.13 has two other definitions for caseless comparisons: (D146, canonical) `NFD(toCasefold(NFD(str)))` on both sides and (D147, compatibility) `NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))` on both sides. It states the inner `NFD` is solely to handle a certain Greek accent character. I guess it's all about the edge cases. –  Apr 12 '16 at 17:06
  • 2
    And a bit of fun with the Cherokee alphabet, where casefold() goes to uppercase:>>> "ᏚᎢᎵᎬᎢᎬᏒ".upper() 'ᏚᎢᎵᎬᎢᎬᏒ' >>> "ᏚᎢᎵᎬᎢᎬᏒ".lower() 'ꮪꭲꮅꭼꭲꭼꮢ' >>> "ᏚᎢᎵᎬᎢᎬᏒ".casefold() 'ᏚᎢᎵᎬᎢᎬᏒ' >>> – bortzmeyer Oct 02 '17 at 18:41
  • 1
    If you are using Python 2 you might want to check out [py2casefold](https://pypi.python.org/pypi/py2casefold) to get the missing `casefold` functionality. – kuzzooroo Nov 26 '17 at 16:51
  • No, I would certainly not want "Busse" and "Buße" to be considered equal. They are two different words with different pronunciation and totally different meaning. – jakun Oct 11 '19 at 06:34
  • 1
    @jakun [“SS” is the traditional capitalization of “ß”](https://en.wikipedia.org/wiki/%C3%9F#Capital_form), so “BUSSE” is the capitalization of “buße”, no? Therefore “BUSSE” should be case insensitively equal to “buße”. – Veedrac Oct 11 '19 at 16:58
  • @Veedrac true. I think my comment was a little too harsh, sorry about that. I believe the capital ẞ is not that widely known yet and the Duden still allows SS instead. Therefore "BUSSE" is indeed ambiguous. The fact that I read "BUSSE" first as a capitalized form of "Busse" is not only because of the outdated capitalization but probably more because "Buße" is less common than "Busse". Nevertheless, `"Busse".casefold() == "Buße".casefold()` returns `True` and that is imho wrong. – jakun Oct 14 '19 at 13:46
  • 1
    NFKD is probably too aggressive normalization for most string comparison use cases. That produces results like `'①' == '1'` or `'︷' == '{'`. – user2357112 Aug 09 '21 at 18:56
  • @jakun, not 100% true! Busse and Buße can be equal! What do you do if a Swiss user is running your script? ß is always written as ss in Switzerland (no exception), and it is legitimate to use ss instead of ß in Germany as well, especially if you don't have a German keyboard (e.g. UK, US, CH, FR and so on). A difference between Busse and Buße/Busse can only be found in the context of a sentence. Comparing Busse and Buße should give equal. Otherwise comparing the name Süßkind with Suesskind on a flight ticket would never be equal if the function worked as you expect. ;) – Thomas Sep 23 '21 at 15:25
65

Using Python 2, calling .lower() on each string or Unicode object...

string1.lower() == string2.lower()

...will work most of the time, but indeed doesn't work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.

However, in the Python 3 version shown below, lower() takes context into account: a Σ at the end of a word becomes ς and any other Σ becomes σ, so calling lower() on both strings will work correctly:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ

>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge-cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
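
The same check can be reproduced without a file. This is a minimal sketch, assuming Python 3.3 or later; the two strings are decoded from the UTF-8 bytes shown in the Python 2 session above, and the casefold() comparison is included because it does not depend on the position of the sigma.

```python
# The two test strings, rebuilt from the byte values printed earlier.
first = b'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82'.decode('utf8')   # Σίσυφος
second = b'\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3'.decode('utf8')  # ΣΊΣΥΦΟΣ

# lower() applies the final-sigma rule, so both strings lowercase identically.
print(first.lower() == second.lower())

# casefold() maps every sigma variant, including final ς, to σ.
print(first.casefold() == second.casefold())
print('\u03c2' in first.casefold())  # no final sigma (ς) is left after folding
```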

Kaushik NP
Nathan Craike
  • 20
    To make the comparison even more robust, starting with Python 3.3 you can use casefold (e.g., first.casefold() == second.casefold()). For Python 2 you can use PyICU (see also: http://icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#a76f9027fbe4aa6f5b863c2a4a7148078) – kgriffs Jan 02 '14 at 16:38
59

Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).

Casefolding does not preserve the normalization of strings in all instances, so normalization needs to be done as well (for example, precomposed 'å', U+00E5, vs. 'a' followed by U+030A COMBINING RING ABOVE). D145 introduces "canonical caseless matching":

import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice for very infrequent edge cases involving the U+0345 character.

Example (written with escapes so the two forms can be told apart):

>>> '\u00e5'.casefold() == 'a\u030a'.casefold()
False
>>> canonical_caseless('\u00e5') == canonical_caseless('a\u030a')
True

There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
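
Compatibility caseless matching can be sketched in the same style. The formula used here, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))), is the one quoted from section 3.13 in a comment further up this page; the function name compatibility_caseless is my own and not part of any library.

```python
import unicodedata


def NFD(text):
    return unicodedata.normalize('NFD', text)


def NFKD(text):
    return unicodedata.normalize('NFKD', text)


def compatibility_caseless(text):
    # NFKD(toCasefold(NFKD(toCasefold(NFD(X))))), applied inside out.
    return NFKD(NFKD(NFD(text).casefold()).casefold())


square_mhz = '\u3392'  # '㎒', SQUARE MHZ

# Default caseless matching leaves the square character alone.
print(square_mhz.casefold() == 'mhz'.casefold())

# Compatibility caseless matching decomposes it to "MHz" and then folds case.
print(compatibility_caseless(square_mhz) == compatibility_caseless('MHZ'))
```

The first comparison is False and the second is True, which is the distinction between default and compatibility matching.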

jfs
  • 6
    This is the best answer for Python 3, because Python 3 uses Unicode strings and the answer describes how the Unicode standard defines caseless string matching. – SergiyKolesnikov Dec 23 '16 at 17:23
  • Unfortunately, as of Python 3.6, the `casefold()` function does not implement the special case treatment of uppercase I and dotted uppercase I as described in [Case Folding Properties](http://www.unicode.org/Public/9.0.0/ucd/CaseFolding.txt). Therefore, the comparison may fail for words from Turkic languages that contain those letters. For example, `canonical_caseless('LİMANI') == canonical_caseless('limanı')` must return `True`, but it returns `False`. Currently, the only way to deal with this in Python is to write a casefold wrapper or use an external Unicode library, such as PyICU. – SergiyKolesnikov Dec 23 '16 at 18:17
  • @SergiyKolesnikov .casefold() behaves as it should as far as I can tell. From the standard: *"the default casing operations are intended for use in the **absence** of tailoring for particular languages and environments"*. Casing rules for the Turkish dotted capital I and dotless small i are in SpecialCasing.txt. *"For non-Turkic languages, this mapping is normally not used."* From the Unicode FAQ: [Q: Why aren't there extra characters encoded to support locale-independent casing for Turkish?](http://unicode.org/faq/casemap_charprop.html#9) – jfs Dec 23 '16 at 20:13
  • 1
    @j-f-sebastian I didn't say that casefold() misbehaves. It just would be practical if it implemented an optional parameter that enabled the special treatment of uppercase and dotted uppercase I. For example, the way [the foldCase() in the ICU library does it](https://ssl.icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#a0924f873180947aab38b7380da638533): "Case-folding is locale-independent and not context-sensitive, but there is an option for whether to include or exclude mappings for dotted I and dotless i that are marked with 'T' in CaseFolding.txt." – SergiyKolesnikov Dec 23 '16 at 22:02
  • @jfs Thanks for sharing this solution. It worked for me. – Lead Developer Jan 19 '21 at 07:59
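
For the Turkic case raised in the comments above, one possible casefold wrapper looks like the sketch below. It is an assumption-laden illustration: turkic_casefold is a hypothetical helper, it applies only the two 'T' mappings from CaseFolding.txt (İ to i, I to ı), and it should only be used on text known to be in a Turkic language.

```python
def turkic_casefold(text: str) -> str:
    """Hypothetical wrapper: apply the Turkic 'T' folding mappings, then casefold()."""
    # U+0130 (İ) folds to "i"; "I" folds to U+0131 (ı).
    tailored = text.replace("\u0130", "i").replace("I", "\u0131")
    return tailored.casefold()


upper = "L\u0130MANI"   # LİMANI
lower = "liman\u0131"   # limanı

# Default folding turns İ into "i" + U+0307 and I into "i", so the strings differ.
print(upper.casefold() == lower.casefold())

# With the Turkic tailoring applied first, they match.
print(turkic_casefold(upper) == turkic_casefold(lower))
```

A library with locale-aware folding, such as PyICU mentioned above, is likely the more robust choice for production use.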
10

You can use the casefold() method. It returns a case-folded copy of the string, so comparing the folded copies ignores case.

firstString = "Hi EVERYONE"
secondString = "Hi everyone"

if firstString.casefold() == secondString.casefold():
    print('The strings are equal.')
else:
    print('The strings are not equal.')

Output:

The strings are equal.
mpriya
9

I saw this solution here using regex.

import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
    print('match')  # the search succeeds, so this branch runs

It works well with accents

In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

However, it does not match all Unicode characters case-insensitively; for example, "ß" is not matched against "SS". Thank you @Rhymoid for pointing that out; my understanding was that it needs the exact symbol for the match to succeed. The output is as follows:

In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:
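
Note that re.search() tests for a substring. If the goal is to compare two whole strings, a sketch along these lines may be closer to what the question asks; regex_caseless_equal is a hypothetical helper name, and it inherits the "ß" limitation shown above.

```python
import re


def regex_caseless_equal(left: str, right: str) -> bool:
    """Whole-string, case-insensitive comparison using the re module.

    re.escape() makes `left` match literally, and re.fullmatch() requires
    the entire `right` string to match, unlike re.search().
    """
    return re.fullmatch(re.escape(left), right, flags=re.IGNORECASE) is not None


print(regex_caseless_equal("mandy", "Mandy"))        # True
print(regex_caseless_equal("mandy", "Mandy Pande"))  # False: not the whole string
print(regex_caseless_equal("ß", "SS"))               # False, as in the session above
```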
Community
Shiwangi
  • 7
    The fact that `ß` is not found within `SS` with case-insensitive search is evidence that it **doesn't work** with Unicode characters **at all**. –  Aug 19 '16 at 12:34
4

The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>> 
Andru Luvisi
2

How about converting to lowercase first? You can use string.lower().

Camilo Díaz Repka
  • 5
    You cannot compare their lowercase maps: `Σίσυφος` and `ΣΊΣΥΦΟΣ` would not test equivalent, but should. – tchrist Jul 19 '12 at 14:27
2

A clean solution that I found, for a case where I'm working with some constant file extensions:

from pathlib import Path


class CaseInsitiveString(str):
   def __eq__(self, __o: str) -> bool:
      return self.casefold() == __o.casefold()

GZ = CaseInsitiveString(".gz")
ZIP = CaseInsitiveString(".zip")
TAR = CaseInsitiveString(".tar")

path = Path("/tmp/ALL_CAPS.TAR.GZ")

GZ in path.suffixes, ZIP in path.suffixes, TAR in path.suffixes, TAR == ".tAr"

# (True, False, True, True)
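
If such strings also need to work as dictionary keys or set members, __hash__ has to agree with __eq__. The following is a minimal sketch of that idea under the hypothetical name CaseInsensitiveStr; it is a variation on the class above, not a drop-in replacement for it.

```python
class CaseInsensitiveStr(str):
    """Hypothetical str subclass whose equality and hash ignore case."""

    def __eq__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        return self.casefold() == other.casefold()

    def __ne__(self, other):
        # str defines its own __ne__, so it must be overridden as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Equal objects must hash equally, so hash the folded form.
        return hash(self.casefold())


extensions = {CaseInsensitiveStr(".gz"): "gzip", CaseInsensitiveStr(".zip"): "zip"}

print(extensions[CaseInsensitiveStr(".GZ")])
print(CaseInsensitiveStr(".TAR") in {CaseInsensitiveStr(".tar")})
```

Lookup keys have to be wrapped in the class too: a plain ".GZ" hashes by its exact spelling and would usually miss the entry.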
Jason Leaver
  • 1
    Thanks for this! This is a great trick for getting Python "builtins" to work, like list.index() and "in list" to work. – aghast Aug 23 '22 at 03:07
  • Would anything else need to be implemented, for Case Insensitive Strings to work nicely as dictionary keys? – Ryan Leach Nov 16 '22 at 09:55
  • Yeah, you would need to define the `__hash__` method, in which case you are probably better off using a `class StrEnum(str, Enum): ...` – Jason Leaver Nov 17 '22 at 14:10
0

With pandas, you can pass case=False to Series.str.contains():

data['Column_name'].str.contains('abcd', case=False)
mpriya
0
def search_specificword(key, stng):
    key = key.lower()
    stng = stng.lower()
    flag_present = False
    if stng.startswith(key+" "):
        flag_present = True
    symb = [',','.']
    for i in symb:
        if stng.find(" "+key+i) != -1:
            flag_present = True
    if key == stng:
        flag_present = True
    if stng.endswith(" "+key):
        flag_present = True
    if stng.find(" "+key+" ") != -1:
        flag_present = True
    print(flag_present)
    return flag_present

Output:

search_specificword("Affordable housing", "to the core of affordable outHousing in europe")
False

search_specificword("Affordable housing", "to the core of affordable Housing, in europe")
True

zackakshay
0
from re import search, escape, IGNORECASE

def is_string_match(word1, word2):
    # Case-insensitively check whether word1 occurs as a whole word in word2
    # word1: string
    # word2: string | list of strings

    # Escape word1 so that regex metacharacters in it are matched literally
    pattern = rf'\b{escape(word1)}\b'

    # if word2 is a list, check each of its entries
    if isinstance(word2, list):
        for word in word2:
            if search(pattern, word, IGNORECASE):
                return True
        return False

    # otherwise check the single string
    if search(pattern, word2, IGNORECASE):
        return True
    return False

is_match_word = is_string_match("Hello", "hELLO")
# True

is_match_word = is_string_match("Hello", ["Bye", "hELLO", "@vagavela"])
# True

is_match_word = is_string_match("Hello", "Bye")
# False
0

Consider using FoldedCase from jaraco.text:

>>> from jaraco.text import FoldedCase
>>> FoldedCase('Hello World') in ['hello world']
True

And if you want a dictionary keyed on text irrespective of case, use FoldedCaseKeyedDict from jaraco.collections:

>>> from jaraco.collections import FoldedCaseKeyedDict
>>> d = FoldedCaseKeyedDict()
>>> d['heLlo'] = 'world'
>>> list(d.keys()) == ['heLlo']
True
>>> d['hello'] == 'world'
True
>>> 'hello' in d
True
>>> 'HELLO' in d
True
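
If adding a dependency is not an option, similar behaviour can be approximated with the standard library. This is a rough sketch only, not how jaraco.collections is implemented; CaseFoldedDict is a hypothetical name and the keys are assumed to be str.

```python
from collections.abc import MutableMapping


class CaseFoldedDict(MutableMapping):
    """Hypothetical mapping that looks keys up by their casefolded form
    while remembering the spelling used when each key was first set."""

    def __init__(self, *args, **kwargs):
        self._store = {}  # casefolded key -> (original key, value)
        self.update(dict(*args, **kwargs))

    def __setitem__(self, key, value):
        folded = key.casefold()
        # Keep the first spelling seen for this key, replace only the value.
        original = self._store.get(folded, (key, None))[0]
        self._store[folded] = (original, value)

    def __getitem__(self, key):
        return self._store[key.casefold()][1]

    def __delitem__(self, key):
        del self._store[key.casefold()]

    def __iter__(self):
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


d = CaseFoldedDict()
d['heLlo'] = 'world'
print(d['HELLO'], 'hello' in d, list(d))
```

Unlike a str subclass, this lets callers use ordinary Python strings for lookups, which is what the question asks for.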
Jason R. Coombs
-3
def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print "Please only pass strings into this method."
        print "You passed a %s and %s" % (s1.__class__, s2.__class__)
Patrick Harrington
  • 6
    You are replacing an exception with a message printed to stdout, then returning None, which is falsy. That is very unhelpful in practice. – gerrit Jun 12 '17 at 18:10
-3

This is another regex approach, which I have learned to love/hate over the last week, so I usually import re under a name that reflects how I'm feeling (in this case, yes). Write a normal function that asks for input, then compile a pattern with something = re.compile(r'foo|spam', yes.I). re.I (yes.I below) is the same as IGNORECASE, but there is less to mistype.

You then search your message using the regex. Regexes deserve a few pages of their own, but the point is that foo and spam are piped together as alternatives and case is ignored. If either is found, lost_n_found holds the match; if neither is found, lost_n_found is None. If it is not None, return the matched text in lower case using "return lost_n_found.group().lower()".

This makes it much easier to match up anything that would otherwise be case sensitive. Lastly, (NCS) stands for "no one cares seriously...!" or "not case sensitive", whichever you prefer.

If anyone has any questions, get me on this.

    import re as yes

    def bar_or_spam():

        # input() is Python 3; use raw_input() on Python 2
        message = input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ")

        # yes.I is re.IGNORECASE: match foo or spam in any mix of cases
        message_in_coconut = yes.compile(r'foo|spam', yes.I)

        # search() returns None when nothing matches, so test before .group()
        lost_n_found = message_in_coconut.search(message)

        if lost_n_found is not None:
            return lost_n_found.group().lower()
        else:
            print ("Make tea not love")
            return

    whatz_for_breakfast = bar_or_spam()

    if whatz_for_breakfast == 'foo':
        print ("BaR")

    elif whatz_for_breakfast == 'spam':
        print ("EgGs")