53

What is the easiest way to compare strings in Python, ignoring case?

Of course one can do (str1.lower() <= str2.lower()), etc., but this created two additional temporary strings (with the obvious alloc/g-c overheads).

I guess I'm looking for an equivalent to C's stricmp().

[Some more context requested, so I'll demonstrate with a trivial example:]

Suppose you want to sort a looong list of strings. You simply do theList.sort(). This is O(n * log(n)) string comparisons and no memory management (since all strings and list elements are some sort of smart pointers). You are happy.

Now, you want to do the same, but ignore the case (let's simplify and say all strings are ascii, so locale issues can be ignored). You can do theList.sort(key=lambda s: s.lower()), but then you cause two new allocations per comparison, plus burden the garbage-collector with the duplicated (lowered) strings. Each such memory-management noise is orders-of-magnitude slower than simple string comparison.

Now, with an in-place stricmp()-like function, you do: theList.sort(cmp=stricmp) and it is as fast and as memory-friendly as theList.sort(). You are happy again.

The problem is any Python-based case-insensitive comparison involves implicit string duplications, so I was expecting to find a C-based comparisons (maybe in module string).

Could not find anything like that, hence the question here. (Hope this clarifies the question).

tzot
  • 92,761
  • 29
  • 141
  • 204
Paul Oyster
  • 2,237
  • 3
  • 18
  • 13
  • Php Equivalent: strcasecmp - http://nl3.php.net/strcasecmp – fijter Sep 15 '08 at 13:04
  • 4
    your assumptions are wrong. list.sort() with a key= does _not_ mean "two new allocations per comparison". (list.sort with the cmp=, on the other hand _does_ call the argument for each comparison) –  Sep 23 '08 at 13:45
  • attempted to rename the question from `Ignore case in python strings` to `What's closest to stricmp in Python for 7-bit ascii string comparison?` to more accurately reflect the op's actual question. main problem: unicode is also 'string' but this question would get them *totally* wrong; see comments by tchrist – n611x007 Apr 15 '14 at 12:14
  • related: [How do I case fold a string in Python 2?](http://stackoverflow.com/q/18271077/4279) – jfs Sep 29 '15 at 08:35

16 Answers16

74

Here is a benchmark showing that using str.lower is faster than the accepted answer's proposed method (libc.strcasecmp):

#!/usr/bin/env python2.7
import random
import timeit

from ctypes import *
libc = CDLL('libc.dylib') # change to 'libc.so.6' on linux

with open('/usr/share/dict/words', 'r') as wordlist:
    words = wordlist.read().splitlines()
random.shuffle(words)
print '%i words in list' % len(words)

setup = 'from __main__ import words, libc; gc.enable()'
stmts = [
    ('simple sort', 'sorted(words)'),
    ('sort with key=str.lower', 'sorted(words, key=str.lower)'),
    ('sort with cmp=libc.strcasecmp', 'sorted(words, cmp=libc.strcasecmp)'),
]

for (comment, stmt) in stmts:
    t = timeit.Timer(stmt=stmt, setup=setup)
    print '%s: %.2f msec/pass' % (comment, (1000*t.timeit(10)/10))

typical times on my machine:

235886 words in list
simple sort: 483.59 msec/pass
sort with key=str.lower: 1064.70 msec/pass
sort with cmp=libc.strcasecmp: 5487.86 msec/pass

So, the version with str.lower is not only the fastest by far, but also the most portable and pythonic of all the proposed solutions here. I have not profiled memory usage, but the original poster has still not given a compelling reason to worry about it. Also, who says that a call into the libc module doesn't duplicate any strings?

NB: The lower() string method also has the advantage of being locale-dependent. Something you will probably not be getting right when writing your own "optimised" solution. Even so, due to bugs and missing features in Python, this kind of comparison may give you wrong results in a unicode context.

  • 3
    Of course memory is an issue, since more than 99.9% of the .lower() time is memory allocation. Also, on the (windows) machines I checked, the key=_stricmp approach was 4-5 times faster, and with no memory pnalty. – Paul Oyster Oct 07 '08 at 12:17
  • 4
    4-5 times faster than the .lower-method would mean that it is 2 times faster than the simple sort case. how can that be?!? –  Oct 10 '08 at 19:21
  • @hop all the words in the word list you test on are already lowercased. That might give you results that are far from Paul's. – Virgil Dupras Jan 16 '10 at 19:08
  • @hop again: nevermind. I tried sorting the same list with str.upper, and the results are about the same. – Virgil Dupras Jan 16 '10 at 19:13
  • @virgil: also, in my locale more than two thirds of all words start with a capital letter ;) –  Jan 16 '10 at 19:21
  • @hop the analysis you presented is really nice. this show (especially to me) that sometimes its better not to count cycles and bytes in mind. – Xolve Nov 13 '10 at 13:08
  • A question for clarification: what do you mean by `str.lower` is locale-dependent? I'm asking because there is no other mention of locale-related stuff in your answer, like e.g. calling `locale.setlocale`. – tzot Mar 22 '11 at 22:05
  • @ΤΖΩΤΖΙΟΥ: from the docs: "For 8-bit strings, this method is locale-dependent." i don't need to setlocale explicitely, since it is handled by my system configuration. –  Mar 22 '11 at 23:22
  • @hop: my own benchmark shows the libc method to be 3x faster than .lower(). What you're actually demonstrating is that sorting with key is more efficient than sorting with cmp. – bukzor May 17 '11 at 21:54
  • @hop: then your statement "str.lower is the fastest by far" is pretty misleading. – bukzor May 18 '11 at 17:12
  • 3
    This is also wrong, because if you don’t use Unicode casefolds, you get all kinds of wrong answers. – tchrist Aug 11 '11 at 16:38
  • @tchrist: well, yes, but 1) that applies to each and every solution and 2) with strcasecmp you can't even fix that problem. –  Aug 11 '11 at 21:42
  • 2
    @hop: Check [bugs.python.org](http://bugs.python.org/issue?%40columns=id%2Cactivity%2Ctitle%2Ccreator%2Cassignee%2Cstatus%2Ctype&%40sort=-activity&%40filter=status&%40action=searchid&ignore=file%3Acontent&%40search_text=unicode&submit=search&status=-1%2C1%2C2%2C3) for Unicode bugs. I’ve just put up a bunch of test cases showing where Python messes up by not using casefolding. If I have to choose between fast and correct, I know which one I’ll pick every single time. – tchrist Aug 11 '11 at 21:54
  • 1
    @tchrist: i don't /quite/ get where you are going with this. what's your concrete proposed solution? –  Aug 12 '11 at 09:55
  • 1
    @hop: There needs to be a string method that provides the Unicode casefold, then you just do a simple comparison on that. Don’t mess around with ugly antiportable locales, gosh they’re as bad as code GAG pages in µsløth. `string.lower()`, `string.title()`, `string.upper()` you have already, but you need `string.fold()` so you can do `str1.fold() == str2.fold()`. It’s missing from the Python API. Notice how ICU has it. That’s because it’s what you need; it’s *how* you do case insensitive comparisons on Unicode text without diving into the Unicode Collation Algorithm. – tchrist Aug 12 '11 at 10:31
  • @tchrist: is there also a _real_ solution you can offer? some of us have to ship code _now_. i really appreciate your knowledge and engagement regarding this topic, but using `key=` with your `fold()` is still better than using `cmp=` with your `fold()`, and PyICU will still be better than hacking something together with ctypes, so what? is? your? point? –  Aug 12 '11 at 13:30
  • @tchrist: OK, since you had to get personal, this is the last I will say on this. The question was about concerns regarding memory allocations, this answer is about showing that optimising for string duplication will not improve but lessen performance, and YOU don't even have a shred of evidence that there is human language involved in any of this. The OP could be comparing base64 strings for all you know. So stop being so sanctimonious, please. –  Aug 12 '11 at 17:26
  • 7
    Its probably best to avoid referring to someone else's answer as "stupid". – Chris Dutrow Nov 09 '11 at 17:46
  • @DutrowLLC: yeah, you're probably not right. –  Nov 10 '11 at 14:07
  • 1
    This test appears to be completely omitting the time that the `.lower()` version spends on the garbage collector.. right? – Izkata Dec 21 '12 at 19:20
  • @Izkata: yes. [`Timer.timeit` disables the garbage collection during the benchmark](https://hg.python.org/cpython/file/2.7/Lib/timeit.py#l178). – jfs Jan 11 '15 at 13:41
  • Enabeling the gc makes the relative performance of the libc method even worse. I've updated my answer to run timeit() with gc enabled. –  Jan 15 '15 at 13:53
7

Are you using this compare in a very-frequently-executed path of a highly-performance-sensitive application? Alternatively, are you running this on strings which are megabytes in size? If not, then you shouldn't worry about the performance and just use the .lower() method.

The following code demonstrates that doing a case-insensitive compare by calling .lower() on two strings which are each almost a megabyte in size takes about 0.009 seconds on my 1.8GHz desktop computer:

from timeit import Timer

s1 = "1234567890" * 100000 + "a"
s2 = "1234567890" * 100000 + "B"

code = "s1.lower() < s2.lower()"
time = Timer(code, "from __main__ import s1, s2").timeit(1000)
print time / 1000   # 0.00920499992371 on my machine

If indeed this is an extremely significant, performance-critical section of code, then I recommend writing a function in C and calling it from your Python code, since that will allow you to do a truly efficient case-insensitive search. Details on writing C extension modules can be found here: https://docs.python.org/extending/extending.html

twasbrillig
  • 17,084
  • 9
  • 43
  • 67
Eli Courtwright
  • 186,300
  • 67
  • 213
  • 256
  • 3
    so this is how you pass stuff to the Timer class. thanks for solving a very different itch of mine :) – Manav Mar 24 '11 at 13:32
  • 5
    This is completely wrong. It fails to detect that *ΣΤΙΓΜΑΣ* and *στιγμας* are the same case insenstively. You must not use casemapping to compare case in Unicode. You must use casefolding. These are different things. *Σ, σ, ς* are all the same, just as *S, ſ, s* (what is it with s’s anyway? :) and *Μ, μ, µ* are. There are innumerably may other similar circumstances, like how *weiß, WEIẞ, weiss, WEISS* are all the same too, or *efficient, efficient.* You **must use casefolds,** because casemaps don’t work. – tchrist Aug 11 '11 at 16:47
7

Your question implies that you don't need Unicode. Try the following code snippet; if it works for you, you're done:

Python 2.5.2 (r252:60911, Aug 22 2008, 02:34:17)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US")
'en_US'
>>> sorted("ABCabc", key=locale.strxfrm)
['a', 'A', 'b', 'B', 'c', 'C']
>>> sorted("ABCabc", cmp=locale.strcoll)
['a', 'A', 'b', 'B', 'c', 'C']

Clarification: in case it is not obvious at first sight, locale.strcoll seems to be the function you need, avoiding the str.lower or locale.strxfrm "duplicate" strings.

tzot
  • 92,761
  • 29
  • 141
  • 204
  • 4
    The global setting of locale.setlocale() is obviously an overkill (way too global). – Paul Oyster Sep 16 '08 at 08:24
  • I don't know what the "obvious overkill" is, and the "global" setting can be as localized as you like (except if you work with threads and need some threads localized and some not, for some reason). – tzot Oct 29 '08 at 23:52
  • 1
    This is the only solution that produces results that can interoperate correctly with case insensitive utilities such as Unix sort with the -f option. For example, str.lower causes A_ to sort before AA. – Neil Mayhew Jan 09 '11 at 01:38
  • 3
    You cannot use POSIX locales and strcoll, because it is unreliable across platforms. You must use Unicode casefolds, which are guaranteed to work the same everywhere. – tchrist Aug 11 '11 at 16:39
6

I can't find any other built-in way of doing case-insensitive comparison: The python cook-book recipe uses lower().

However you have to be careful when using lower for comparisons because of the Turkish I problem. Unfortunately Python's handling for Turkish Is is not good. ı is converted to I, but I is not converted to ı. İ is converted to i, but i is not converted to İ.

Douglas Leeder
  • 52,368
  • 9
  • 94
  • 137
  • 4
    Python doesn’t handle Unicode very robustly, as you have seen. The casemaps don’t pay attention to these things. Very sad. – tchrist Aug 11 '11 at 16:50
3

There's no built in equivalent to that function you want.

You can write your own function that converts to .lower() each character at a time to avoid duplicating both strings, but I'm sure it will very cpu-intensive and extremely inefficient.

Unless you are working with extremely long strings (so long that can cause a memory problem if duplicated) then I would keep it simple and use

str1.lower() == str2.lower()

You'll be ok

BenMorel
  • 34,448
  • 50
  • 182
  • 322
Ricardo Reyes
  • 13,256
  • 4
  • 27
  • 19
  • 2
    "Never say never" :) "There is no built in equivalent" is absolute; "I know of no built in equivalent" would be closer to the truth. locale.strcoll, given a case-insensitive LC_COLLATE (as 'en_US' is), is a built-in. – tzot Sep 15 '08 at 22:29
  • 2
    This answer is wrong. The only correct way is `str1.fold() == str2.fold()`, but that requires an extension to the default python string class that supports the full Unicode casefold of a string. It’s a missing function. – tchrist Aug 11 '11 at 21:56
  • @tchrist unclearr: is there such an extension available? – n611x007 Apr 15 '14 at 12:05
2

When something isn't supported well in the standard library, I always look for a PyPI package. With virtualization and the ubiquity of modern Linux distributions, I no longer avoid Python extensions. PyICU seems to fit the bill: https://stackoverflow.com/a/1098160/3461

There now is also an option that is pure python. It's well tested: https://github.com/jtauber/pyuca


Old answer:

I like the regular expression solution. Here's a function you can copy and paste into any function, thanks to python's block structure support.

def equals_ignore_case(str1, str2):
    import re
    return re.match(re.escape(str1) + r'\Z', str2, re.I) is not None

Since I used match instead of search, I didn't need to add a caret (^) to the regular expression.

Note: This only checks equality, which is sometimes what is needed. I also wouldn't go so far as to say that I like it.

Community
  • 1
  • 1
Benjamin Atkin
  • 14,071
  • 7
  • 61
  • 60
  • [I wish there was a virtual rubber-stamp for this] Don't use `$`, use `\Z`. Read the fantastic manual to find out what `$` actually does; don't rely on legend or guesswork or whatever. – John Machin Apr 26 '10 at 03:54
  • I changed it. I also turned on the community wiki feature for my answer. Thanks. – Benjamin Atkin Apr 26 '10 at 04:38
  • Good only for equality testing, which isn't as quite the same thing as comparing two strings and determining whether one is less-than, equal-to, or greater-than the other. – martineau Aug 29 '14 at 14:09
  • @martineau thanks. I added a note, and also did some searching and found a solution that I think I'd be more comfortable with, and updated my answer with it. It isn't a full answer, though. Hopefully someone (myself if I get around to it) will learn how one of these libraries works and provide a code sample. – Benjamin Atkin Sep 03 '14 at 19:10
  • 1
    Yes, it sounds like the pyuca (Python Unicode Collation Algorithm) extension might work because the report it's based on -- the [Unicode Collation Algorithm (UCA)](http://unicode.org/reports/tr10/) -- says "Case differences (uppercase versus lowercase), are typically ignored". – martineau Sep 03 '14 at 20:16
2

This question is asking 2 very different things:

  1. What is the easiest way to compare strings in Python, ignoring case?
  2. I guess I'm looking for an equivalent to C's stricmp().

Since #1 has been answered very well already (ie: str1.lower() < str2.lower()) I will answer #2.

def strincmp(str1, str2, numchars=None):
    result = 0
    len1 = len(str1)
    len2 = len(str2)
    if numchars is not None:
        minlen = min(len1,len2,numchars)
    else:
        minlen = min(len1,len2)
    #end if
    orda = ord('a')
    ordz = ord('z')

    i = 0
    while i < minlen and 0 == result:
        ord1 = ord(str1[i])
        ord2 = ord(str2[i])
        if ord1 >= orda and ord1 <= ordz:
            ord1 = ord1-32
        #end if
        if ord2 >= orda and ord2 <= ordz:
            ord2 = ord2-32
        #end if
        result = cmp(ord1, ord2)
        i += 1
    #end while

    if 0 == result and minlen != numchars:
        if len1 < len2:
            result = -1
        elif len2 < len1:
            result = 1
        #end if
    #end if

    return result
#end def

Only use this function when it makes sense to as in many instances the lowercase technique will be superior.

I only work with ascii strings, I'm not sure how this will behave with unicode.

trevorcroft
  • 167
  • 1
  • 3
1

This is how you'd do it with re:

import re
p = re.compile('^hello$', re.I)
p.match('Hello')
p.match('hello')
p.match('HELLO')
Moses Ting
  • 152
  • 1
  • 2
  • 7
  • Case-insensitive regular expressions can only be used for equality tests (True/False), not comparison (less than/equal/greater than) – tzot Sep 15 '08 at 22:26
1

The recommended idiom to sort lists of values using expensive-to-compute keys is to the so-called "decorated pattern". It consists simply in building a list of (key, value) tuples from the original list, and sort that list. Then it is trivial to eliminate the keys and get the list of sorted values:

>>> original_list = ['a', 'b', 'A', 'B']
>>> decorated = [(s.lower(), s) for s in original_list]
>>> decorated.sort()
>>> sorted_list = [s[1] for s in decorated]
>>> sorted_list
['A', 'a', 'B', 'b']

Or if you like one-liners:

>>> sorted_list = [s[1] for s in sorted((s.lower(), s) for s in original_list)]
>>> sorted_list
['A', 'a', 'B', 'b']

If you really worry about the cost of calling lower(), you can just store tuples of (lowered string, original string) everywhere. Tuples are the cheapest kind of containers in Python, they are also hashable so they can be used as dictionary keys, set members, etc.

Antoine P.
  • 4,181
  • 1
  • 24
  • 17
  • tupples are cheap, but the duplication of strings is not... – Paul Oyster Sep 16 '08 at 08:25
  • 2
    this is also what python's sort with the key= argument does. –  Sep 23 '08 at 14:28
  • 1
    This is a 7-bit mindset that is wholly inappropriate for Unicode data. You must either use the full Unicode casefold, or else the primary collation strength per the Unicode Collation Algorithm. Yes, that means new copies of the string either way, but at least then you can do a binary comparison instead of having to rummage through the tables for each code point. – tchrist Aug 11 '11 at 16:50
0

Just use the str().lower() method, unless high-performance is important - in which case write that sorting method as a C extension.

"How to write a Python Extension" seems like a decent intro..

More interestingly, This guide compares using the ctypes library vs writing an external C module (the ctype is quite-substantially slower than the C extension).

dbr
  • 165,801
  • 69
  • 278
  • 343
0
import re
if re.match('tEXT', 'text', re.IGNORECASE):
    # is True
Venkatesh Bachu
  • 2,348
  • 1
  • 18
  • 28
0

I'm pretty sure you either have to use .lower() or use a regular expression. I'm not aware of a built-in case-insensitive string comparison function.

Mark Biek
  • 146,731
  • 54
  • 156
  • 201
0

For occasional or even repeated comparisons, a few extra string objects shouldn't matter as long as this won't happen in the innermost loop of your core code or you don't have enough data to actually notice the performance impact. See if you do: doing things in a "stupid" way is much less stupid if you also do it less.

If you seriously want to keep comparing lots and lots of text case-insensitively you could somehow keep the lowercase versions of the strings at hand to avoid finalization and re-creation, or normalize the whole data set into lowercase. This of course depends on the size of the data set. If there are a relatively few needles and a large haystack, replacing the needles with compiled regexp objects is one solution. If It's hard to say without seeing a concrete example.

0

You could translate each string to lowercase once --- lazily only when you need it, or as a prepass to the sort if you know you'll be sorting the entire collection of strings. There are several ways to attach this comparison key to the actual data being sorted, but these techniques should be addressed in a separate issue.

Note that this technique can be used not only to handle upper/lower case issues, but for other types of sorting such as locale specific sorting, or "Library-style" title sorting that ignores leading articles and otherwise normalizes the data before sorting it.

Dale Wilson
  • 9,166
  • 3
  • 34
  • 52
  • the question is more general than the example itself (actually, in real life scenarios you don't want to be bothered by attaching a lowercase version to every string that might need icmp() later), but even in this trivial example, you don't want to double the memory only to be able to sort... – Paul Oyster Sep 15 '08 at 20:15
-1

You could subclass str and create your own case-insenstive string class but IMHO that would be extremely unwise and create far more trouble than it's worth.

David Webb
  • 190,537
  • 57
  • 313
  • 299
-11

In response to your clarification...

You could use ctypes to execute the c function "strcasecmp". Ctypes is included in Python 2.5. It provides the ability to call out to dll and shared libraries such as libc. Here is a quick example (Python on Linux; see link for Win32 help):

from ctypes import *
libc = CDLL("libc.so.6")  // see link above for Win32 help
libc.strcasecmp("THIS", "this") // returns 0
libc.strcasecmp("THIS", "THAT") // returns 8

may also want to reference strcasecmp documentation

Not really sure this is any faster or slower (have not tested), but it's a way to use a C function to do case insensitive string comparisons.

~~~~~~~~~~~~~~

ActiveState Code - Recipe 194371: Case Insensitive Strings is a recipe for creating a case insensitive string class. It might be a bit over kill for something quick, but could provide you with a common way of handling case insensitive strings if you plan on using them often.

patrickyoung
  • 126
  • 4
  • i know this recipe well, but behind the scenes it simply have a lowercased duplicate for every string, which is no good (as explained in the trivial example I added) – Paul Oyster Sep 15 '08 at 20:37
  • The ctype solution is what I was looking for, thanks. For reference, here is the win32 code: from ctypes import * clib = cdll.LoadLibrary("msvcrt") theList = ["abc","ABC","def","DEF"] * 1000000 theList.sort(cmp = clib._stricmp) – Paul Oyster Sep 16 '08 at 08:27
  • 2
    this is much slower. see my answer! –  Sep 23 '08 at 14:11
  • 3
    I believe this gives the wrong answer for strings with nulls in them. – Darius Bacon Oct 29 '08 at 16:28
  • 6
    No, this is all wrong. The only correct solution compares their Unicode casefolds. Otherwise you will screw up. – tchrist Aug 11 '11 at 16:37
  • @tchrist as a side note, the string 'unicode' is not in the question. Nevertheless, your comment that seem to point anyone coming here with a hope for unicode in the right direction are invaluable. I attempt to edit the question title to more accurately reflect the question. – n611x007 Apr 15 '14 at 12:09