102

When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?

It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.

In particular, I would like to know:

  • Is there a way to optimize ToUpper or ToLower such that one is faster than the other?
  • Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?
  • Are there any programming environments (e.g. C, C#, Python, whatever) where one case is clearly better than the other, and why?
Parappa

9 Answers

100

Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.

MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.

EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(
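
For illustration, here is a minimal sketch of what using a StringComparer can look like, assuming ordinal (culture-independent) matching is what you want. The names `CaseInsensitiveDemo`, `SameCommand` and `BuildLookup` are made up for this example; only `StringComparer.OrdinalIgnoreCase` and `Dictionary` come from .NET.

```csharp
using System;
using System.Collections.Generic;

public static class CaseInsensitiveDemo
{
    // Equality test that never creates upper- or lower-cased copies.
    public static bool SameCommand(string a, string b)
    {
        return StringComparer.OrdinalIgnoreCase.Equals(a, b);
    }

    // A dictionary whose keys are hashed and matched case-insensitively.
    public static Dictionary<string, int> BuildLookup()
    {
        var lookup = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        lookup["Mail"] = 1;
        return lookup;
    }

    public static void Main()
    {
        Console.WriteLine(SameCommand("MAIL", "mail"));        // True
        Console.WriteLine(BuildLookup().ContainsKey("mAiL"));  // True
    }
}
```

Where a `switch` on a converted string would otherwise be tempting, a chain of `if`/`else` calls to something like `SameCommand` does the same job without picking a case. If culture-sensitive matching is needed, a different comparer would have to be chosen.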

Jon Skeet
  • Yes StringComparer is great, but the question wasn't answered... In situations where you can't use StringComparer such as a switch statement against a string; should I ToUpper or ToLower in the switch? – joshperry Feb 22 '09 at 19:51
  • Use a StringComparer and "if"/"else" instead of using either ToUpper or ToLower. – Jon Skeet Feb 22 '09 at 20:50
  • John, I know that converting to lower case is incorrect, but I had not heard that converting to uppercase is incorrect. Can you offer an example or a reference? The MSDN article you linked to says this: "Comparisons made using OrdinalIgnoreCase are behaviorally the composition of two calls: calling ToUpperInvariant on both string arguments, and doing an Ordinal comparison." In the section titled "Ordinal String Operations", it restates this in code. – Neil Mar 18 '11 at 15:59
  • That said, I almost always prefer the StringComparer options for performance reasons. – Neil Mar 18 '11 at 16:00
  • @Neil: Interesting, I hadn't seen that bit. For an *ordinal* case-insensitive comparison, I guess that's fair enough. It's got to pick *something*, after all. For culturally-sensitive case-insensitive comparisons, I think there'd still be room for some odd behaviour. Will point out your comment in the answer... – Jon Skeet Mar 18 '11 at 16:02
  • Thanks for the quick response, John. – Neil Mar 18 '11 at 16:05
  • While this answer gives the author a solution, it also dodges the question. The author is seeking information to make an informed choice between ToUpper and ToLower for performance concerns. While offering an alternative 3rd choice (StringComparer) is legitimate, it should at least be framed in context of its performance (not just its correctness) relative to the other two choices. There's no mention of performance in this answer. It would be better if it included something like "StringComparer.Compare is significantly faster (and more correct), even when comparing exclusively ASCII text." – Triynko Sep 15 '11 at 18:12
  • @Triynko: I think it's important to concentrate *primarily* on correctness, with the point that getting the wrong answer fast is usually no better (and is sometimes worse) than getting the wrong answer slowly. – Jon Skeet Sep 15 '11 at 18:19
  • If implementing a case-insensitive hash table, you'll need to choose either upper case or lower case. – Ian Boyd Jan 02 '13 at 20:27
  • @IanBoyd: Not necessarily. For example, in .NET you'd just create a `Dictionary` with something like `StringComparer.OrdinalIgnoreCase`. You only need to be able to test for case-insensitive-equality, and get an appropriate hash code which is consistent with that. – Jon Skeet Jan 02 '13 at 20:35
  • But if you're *creating* a hash list (as i had to because the language provided none), you have to `Hash` a case-neutral version (i.e. uppercase) – Ian Boyd Jan 02 '13 at 22:38
  • @IanBoyd You don't have to convert the case of your keys if you use a hash algorithm that gives the same result when two strings only differ in their casing. Notice that the StringComparer class includes a GetHashCode() method. – Neil Oct 09 '13 at 16:26
  • @NeilWhitaker But you forget, i was talking about *creating* a case-insensitive table. For example, the original question is language agnostic. i happen to mainly develop in a language without `Dictionary` and `StringComparer`, because those are in a language different than the language that i, or the original poster, are talking about. If you were implementing a hash table, in assembly, what algorithm would you use to create case-insensitive hash codes? If you were down to choosing between uppercasing and lowercasing, the correct answer is uppercasing. – Ian Boyd Oct 10 '13 at 01:59
  • One year late, I still would like to add this piece of information for any late reader like me: One would not turn the key to upper case, but during calculation of the hash key a COPY of it. So you keep your key, and hash values will be the same if just cases differ. – Aconcagua Sep 25 '14 at 14:29
  • A long time later... it is difficult in some contexts to use a `StringComparer` e.g. a LINQ to Objects `GroupBy` with an anonymous (multi-field) key. – NetMage Feb 20 '20 at 19:39
  • @NetMage: I agree, that makes it tricky. That doesn't make it *correct* to just upper-case or lower-case though :( – Jon Skeet Feb 20 '20 at 19:45
  • @Neil his name is Jon, not John. – David Klempfner Apr 02 '21 at 10:51
  • @DavidKlempfner: Thanks for the correction. I guess I'm so used to typing "John", I didn't even think about it :) – Neil Apr 03 '21 at 15:59
  • Tell me, **what is so “interesting” about Turkish** that you find the most widespread way of doing case-insensitive comparison incorrect? – Константин Ван May 09 '21 at 01:58
  • @Константин Ван: The Turkish "i" problem - see http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html (For example, a couple of decades ago I had code that failed due to `"mail".toUpperCase()` in Java not being "MAIL", when in Turkey.) – Jon Skeet May 09 '21 at 06:12
  • @JonSkeet I’ve just read the article. Well, that’s, rather interestingly nightmarish, I’d say. Didn’t know that; thank you. – Константин Ван May 09 '21 at 07:17
  • @КонстантинВан: Any reason you didn't read the article from the existing link that's been in the answer for over 12 years? Please don't edit an answer just to add another copy of a link that's already there. – Jon Skeet May 09 '21 at 07:34
36

From Microsoft on MSDN:

Best Practices for Using Strings in the .NET Framework

Recommendations for String Usage

Why? From Microsoft:

Normalize strings to uppercase

There is a small group of characters that when converted to lowercase cannot make a round trip.

What is an example of such a character that cannot make a round trip?

  • Start: Greek Rho Symbol (U+03f1) ϱ
  • Uppercase: Capital Greek Rho (U+03a1) Ρ
  • Lowercase: Small Greek Rho (U+03c1) ρ

ϱ , Ρ , ρ

.NET Fiddle

Original: ϱ
ToUpper: Ρ
ToLower: ρ

That is why, if you want to do case-insensitive comparisons, you convert the strings to uppercase, and not lowercase.

So if you have to choose one, choose Uppercase.

Ian Boyd
  • Back to answering the original question: There are languages knowing more than one lower case variant for one upper case variant. Unless you know the rules for when to use which representation (another example in Greek: small sigma letter, you use σ at word start or in the middle, ς at the word's end; see http://en.wikipedia.org/wiki/Sigma), you can't securely convert back to the lower case variant. – Aconcagua Sep 25 '14 at 14:44
  • Actually what about German 'ß', if you call `ToUpper()` it will turn into 'SS' on many systems. So this is actually not round-trip-able either. – Sebastian Aug 18 '16 at 02:41
  • If Microsoft has optimized the code for performing uppercase comparisons, is it because the ASCII codes for uppercase letters (65-90) have only two digits, while the ASCII codes for lowercase letters (97-122) run to three digits (and so need more processing)? – Medo Medo Dec 20 '16 at 09:50
  • It should be noted that both "ϱ" and "ς" return themselves from `ToUpperInvariant()`, so it would still be nice to see real examples why uppercase is better than lowercase – max630 Dec 10 '18 at 11:11
  • This answer does not appear to be relevant. According to the Microsoft link, this only matters when changing the *locale* of a string: *"To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters."* But the question does not involve converting to a different locale. – ToolmakerSteve Sep 03 '19 at 21:02
  • @ToolmakerSteve Which is why we have the **best practice** to use uppercase, and not lowercase - to avoid the exact problems you mentioned. Also it is relevant, because it matters even without changing the locale of a string. – Ian Boyd Sep 04 '19 at 13:36
  • ϱ != ρ, but if you use upper case, then wouldn't it essentially change both to Ρ, and then compare Ρ == Ρ which would be true, even though ϱ != ρ? – David Klempfner Apr 02 '21 at 10:59
  • @MedoMedo No because any number 0-128 takes the same number of bits in any modern computer - computers store numbers in binary, and mostly operate on fixed-width pieces of memory. The number of digits only matters to us humans, and in the surface area between computers and humans (like when a calculator program interprets `99` vs `100`, that's one more character to parse, but after the parsing is done, it's going to be the same size of integer internally, so all operations after that are the same speed). – mtraceur May 01 '22 at 02:39
20

According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:

String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase) is equivalent to (but faster than) calling

String.Compare(ToUpperInvariant(strA), ToUpperInvariant(strB), StringComparison.Ordinal).

These comparisons are still very fast.

Of course, if you are comparing one string over and over again then this may not hold.
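
To make the two forms concrete, here is a small sketch that puts them side by side. The class and method names (`OrdinalIgnoreCaseDemo`, `CompareDirect`, `CompareViaUpper`) are invented for this example; the sketch only shows that the results agree and makes no timing claims of its own.

```csharp
using System;

public static class OrdinalIgnoreCaseDemo
{
    // One call, no temporary strings: the comparison itself ignores case.
    public static int CompareDirect(string a, string b)
    {
        return String.Compare(a, b, StringComparison.OrdinalIgnoreCase);
    }

    // Allocates two upper-cased copies, then compares them ordinally.
    public static int CompareViaUpper(string a, string b)
    {
        return String.Compare(a.ToUpperInvariant(), b.ToUpperInvariant(),
                              StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(CompareDirect("Hello", "hELLO") == 0);   // True
        Console.WriteLine(CompareViaUpper("Hello", "hELLO") == 0); // True
    }
}
```

If one of the strings is compared over and over, caching its upper-cased form once and comparing with `StringComparison.Ordinal` is the case where the second form might pay off, as the caveat above suggests.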

boflynn
Rob Walker
12

Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).

In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.

If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.

Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)

warren
  • To answer the question why I need to know which is faster: I don't need to know, I merely want to know. :) It's simply a case of seeing somebody make a claim (such as "comparing upper case strings is faster!") and wanting to know whether it is really true and/or why they made that claim. – Parappa Oct 24 '08 at 18:06
  • that makes sense - I'm eternally curious on stuff like this, too :) – warren Oct 26 '08 at 21:11
  • With C strings, to convert `s` and `t` to arrays of longs such that the strings are equal iff the arrays are equal you have to walk down s and t until you find the terminating `'\0'` character (or else you might compare garbage past the end of the strings, which may be an illegal memory access that invokes undefined behavior). But then why not just do the comparisons while walking over the characters one by one? With C++ strings, you can probably get the length and `.c_str()`, cast to a `long *` and compare a prefix of length `.size() - .size()%sizeof(long)`. Looks a bit fishy to me, tho. – Jonas Kölker Jul 19 '17 at 11:13
  • @JonasKölker - loading the string into an array of `long`s *just for comparison's sake* would be foolish. But if you're doing it "a lot" - I could see a *possible* argument for it to be done. – warren Dec 28 '20 at 13:07
  • Please don't try to “fix” the grammar - especially by removing the apostrophe on “STL’s” that's not a plural: it's a possessive – warren Apr 06 '21 at 04:09
5

Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that the invariant version uses culture-independent (invariant) casing rules, so its result does not depend on the current culture. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant; otherwise the performance of invariant conversion shouldn't matter.

I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.

Dan Herbert
  • If Microsoft has optimized the code for performing uppercase comparisons, is it because the ASCII codes for uppercase letters (65-90) have only two digits, while the ASCII codes for lowercase letters (97-122) run to three digits (and so need more processing)? – Medo Medo Dec 20 '16 at 09:47
  • @Medo I don't remember the exact reasons for optimization, but 2 vs 3 digits is almost certainly not the reason since all letters are stored as binary numbers, so decimal digits don't really have meaning based on the way they are stored. – Dan Herbert Dec 28 '16 at 17:22
4

If you are doing string comparison in C# it is significantly faster to use .Equals() with a case-insensitive StringComparison (such as StringComparison.OrdinalIgnoreCase) instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that no extra memory is allocated for the 2 new upper/lower case strings.

Jon Tackabury
3

I wanted some actual data on this, so I pulled the full list of two-byte characters that fail with ToLower or ToUpper. I then ran the test below:

using System;

class Program {
   static void Main() {
      char[][] pairs = {
new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
      };
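      // Count how many pairs each conversion maps to the same character.
      int upper = 0, lower = 0;
      foreach (char[] pair in pairs) {
         // Print both code points of the pair in U+XXXX form.
         Console.Write(
            "U+{0:X4} U+{1:X4} pass: ",
            Convert.ToInt32(pair[0]),
            Convert.ToInt32(pair[1])
         );
         // The pair passes ToUpper if both characters upper-case to the same character.
         if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
            Console.Write("ToUpper ");
            upper++;
         } else {
            // Pad so that the ToLower column stays aligned.
            Console.Write("        ");
         }
         // Same check for lower-casing.
         if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
            Console.Write("ToLower");
            lower++;
         }
         Console.WriteLine();
      }
      Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
   }
}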

Results below. Note that I also tested with the Invariant versions, and the result was exactly the same. Interestingly, one of the pairs (U+03D1 and U+03F4, which appears twice in the list) fails with both. Based on this, ToUpper is the best option.

U+00E5 U+212B pass:         ToLower
U+00C5 U+212B pass:         ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass:         ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass:         ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass:         ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass:         ToLower
U+0398 U+03F4 pass:         ToLower
U+03B8 U+03F4 pass:         ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8
Zombo
0

It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.
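
As a rough illustration of the ASCII case, here is a minimal sketch. The helper names `ToUpperAscii` and `ToLowerAscii` are made up for this example and are not framework methods. It relies on the fact that in ASCII the letters 'A'..'Z' and 'a'..'z' differ only in the bit with value 0x20.

```csharp
using System;

public static class AsciiCase
{
    // Clear bit 0x20 for 'a'..'z'; every other character is returned unchanged.
    public static char ToUpperAscii(char c)
    {
        return (c >= 'a' && c <= 'z') ? (char)(c & ~0x20) : c;
    }

    // Set bit 0x20 for 'A'..'Z'; every other character is returned unchanged.
    public static char ToLowerAscii(char c)
    {
        return (c >= 'A' && c <= 'Z') ? (char)(c | 0x20) : c;
    }

    public static void Main()
    {
        Console.WriteLine(ToUpperAscii('q')); // Q
        Console.WriteLine(ToLowerAscii('Q')); // q
        Console.WriteLine(ToUpperAscii('7')); // 7
    }
}
```

Both directions cost a range check (two comparisons) and one bit operation, which is why neither is cheaper for plain ASCII. The sketch deliberately ignores everything outside ASCII.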

Adam Rosenfield
0

Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many have hinted, culture dependent and is not inherent in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.
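
The "few assignments" argument can be sketched with an in-place conversion over a char buffer that only writes when a character actually changes. This is a hypothetical illustration (the names `WriteCounter`, `ToLowerInPlace` and `ToUpperInPlace` are invented here), not how .NET's own `ToLower`/`ToUpper` work, since .NET strings are immutable and those methods return new strings.

```csharp
using System;

public static class WriteCounter
{
    // Lower-cases the buffer in place; returns how many elements were written.
    public static int ToLowerInPlace(char[] buffer)
    {
        int writes = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            char lowered = char.ToLowerInvariant(buffer[i]);
            if (lowered != buffer[i])
            {
                buffer[i] = lowered;
                writes++;
            }
        }
        return writes;
    }

    // Upper-cases the buffer in place; returns how many elements were written.
    public static int ToUpperInPlace(char[] buffer)
    {
        int writes = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            char raised = char.ToUpperInvariant(buffer[i]);
            if (raised != buffer[i])
            {
                buffer[i] = raised;
                writes++;
            }
        }
        return writes;
    }

    public static void Main()
    {
        Console.WriteLine(ToLowerInPlace("Hello World".ToCharArray())); // 2
        Console.WriteLine(ToUpperInPlace("Hello World".ToCharArray())); // 8
    }
}
```

For mostly lower-case text the lower-casing pass writes less, and the reverse holds for mostly upper-case text. Both passes still read and test every character, so any difference stays small.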

Clearer