14

EDIT 2:

Confirmed that my performance problems were due to the static function call to the StringExtensions class. Once removed, the IndexOf method is indeed the fastest way of accomplishing this.

What is the fastest, case insensitive, way to see if a string contains another string in C#? I see the accepted solution for the post here at Case insensitive 'Contains(string)' but I have done some preliminary benchmarking and it seems that using that method results in orders of magnitude slower calls on larger strings (> 100 characters) whenever the test string cannot be found.

Here are the methods I know of:

IndexOf:

public static bool Contains(this string source, string toCheck, StringComparison comp)
{
    if (string.IsNullOrEmpty(toCheck) || string.IsNullOrEmpty(source))
        return false;

    return source.IndexOf(toCheck, comp) >= 0;
} 

ToUpper:

source.ToUpper().Contains(toCheck.ToUpper());

Regex:

bool contains = Regex.Match("StRiNG to search", "string", RegexOptions.IgnoreCase).Success;

So my question is, which really is the fastest way on average and why so?

EDIT:

Here is my simple test app I used to highlight the performance difference. Using this, I see 16 ms for ToLower(), 18 ms for ToUpper and 140 ms for the StringExtensions.Contains():

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;

namespace ScratchConsole
{
    class Program
    {
    static void Main(string[] args)
    {
        string input = "";
        while (input != "exit")
        {
            RunTest();
            input = Console.ReadLine();
        }
    }

    static void RunTest()
    {
        List<string> s = new List<string>();
        string containsString = "1";
        bool found;
        DateTime now;
        for (int i = 0; i < 50000; i++)
        {
            s.Add("AAAAAAAAAAAAAAAA AAAAAAAAAAAA");
        }

        now = DateTime.Now;
        foreach (string st in s)
        {
            found = st.ToLower().Contains(containsString);
        }
        Console.WriteLine("ToLower(): " + (DateTime.Now - now).TotalMilliseconds);

        now = DateTime.Now;
        foreach (string st in s)
        {
            found = st.ToUpper().Contains(containsString);
        }
        Console.WriteLine("ToUpper(): " + (DateTime.Now - now).TotalMilliseconds);


        now = DateTime.Now;
        foreach (string st in s)
        {
            found = StringExtensions.Contains(st, containsString, StringComparison.OrdinalIgnoreCase);
        }
        Console.WriteLine("StringExtensions.Contains(): " + (DateTime.Now - now).TotalMilliseconds);

    }
}

public static class StringExtensions
{
    public static bool Contains(this string source, string toCheck, StringComparison comp)
    {
        return source.IndexOf(toCheck, comp) >= 0;
    }
}

}

Community
  • 1
  • 1
hspain
  • 17,528
  • 5
  • 19
  • 31
  • 7
    Is this a real performance bottleneck for you? – Oded Oct 13 '11 at 20:16
  • 2
    Were you benchmarking in release mode with optimizations after a jitter warmup and no debugger attached and over enough iterations to substantively prove a difference? – Anthony Pegram Oct 13 '11 at 20:18
  • 3
    You need to be careful about which locale you're using. Comparison rules differ from culture to culture. For example `STRING` and `string` don't match when using Turkish culture. – CodesInChaos Oct 13 '11 at 20:19
  • `String.Empty.Contains(String.Empty); // true` – Marc Oct 13 '11 at 20:19
  • 2
    Please show the benchmarking code. And note that calling `ToUpper` will result in some pretty unexpected behaviour in some cultures. – Jon Skeet Oct 13 '11 at 20:19
  • It was a bottleneck before some refactoring to eliminate the test. The code was doing a 2000+ Contains() requests on strings of over 1000 characters at times. Delays were sub second, but being a part of the UI, any delay was too much. While the problem was mitigated, I'd still like to know the fastest way for future reference. – hspain Oct 13 '11 at 20:19
  • The benchmarks were done in release mode in batches of 10,000 tests spread over 10 runs. Like I said though, they were just preliminary and could definitely be wrong, which is why I pose the question here for expert guidance. I'll post some of my code here shortly to show what I'm working with. It's really not much more than what I posted in the question however. – hspain Oct 13 '11 at 20:21
  • I assume 2) becomes slow for long strings due to the allocation landing on the LOH. – CodesInChaos Oct 13 '11 at 20:23
  • Your RegEx is straight out of the question you referenced. It is important to note a problem cHao identified therein: If the string you are searching for contains punctuation or characters that make an invalid RegEx, it will blow up. The RegEx solution is NOT a good general solution. – aaaa bbbb Oct 13 '11 at 20:24
  • Also I believe that 1) and 2) are not equivalent even when they use the same culture. Unicode is that strange. – CodesInChaos Oct 13 '11 at 20:28
  • @JonSkeet I thought that ToUpper was a relatively culture safe way of comparing, while ToLower can cause some problems with certain cultures. It is good to know that this is not the case and to beware of both. – hspain Oct 13 '11 at 20:31
  • @hspain: No, both have problems - particularly in Turkey. – Jon Skeet Oct 13 '11 at 20:35
  • 3
    @hspain: For a start you're not benchmarking for long enough, and for a second think you're using `DateTime.Now` which isn't a high-resolution timer. You should use `Stopwatch` instead. – Jon Skeet Oct 13 '11 at 20:45
  • @JonSkeet Much appreciated for the guidance! So I changed it to use a Stopwatch instead of DateTime and ran it over 5 million strings instead of 50,000. My results are relatively the same, however. 1.616 seconds for ToLower(), 1.497 seconds for ToUpper() and 14.217 seconds for StringExtensions.Contains(). – hspain Oct 13 '11 at 20:51
  • 4
    These guys are right -- you need to be very careful about deciding up front whose case-sensitivity rules you are using. **Get the code correct before you get it fast.** – Eric Lippert Oct 13 '11 at 20:53
  • @EricLippert I understand now that ToLower and ToUpper will get you false results in certain cases. Right now, I'm just want to find out why some methods are slower than others out of curiosity. – hspain Oct 13 '11 at 20:55
  • The performance is the same of the order in which they're written. – Matthew Oct 13 '11 at 20:15

3 Answers3

17

Since ToUpper would actually result in a new string being created, StringComparison.OrdinalIgnoreCase would be faster, also, regex has a lot of overhead for a simple compare like this. That said, String.IndexOf(String, StringComparison.OrdinalIgnoreCase) should be the fastest, since it does not involve creating new strings.

I would guess (there I go again) that RegEx has the better worst case because of how it evaluates the string, IndexOf will always do a linear search, I'm guessing (and again) that RegEx is using something a little better. RegEx should also have a best case which would likely be close, though not as good, as IndexOf (due to additional complexity in it's language).

15,000 length string, 10,000 loop

00:00:00.0156251 IndexOf-OrdinalIgnoreCase
00:00:00.1093757 RegEx-IgnoreCase 
00:00:00.9531311 IndexOf-ToUpper 
00:00:00.9531311 IndexOf-ToLower

Placement in the string also makes a huge difference:

At start:
00:00:00.6250040 Match
00:00:00.0156251 IndexOf
00:00:00.9687562 ToUpper
00:00:01.0000064 ToLower

At End:
00:00:00.5781287 Match
00:00:01.0468817 IndexOf
00:00:01.4062590 ToUpper
00:00:01.4218841 ToLower

Not Found:
00:00:00.5625036 Match
00:00:01.0000064 IndexOf
00:00:01.3750088 ToUpper
00:00:01.3906339 ToLower
Jake Berger
  • 5,237
  • 1
  • 28
  • 22
aepheus
  • 7,827
  • 7
  • 36
  • 51
  • 3
    I suppose I should have said "I am 99% sure but will not take the time to do profiling when the person asking the question should". – aepheus Oct 13 '11 at 20:23
  • I linked to that same post in my question. Also, I have noticed that there is a significant difference in performance, I'm looking for answers as to why. – hspain Oct 13 '11 at 20:25
  • Appreciate you posting your numbers as well. I get similar numbers when the string is found. However, do you get similar results when the string is not found? This is where I get the performance number inversion. – hspain Oct 13 '11 at 20:52
  • Not found would always be worst case scenario, will run and post those as well. – aepheus Oct 13 '11 at 20:54
  • I see that even in the bad/worse case scenarios, the IndexOf method still gets the best performance for you. I'm trying hard to figure out why that is not the case for my benchmark code. What type of StringComparison are you passing into the IndexOf? I am using StringComparison.OrdinalIgnoreCase. – hspain Oct 13 '11 at 21:09
  • aReallyLongString.IndexOf(aShortPart,StringComparison.OrdinalIgnoreCase); your benchmark example would also have additional overhead of the function which wraps it, the assignment of the boolean, and the comparison, I am simply calling the functions and not converting results into a boolean value or assigning that value to anything. – aepheus Oct 13 '11 at 21:12
  • @aepheus Thanks, you got it. It was the static function call that was doing it. I should have realized that. Problem solved. – hspain Oct 13 '11 at 21:22
1

I have found that a compiled RegEx is by far the fastest solution and is obviously much more versatile. Compiling it helps put it on par with smaller string comparisons and as you stated, there is no comparison with larger strings.

http://www.dijksterhuis.org/regular-expressions-advanced/ contains some hints to gain maximum speed from RegEx comparisons; you might find it helpful.

Joe Mancuso
  • 2,099
  • 1
  • 16
  • 17
  • I've found the same thing. On a 500 MB dataset, string.IndexOf(Ordinal.IgnoreCase) was taking 12 seconds, but the equivalent compiled Regex was taking between 2 and 4 seconds (depending on the length of the comparison string). – Bryce Wagner Jul 27 '12 at 16:00
0

This was interesting question to me, so I have created a little test using different methods

string content = "";
            for (var i = 0; i < 10000; i++)
                content = String.Format("{0} asldkfjalskdfjlaskdfjalskdfj laksdf lkwiuirh 9238 r9849r8 49834", content);

            string test = String.Format("{0} find_me {0}", content);

            string search = test;

            var tickStart = DateTime.Now.Ticks;
            //6ms
            //var b = search.ToUpper().Contains("find_me".ToUpper());

            //2ms
            //Match m = Regex.Match(search, "find_me", RegexOptions.IgnoreCase);


            //a little bit over 1ms
            var c = false;
            if (search.Length == search.ToUpper().Replace("find_me".ToUpper(), "x").Length)
                c = true;
            var tickEnd = DateTime.Now.Ticks;
            Debug.Write(String.Format("{0} {1}", tickStart, tickEnd));

So what I ahve done is created a string and searched in it

first method search.ToUpper().Contains("find_me".ToUpper()) 5ms

second method Match m = Regex.Match(search, "find_me", RegexOptions.IgnoreCase) 2ms

third method

if (search.Length == search.ToUpper().Replace("find_me".ToUpper(), "x").Length)
                c = true;

it took little more than 1ms

Senad Meškin
  • 13,597
  • 4
  • 37
  • 55