
I was optimizing Regex transformations of large strings. As some calls to Regex.Replace took significant time, I inserted conditional calls, something along these lines:

if (str.IndexOf("abc", 0, StringComparison.CurrentCultureIgnoreCase) >= 0)
    str1 = Regex.Replace(str, "abc...", string.Empty, RegexOptions.IgnoreCase);

To my surprise, the results were not convincing: sometimes better, sometimes not, sometimes even worse. So I measured the performance of case-insensitive IndexOf (I also tried StringComparison.OrdinalIgnoreCase) and found that it may be slower than Regex.Match, i.e. this test:

if (Regex.Match(str, "abc", RegexOptions.IgnoreCase).Success) ...

In particular, for a large string that did not contain the (ASCII) string "abc", I found that Regex.Match was 4x faster.

Even this procedure is faster than IndexOf:

string s1 = str.ToLower();
if (s1.IndexOf("abc") >= 0) ...

Anybody knows a better solution?

Jan Slodicka
  • Why do you first check if the string exists and then do the replace? – Magnus Oct 20 '11 at 16:11
  • Regex.Replace is a very costly operation and it makes sense to do an easy test first - mainly in cases when there is a significant probability that Regex won't find any match at all. – Jan Slodicka Oct 20 '11 at 16:16
  • In the meantime I coded a quasi-solution based on a ToLower transformation. This already helps substantially, because there are repeated Regex calls that can be conditionally tested this way. – Jan Slodicka Oct 20 '11 at 16:18

2 Answers


Because IndexOf is O(n*m), whereas Regex will likely be O(n+m) (where n = string length, m = search string length).

If you are serious about searching substrings, it would be useful to read about string search algorithms to be at least aware of the expected speeds (start with http://en.wikipedia.org/wiki/Substring_search ).
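To illustrate the preprocessing idea behind the faster algorithms, here is a minimal sketch of Boyer-Moore-Horspool adapted for case-insensitive search. This is not necessarily what Regex uses internally; it only shows how a precomputed shift table lets the scan skip several characters per mismatch (average-case sublinear, worst case still O(n*m)):

```csharp
using System;

class HorspoolDemo
{
    // Case-insensitive Boyer-Moore-Horspool substring search (sketch).
    static int IndexOfIgnoreCase(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0) return 0;
        if (m > text.Length) return -1;

        // Shift table: how far the window may jump when the last character
        // of the window is c. Preprocessing is O(m + alphabet size); the
        // full-char table costs 256 KB, acceptable for a sketch.
        var shift = new int[char.MaxValue + 1];
        for (int c = 0; c < shift.Length; c++) shift[c] = m;
        for (int j = 0; j < m - 1; j++)
        {
            shift[char.ToLowerInvariant(pattern[j])] = m - 1 - j;
            shift[char.ToUpperInvariant(pattern[j])] = m - 1 - j;
        }

        int i = 0;
        while (i <= text.Length - m)
        {
            // Compare the window right-to-left.
            int j = m - 1;
            while (j >= 0 && char.ToLowerInvariant(text[i + j]) == char.ToLowerInvariant(pattern[j]))
                j--;
            if (j < 0) return i;
            // Jump by the shift of the window's last character.
            i += shift[text[i + m - 1]];
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine(IndexOfIgnoreCase("xxABCxx", "abc")); // 2
        Console.WriteLine(IndexOfIgnoreCase("aaaaaa", "ab"));   // -1
    }
}
```

The interesting part is the last line of the scan loop: on a mismatch the window can advance by up to m characters at once, which is exactly the benefit a plain character-by-character IndexOf cannot get.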

Note: Culture-sensitive comparison may be significantly slower than Ordinal (depending on your scenario you may not be able to use the Ordinal versions).
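The gap is easy to measure yourself; a minimal sketch comparing the three comparison modes on a long haystack (exact timings will of course vary by runtime and platform):

```csharp
using System;
using System.Diagnostics;

class ComparisonBench
{
    static void Main()
    {
        // A worst case for the scan: a long haystack with the needle at the end.
        var text = new string('a', 10000) + "NEEDLE";
        const int Iterations = 10000;

        foreach (var cmp in new[]
        {
            StringComparison.Ordinal,
            StringComparison.OrdinalIgnoreCase,
            StringComparison.CurrentCultureIgnoreCase,
        })
        {
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < Iterations; i++)
                text.IndexOf("needle", cmp);   // Ordinal finds nothing here; the *IgnoreCase modes match at the end
            watch.Stop();
            Console.WriteLine("{0}: {1} ms", cmp, watch.ElapsedMilliseconds);
        }
    }
}
```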

As with any performance question, one needs to get out and measure. Comparing IndexOf with Regex.IsMatch, the clear winner for me is Regex. I did expect this behavior, as IndexOf does not perform any pre-compilation of the search string and has to use the naive O(n*m) algorithm, while Regex can use a much more optimized implementation.

Try measuring the following searches - I get almost a 5x difference in favor of Regex for 100K operations.

    static void Main(string[] args)
    {
        var stringToSearch = "AAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAAbb";
        Regex regExp = new Regex(stringToSearch, RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);

        var sourceText = "AAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAbAAAAAAAAAAAAAAAAAAAAAbb";

        const int Iterations = 100000;

        Stopwatch watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < Iterations; i++)
        {
            regExp.IsMatch(sourceText);
        }
        watch.Stop();
        Console.WriteLine("RegExp: {0} ms", watch.ElapsedMilliseconds);

        watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < Iterations; i++)
        {
            sourceText.IndexOf(stringToSearch, StringComparison.OrdinalIgnoreCase);
        }
        watch.Stop();
        Console.WriteLine("IndexOf: {0} ms", watch.ElapsedMilliseconds);
        Console.ReadLine();
    }
Alexei Levenkov
  • Does that mean .Net does not offer a simple, efficient case-insensitive substring search? Hard to believe. – Jan Slodicka Oct 20 '11 at 16:25
  • RegEx does perform better for repeated searches of long strings. – Alexei Levenkov Oct 20 '11 at 17:11
  • That's probably my case - searching, say, a 10K string for a small substring of roughly 5 characters. Let me just say that I measured a 5:1 difference. I'll continue tomorrow. – Jan Slodicka Oct 20 '11 at 17:25
  • Just an additional note - make sure you measure with your actual strings; my sample uses a search string that is likely to perform badly with a simple search and well with preprocessing algorithms. – Alexei Levenkov Oct 20 '11 at 18:47
  • One last note: I never considered string search a simple problem. But the special case of looking for an ASCII substring is without any doubt the most frequent one, and as such would deserve special treatment. Originally I took this for granted and expected that functions such as String.IndexOf would implement the needed optimizations. – Jan Slodicka Oct 21 '11 at 10:41

In the end I wrote my own function for match testing. Here it is:

private static bool HasAsciiSubstring(string str, string sub)
{
    char[] ss = sub.ToCharArray();

    // Small optimizations like these bring a ~30% speed-up.
    int ss_len = ss.Length;
    for (int j = 0; j < ss_len; ++j)
        ss[j] = (char)((byte)ss[j] | 32);

    byte ss0 = (byte)ss[0];
    int len = str.Length - ss_len;
    for (int i = 0; i <= len; ++i)  // <= so a match ending at the last character is found
    {
        if ((str[i] | 32) == ss0)
        {
            bool bOk = true;
            for (int j = 1; j < ss_len; ++j)
            {
                if ((str[i + j] | 32) != ss[j])
                {
                    bOk = false;
                    break;
                }
            }
            if (bOk)
                return true;
        }
    }

    return false;
}
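To sanity-check the helper, here is a condensed, self-contained version of the same loop (with the scan bound written as `i <= str.Length - sub.Length` so that an exact-length match is found), together with the guard pattern from the question. Note that the `| 32` trick folds case correctly only for ASCII letters, as the function's name implies:

```csharp
using System;
using System.Text.RegularExpressions;

class Demo
{
    // Condensed copy of the helper, for a runnable spot check.
    static bool HasAsciiSubstring(string str, string sub)
    {
        int m = sub.Length, last = str.Length - m;
        for (int i = 0; i <= last; ++i)
        {
            int j = 0;
            // x | 32 maps ASCII upper case to lower case ('A' -> 'a');
            // it is not a correct case fold outside ASCII letters.
            while (j < m && (str[i + j] | 32) == (sub[j] | 32)) ++j;
            if (j == m) return true;
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(HasAsciiSubstring("Hello World", "WORLD")); // True
        Console.WriteLine(HasAsciiSubstring("abc", "abc"));           // True (exact-length match)
        Console.WriteLine(HasAsciiSubstring("abc", "abd"));           // False

        // The guard pattern from the question: only run the costly
        // Regex.Replace when the literal part can occur at all.
        var str = "xx ABC yy";
        if (HasAsciiSubstring(str, "abc"))
            str = Regex.Replace(str, "abc", string.Empty, RegexOptions.IgnoreCase);
        Console.WriteLine(str); // "xx  yy"
    }
}
```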

This method delivered the best results (for selected large strings) on the WP7 platform. It was:

  • 2x faster than Regex.Match(str, sub, RegexOptions.IgnoreCase)
  • roughly 10x faster than str.IndexOf(sub, 0, StringComparison.CurrentCultureIgnoreCase)
  • str.IndexOf(sub, 0, StringComparison.OrdinalIgnoreCase) was another 30% slower than CurrentCultureIgnoreCase.

I need code that will perform satisfactorily on WP7, MonoDroid/Touch, and possibly on the desktop. Hence I wrote a small WPF app to test at least the desktop:

  • Regex.Match() 2x faster than HasAsciiSubstring().
  • OrdinalIgnoreCase 2.5x slower than CurrentCultureIgnoreCase
  • The rest is similar

I am relatively new to C#/.Net (only 2 years of experience), but I programmed for dozens of years on other platforms (mostly C/C++). I have to say that this was a shocking experience for me:

  • Inefficient C# string functions
  • Inefficient C# as a language tool. (I am pretty sure that the same code written in C++ would perform much better.)
  • This complex culture business and the fact that CurrentCultureIgnoreCase performed much better than OrdinalIgnoreCase. (It looks like every experienced C# programmer - including those around me - claimed just the opposite.)
Jan Slodicka
  • Definitely. Using pointer arithmetic must be a huge time saver. If I do so, I'll report the results. – Jan Slodicka Oct 21 '11 at 14:01
  • I gave it a quick try, but was disappointed by 2 things: a) To get the char pointer you need to call str.ToCharArray, which is a relatively expensive call. b) You can't do true pointer arithmetic - I mean, expressions such as *ptr++ are illegal. Maybe I missed something, but right now I would say that unsafe code won't help much. – Jan Slodicka Oct 21 '11 at 15:37
  • Check out [this](http://stackoverflow.com/questions/228038/best-way-to-reverse-a-string-in-c-sharp-2-0/228376#228376) example of an unsafe string manipulation function. – Magnus Oct 21 '11 at 22:07
  • Thanks, I did not realize that it is possible to define an unfixed pointer inside an unsafe block. After I did this, I got a small speed improvement - less than expected. There is probably more room for improvement, as pointer arithmetic allows writing a more efficient algorithm. Unfortunately I found that unsafe code is useless for me, as SVL does not allow it. Thanks again for your help. – Jan Slodicka Oct 24 '11 at 09:37