16

Don't ask how I got there, but I was playing around with masking, loop unrolling, etc. In any case, out of interest I was thinking about how I would implement an IndexOf method, and long story short, all that masking aside, this naive implementation:

public static unsafe int IndexOf16(string s, int startIndex, char c) {
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    fixed (char* cs = s) {
        for (int i = startIndex; i < s.Length; i++) {
            if (cs[i] == c) return i;
        }
        return -1;
    }
}

is faster than string.IndexOf(char). I wrote some simple tests, and it seems to match the output exactly. Some sample numbers from my machine (they vary to some degree of course, but the trend is clear):

short haystack, 500k runs:
1741 ms for IndexOf16
2737 ms for IndexOf32
2963 ms for IndexOf64
2337 ms for string.IndexOf <-- built-in

longer haystack:
2888 ms for IndexOf16
3028 ms for IndexOf32
2816 ms for IndexOf64
3353 ms for string.IndexOf <-- built-in
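For reference, numbers like these can be produced with a harness along the following lines. This is only a sketch, not the original test code (that is at the paste2.org link); the haystack contents, run count, and the Measure helper are assumptions. Compile with /unsafe.

```csharp
using System;
using System.Diagnostics;

static class IndexOfBench {
    // The naive implementation from the question.
    public static unsafe int IndexOf16(string s, int startIndex, char c) {
        if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
        fixed (char* cs = s) {
            for (int i = startIndex; i < s.Length; i++) {
                if (cs[i] == c) return i;
            }
            return -1;
        }
    }

    static void Measure(string name, int runs, Func<int> body) {
        body(); // warm-up: JIT-compile and page in the code before timing
        var sw = Stopwatch.StartNew();
        int sink = 0; // consume the result so the call is not optimized away
        for (int i = 0; i < runs; i++) sink ^= body();
        sw.Stop();
        Console.WriteLine("{0} ms for {1}", sw.ElapsedMilliseconds, name);
    }

    static void Main() {
        string haystack = new string('a', 100) + "b"; // hypothetical "short haystack"
        Measure("IndexOf16", 500000, () => IndexOf16(haystack, 0, 'b'));
        Measure("string.IndexOf", 500000, () => haystack.IndexOf('b'));
    }
}
```

The warm-up call matters for exactly the paging/JIT reasons discussed in the answers below the question.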

IndexOfChar is marked extern, so you can't inspect it with Reflector. However, I think this should be the (native) implementation: http://www.koders.com/cpp/fidAB4768BA4DF45482A7A2AA6F39DE9C272B25B8FE.aspx?s=IndexOfChar#L1000

They seem to use the same naive implementation.

Questions come to my mind:

1) Am I missing something in my implementation that explains why it's faster? I can only think of extended character support, but their implementation suggests they don't do anything special for that either.

2) I assumed many of the low-level methods would ultimately be implemented in hand-written assembly; that seems not to be the case. If so, why implement it natively at all, instead of just in C# like my sample implementation?

(Complete test here (I think it's too long to paste here): http://paste2.org/p/1606018 )

(No, this is not premature optimization; it's not for a project, I am just messing about.) :-)

Update: Thanks to Oliver for the hint about the null check and the count parameter. I have added these to my IndexOf16 implementation like so:

public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
    if (s == null) throw new ArgumentNullException("s");
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    if (count == -1) count = s.Length - startIndex;
    if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");

    int endIndex = startIndex + count;
    fixed (char* cs = s) {
        for (int i = startIndex; i < endIndex; i++) {
            if ((cs[i]) == c) return i;
        }
        return -1;
    }
}

The numbers changed slightly, however it is still quite significantly faster (32/64 results omitted):

short haystack, 500k runs:
1908 ms for IndexOf16
2361 ms for string.IndexOf

longer haystack:
3061 ms for IndexOf16
3391 ms for string.IndexOf

Update2: This version is faster yet (especially for the long haystack case):

public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
    if (s == null) throw new ArgumentNullException("s");
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    if (count == -1) count = s.Length - startIndex;
    if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");

    int endIndex = startIndex + count;
    fixed (char* cs = s) {
        char* cp = cs + startIndex;
        for (int i = startIndex; i < endIndex; i++, cp++) { // < not <=, or we read one char past the range
            if (*cp == c) return i;
        }
        return -1;
    }
}

Update 4: Based on the discussion with LastCoder I believe this to be architecture dependent. My Xeon W3550 at work seems to prefer this version, while his i7 seems to like the built-in version. My home machine (Athlon II) appears to be in between. I am surprised by the large difference, though.

chrisaut
3 Answers

4

Possibility 1) This may not hold (as true) in C#, but when I did optimization work for x86-64 assembly I quickly found out while benchmarking that calling code from a DLL (marked external) was slower than implementing the exact same function within my executable. The most obvious reason is paging and memory: the DLL (external) method is loaded far away in memory from the rest of the running code, and if it wasn't accessed previously it will need to be paged in. Your benchmarking code should do some warm-up loops of the functions you are benchmarking to make sure they are paged into memory before you time them.

Possibility 2) Microsoft tends not to optimize string functions to the fullest, so out-optimizing a native string length, substring, indexof, etc. isn't really unheard of. Anecdote: in x86-64 assembly I was able to create a version of WinXP64's RtlInitUnicodeString function that ran 2x faster in almost all practical use cases.

Possibility 3) Your benchmarking code shows that you're using the 2-parameter overload of IndexOf; this function likely calls the 3-parameter overload IndexOf(Char, Int32, Int32), which adds extra overhead to each call.
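If that forwarding is the cause, calling the 3-parameter overload directly should narrow the gap. A quick sketch comparing the two call shapes (the string and character are illustrative, not from the original benchmark):

```csharp
using System;

class OverloadDemo {
    static void Main() {
        string s = "haystack";
        // 2-parameter overload: likely forwards internally to IndexOf(char, int, int).
        int a = s.IndexOf('y');
        // 3-parameter overload called directly, skipping the forwarding step.
        int b = s.IndexOf('y', 0, s.Length);
        Console.WriteLine(a + " " + b); // prints "2 2"
    }
}
```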


This may be even faster because you're removing the i variable increment per iteration.

char* cp = cs + startIndex;
char* cpEnd = cs + endIndex;  // cs + endIndex, not cp + endIndex (which would overshoot by startIndex)
while (cp < cpEnd) {
    if (*cp == c) return (int)(cp - cs);  // cast needed: a pointer difference is a long
    cp++;
}
return -1;

edit: In reply regarding (2), for your curiosity: coded back in 2005 and used to patch the ntdll.dll of my WinXP64 machine. http://board.flatassembler.net/topic.php?t=4467

RtlInitUnicodeString_Opt: ;;rcx=buff rdx=ucharstr 77bytes
             xor    r9d,r9d
             test   rdx,rdx
             mov    dword[rcx],r9d
             mov    [rcx+8],rdx
             jz     .end
             mov    r8,rdx
   .scan:
             mov    eax,dword[rdx]

             test   ax,ax
             jz     .one
             add    rdx,4
             shr    eax,16
             test   ax,ax
             jz     .two
             jmp    .scan
   .two:
             add    rdx,2
   .one:
             mov    eax,0fffch
             sub    rdx,r8
             cmp    rdx,0fffeh
             cmovnb rdx,rax
             mov    [ecx],dx
             add    dx,2
             mov    [ecx+2],dx
             ret
   .end:
             retn  

edit 2: Running your example code (updated with your fastest version), string.IndexOf runs faster on my Intel i7, 4GB RAM, Win7 64-bit.

short haystack, 500k runs:
2590 ms for IndexOf16
2287 ms for string.IndexOf

longer haystack:
3549 ms for IndexOf16
2757 ms for string.IndexOf

Optimizations are sometimes very architecture dependent.

Louis Ricci
  • Thnx. 1) the code should be fully (jitted and I guess paged) by the time it is benchmarked, because its tested for correctness first. But good point. 2) I'd like to see that code :-) – chrisaut Aug 24 '11 at 13:21
  • 3) helps the inbuild one slightly, but not enough: short haystack 500k runs 1669 ms for IndexOf16 2319 ms for string.IndexOf with 2 2094 ms for string.IndexOf with 3 longer haystack: 2289 ms for IndexOf16 3238 ms for string.IndexOf with 2 3153 ms for string.IndexOf with 3 – chrisaut Aug 24 '11 at 13:30
  • Regarding your version, I had almost the exact same thing. :-) Nevertheless tried yours (u forgot a cast and the return -1). It is slower: ~2150/~2700 for short/long test Nice suggestion though, thnx – chrisaut Aug 24 '11 at 13:34
  • Writing code off the top of my head into a web browser is a longstanding bad habit of mine. Have you testing unrolling your loop? possibly 2x (I'll make an edit demonstrating). – Louis Ricci Aug 24 '11 at 14:57
  • Hmm, I have a Xeon W3550 (also Nehalem based) with 6GB on / x64. Weird, I'll try on my home machine. – chrisaut Aug 24 '11 at 16:04
  • This is on my home machine (Athlon II, 8GB, 7 x64) short haystack 500k runs 1893 ms for IndexOf16 2439 ms for IndexOf32 2434 ms for IndexOf64 1974 ms for string.IndexOf longer haystack: 3494 ms for IndexOf16 4183 ms for IndexOf32 3732 ms for IndexOf64 2548 ms for string.IndexOf This is kind of getting confusing, at this point I think I'll accept your answer, as being simply "architecture dependend". Thnx for the interesting discussion btw! – chrisaut Aug 24 '11 at 16:31
2

When you make such a micro-measurement, every single bit counts. Within the MS implementation (as seen in the link you provided) they also check whether s is null and throw an ArgumentNullException. It is also the implementation that includes the count parameter, so they additionally check whether count has a correct value and throw an ArgumentOutOfRangeException.

I think these little checks, which make the code more robust, are enough to make it a little bit slower when you call it so often in such a short time.

Oliver
  • Thnx for the hint and good point. I have added string null check and a count implementation. It does not significantly change the numbers. See my updated question. – chrisaut Aug 24 '11 at 09:41
1

This might have something to do with the fixed statement, as "It pins the location of the src and dst objects in memory so that they will not be moved by garbage collection," perhaps speeding up the method?

Also, "Unsafe code increases the performance by getting rid of array bounds checks." This could also be why.

The quotes above are taken from MSDN.
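Whether the bounds checks actually matter can be measured rather than argued: a safe counterpart of the question's IndexOf16 (a sketch for comparison, not code from either implementation) uses the string indexer instead of a pinned pointer, so timing the two against each other isolates the cost, if any, of the indexer's range check.

```csharp
using System;

static class SafeIndexOf {
    // Safe counterpart of the question's IndexOf16: no unsafe, no fixed,
    // just the string indexer (which performs a range check per access
    // unless the JIT can prove it redundant).
    public static int IndexOfSafe(string s, int startIndex, char c) {
        if (s == null) throw new ArgumentNullException("s");
        if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
        for (int i = startIndex; i < s.Length; i++) {
            if (s[i] == c) return i;
        }
        return -1;
    }
}
```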

Jethro
  • If anything fixed should make it slower compared to their native implementation as they probably have other ways to communicate with the gc. And of course since they are native, bounds check should not play a role. – chrisaut Aug 24 '11 at 09:39
  • @Steven, Could the answer not be "Unlike using dynamic array allocation, a fixed length array will be considered as part of the struct instead of just a reference. It also has the added bonus of being an unmanaged type and so a struct which uses it will be stack allocated by default. " So because your unmanaged code is running on the stack that it's faster then some managed code which might run on heap? – Jethro Aug 24 '11 at 09:59
  • I'm not quite sure what you mean. I'm not allocating any arrays, am I? What do you mean "running on the stack". AFAIK stack vs heap only plays a role in mem allocation. I assume neither implementation makes any heap allocations. Am I missing something? – chrisaut Aug 24 '11 at 10:05
  • @Steven, Take a look at this http://www.atalasoft.com/cs/blogs/rickm/archive/2008/04/15/improving-performance-through-stack-allocation-net-memory-management-part-2.aspx link. This post descibes the stack and heap. http://stackoverflow.com/questions/79923/what-and-where-are-the-stack-and-heap – Jethro Aug 24 '11 at 10:13
  • Uhmm, thnx. I know about stack and heap. Can you explain a bit why you think this matters here (in context)? – chrisaut Aug 24 '11 at 10:19
  • @Steven, well I'm thinking that strings get stored on the Heap, were as unmanaged pointers get stored on the Stack, and since the Stack is quicker thats why your unmanage code is quicker. Would you agree? – Jethro Aug 24 '11 at 10:35
  • But my version is not the unmanaged one. Theirs is. Also, if you look at their implementation, it does RefInterpretGetStringValuesDangerousForGC which also gets them a pointer to a char array, just like fixed does. During the loop, both versions have to resolve the pointer with an offset, which will point to the heap. So I dont think this plays a role here. – chrisaut Aug 24 '11 at 11:38