5

I noticed a few questions about finding the nth occurrence of a character in a string. Since I was curious (and have several uses for this in an application, but mainly out of curiosity), I coded and benchmarked two of these methods in Visual Studio 2010, and I'm wondering why method 1 (FindNthOccurrence) is much slower than method 2 (IndexOfNth). The only reasons I could think of were:

  1. A problem with my benchmarking code
  2. A problem with my algorithm(s)
  3. The fact that indexOf is a in-built .NET method and is therefore already optimised

I'm leaning toward #2, but I'd still like to know. This is the relevant code.

Code

class Program
    {
        static void Main(string[] args)
        {
            char searchChar = 'a';
            Random r = new Random(UnixTimestamp());

            // Generate sample data
            int numSearches = 100000, inputLength = 100;
            List<String> inputs = new List<String>(numSearches);
            List<int> nth = new List<int>(numSearches);
            List<int> occurrences = new List<int>(numSearches);
            for (int i = 0; i < numSearches; i++)
            {
                inputs.Add(GenerateRandomString(inputLength, "abcdefghijklmnopqrstuvwxyz"));
                nth.Add(r.Next(1, 4));
            }

            // Timing of FindNthOccurrence
            Stopwatch timeFindNth = Stopwatch.StartNew();
            for (int i = 0; i < numSearches; i++)
                occurrences.Add(FindNthOccurrence(inputs[i], searchChar, nth[i]));
            timeFindNth.Stop();

            Console.WriteLine(String.Format("FindNthOccurrence: {0} / {1}",
                                            timeFindNth.ElapsedMilliseconds, timeFindNth.ElapsedTicks));

            // Cleanup
            occurrences.Clear();

            // Timing of IndexOfNth
            Stopwatch timeIndexOf = Stopwatch.StartNew();
            for (int i = 0; i < numSearches; i++)
                occurrences.Add(IndexOfNth(inputs[i], searchChar, nth[i]));
            timeIndexOf.Stop();
            Console.WriteLine(String.Format("IndexOfNth: {0} / {1}",
                                            timeIndexOf.ElapsedMilliseconds, timeIndexOf.ElapsedTicks));

            Console.Read();
        }

        static int FindNthOccurrence(String input, char c, int n)
        {
            int len = input.Length;
            int occurrences = 0;
            for (int i = 0; i < len; i++)
            {
                if (input[i] == c)
                {
                    occurrences++;
                    if (occurrences == n)
                        return i;
                }
            }
            return -1;
        }

        static int IndexOfNth(String input, char c, int n)
        {
            int occurrence = 0;
            int pos = input.IndexOf(c, 0);
            while (pos != -1)
            {
                occurrence++;
                if (occurrence == n)
                    return pos;
                pos = input.IndexOf(c, pos + 1);
            }
            return -1;
        }

            // Helper methods
        static String GenerateRandomString(int length, String legalCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
        {
            if (length < 0) throw new ArgumentOutOfRangeException("length", "length cannot be less than zero.");
            if (string.IsNullOrEmpty(legalCharacters))
                throw new ArgumentException("allowedChars may not be empty.");

            const int byteSize = 0x100;
            var legalCharSet = new HashSet<char>(legalCharacters).ToArray();
            if (byteSize < legalCharSet.Length)
                throw new ArgumentException(String.Format("allowedChars may contain no more than {0} characters.", byteSize));

            // Guid.NewGuid and System.Random are not particularly random. By using a
            // cryptographically-secure random number generator, the caller is always
            // protected, regardless of use.
            using (var rng = new System.Security.Cryptography.RNGCryptoServiceProvider())
            {
                StringBuilder result = new StringBuilder();
                var buf = new byte[128];
                while (result.Length < length)
                {
                    rng.GetBytes(buf);
                    for (var i = 0; i < buf.Length && result.Length < length; ++i)
                    {
                        // Divide the byte into legalCharSet-sized groups. If the
                        // random value falls into the last group and the last group is
                        // too small to choose from the entire legalCharSet, ignore
                        // the value in order to avoid biasing the result.
                        var outOfRangeStart = byteSize - (byteSize % legalCharSet.Length);
                        if (outOfRangeStart <= buf[i]) continue;
                        result.Append(legalCharSet[buf[i] % legalCharSet.Length]);
                    }
                }
                return result.ToString();
            }
        }

        static int UnixTimestamp()
        {
            TimeSpan ts = (System.DateTime.UtcNow - new System.DateTime(1970, 1, 1, 0, 0, 0));
            return (int)ts.TotalSeconds;
        }
    }

Sample Output

Every result outputs times similar to this (milliseconds / elapsed ticks):

FindNthOccurrence: 27 / 79716
IndexOfNth: 12 / 36492
Community
  • 1
  • 1
Ricardo Altamirano
  • 14,650
  • 21
  • 72
  • 105
  • There is definitly a problem with your benchmarking: You include the JIT time - run n times (in a loop ... ONE run of the exe) and discard the first 2, average the rest – Eugen Rieck Jul 24 '12 at 16:18
  • @EugenRieck Just to clarify, are you saying that for each method test, run that method `n` times in a loop, discard the first two of those runs, and average the rest? Do the same for the second method? – Ricardo Altamirano Jul 24 '12 at 16:23
  • 1
    They run in about the same time on ideone ([link](http://ideone.com/m59EJ)). – Sergey Kalinichenko Jul 24 '12 at 16:26
  • I get average time of roughly `63 / 909044` and `27 / 389759`. – mellamokb Jul 24 '12 at 16:31
  • Would it make more sense to run the functions several times *outside* the loop before, to get the initialisation out of the way and therefore get a better, albeit rough, estimate, or should I still refactor the code to average the execution times? – Ricardo Altamirano Jul 24 '12 at 16:33
  • 2
    Browsing through the source code of `System.String` with Reflector, it appears an internal `IndexOf` method is called which is defined as `public extern int IndexOf(char value, int startIndex, int count);`. So it is calling out to some internal unmanaged code. – mellamokb Jul 24 '12 at 16:36
  • @mellamokb The unmanaged code provides the speed boost, I take it? – Ricardo Altamirano Jul 24 '12 at 18:31
  • @RicardoAltamirano: That would be my guess, yes. – mellamokb Jul 24 '12 at 18:33
  • @mellamokb If you want, post it as an answer and I'll accept it. I'd appreciate any feedback on the benchmarking code too, even though I realise that the question itself was directly asking about the difference in speed. – Ricardo Altamirano Jul 24 '12 at 18:34

2 Answers2

1

I'm sure you run a debug build. switch to a release build. both methods take about the same time.

Arne
  • 2,106
  • 12
  • 9
  • 1
    When I switch to release build, they are *faster*, but still about the same factor difference. In Debug, I get `63` vs `27` (a factor of 2.33); In release, I get `47` vs `20` (a factor of 2.35). In LINQPad I also get `63` vs `27`. – mellamokb Jul 24 '12 at 16:40
  • In ideone (Mono) they run about the same time - see the link in @dasblinkenlight's comment to the question. I also ran the code in Visual Studio 2010: 13/12 and Visual Studio Express C# 2008: 16/15. Be sure to "start without debugging"/Ctrl-F5, or directly start the executable. I tested on two different Intel CPUs, fully updated .Net Runtime, one running Windows 7 64bit, the other Windows Vista 32bit. – Arne Jul 24 '12 at 18:43
  • BTW, I am not the question poster :) Note that the OP has a highlighted username. – mellamokb Jul 24 '12 at 18:45
  • @mellamokb ok, ok changed the comment :) – Arne Jul 24 '12 at 18:46
1

Browsing through the source code of System.String with Reflector, it appears an IndexOf method is called which is defined as:

public extern int IndexOf(char value, int startIndex, int count);

So it is calling out to some internal unmanaged code, which probably provides the observed speed boost. It is unlikely you will be able to get any faster with managed code.

mellamokb
  • 56,094
  • 12
  • 110
  • 136
  • Thanks for the help. I need to get myself a copy of Reflector. It would make simple curiosities like this a lot easier for me to answer myself. – Ricardo Altamirano Jul 24 '12 at 19:09