1

I have two text boxes, one for the input and another for the output. I need to filter only Hexadecimals characters from input and output it in uppercase. I have checked that using Regular Expressions (Regex) is much faster than using loop.

My current code to uppercase first then filter the Hex digits as follow:

string strOut = Regex.Replace(inputTextBox.Text.ToUpper(), "[^0-9^A-F]", "");
outputTextBox.Text = strOut;

An alternatively:

string strOut = Regex.Replace(inputTextBox.Text, "[^0-9^A-F^a-f]", "");
outputTextBox.Text = strOut.ToUpper();

The input may contain up to 32k characters, therefore speed is important here. I have used TimeSpan to measure but the results are not consistent.

My question is: which code has better speed performance and why?

David
  • 3,957
  • 2
  • 28
  • 52
  • Are you _sure_ that this isn't a case of premature optimization? Do you expect the text box to be populated with 32K characters more than 1000 times per second? – HABO May 29 '13 at 01:01

3 Answers3

3

This is definitely a case of premature optimization: 32K characters is not a big deal for finely tuned regex engines running on modern computers, so this optimization task is mostly theoretical.

Before discussing the performance, it's worth pointing out that the expressions are probably not doing what you want, because they allow ^ characters into the output. You need to use [^0-9A-F] and [^0-9A-Fa-f] instead.

The speed of the two regexes will be identical, because the number of characters in a character class hardly makes a difference. However, the second combination ToUpper call will be called on a potentially shorter string, because all invalid characters will be removed. Therefore, the second option is potentially slightly faster.

However, if you must optimize this to the last CPU cycle, you can rewrite this without regular expressions to avoid a memory allocation in the ToUpper: walk through the input string in a loop, and add all valid characters to StringBuilder as you go. When you see a lowercase character, convert it to upper case.

Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • "The number of characters in a character class hardly makes a difference" please read http://stackoverflow.com/questions/16621738/d-is-less-efficient-than-0-9, the whole reason `\d` is slower than `[0-9]` is because it includes more characters. – AJMansfield May 29 '13 at 01:17
  • @AJMansfield That's not because `\d` simply has more characters, but because the code points are spread out too much. In this case, however, the characters are near each other, so the additional check of lowercase characters would not take much longer. – Sergey Kalinichenko May 29 '13 at 01:19
  • OK, I see what you mean. Also, if you _really_ need to optimize it, you should be optimizing it for a GPU, not a standard CPU. – AJMansfield May 29 '13 at 01:25
1

It's simple to test:

using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string letters = "abcdefghijklmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ";
            Random random = new Random();
            string[] strings = Enumerable.Range(0, 5000).Select(i1 => string.Join("", Enumerable.Range(0,32000).Select(i2 => letters[random.Next(0, letters.Length - 1)]))).ToArray();

            Stopwatch stopwatchA = new Stopwatch();
            stopwatchA.Start();

            foreach (string s in strings)
                Regex.Replace(s.ToUpper(), "[^0-9^A-F]", "");

            stopwatchA.Stop();

            Stopwatch stopwatchB = new Stopwatch();
            stopwatchB.Start();

            foreach (string s in strings)
                Regex.Replace(s, "[^0-9^A-F^a-f]", "").ToUpper();

            stopwatchB.Stop();

            Debug.WriteLine("stopwatchA: {0}", stopwatchA.Elapsed);
            Debug.WriteLine("stopwatchB: {0}", stopwatchB.Elapsed);
        }
    }
}

Run 1:

stopwatchA: 00:00:39.6552012

stopwatchB: 00:00:40.6757048

Run 2:

stopwatchA: 00:00:39.7022437

stopwatchB: 00:00:41.3477625

In those to runs, the first approach is faster.

lightbricko
  • 2,649
  • 15
  • 21
  • This is not a good benchmark, because nothing gets thrown away: both expressions pass everything through. You need to add a fair number of non-letters into the mix, and see how the two expressions compare after that. – Sergey Kalinichenko May 29 '13 at 01:22
  • dasblinkenlight: The expressions only pass these through to get replaced by an empty string: ghijklmnopqrstuvxyzGHIJKLMNOPQRSTUVXYZ – lightbricko May 29 '13 at 01:56
0

On the theoretical size, string.ToUpper() can potentially allocate a new string (remember that .NET strings are semantically immutable), but the regular expression on the other hand conserves memory, i.e. it should be faster in the general case (for large strings). Also the input string will be looped through twice if using the toUpper() call.

Sebastian
  • 6,293
  • 6
  • 34
  • 47