0

Is the regex code in VB.net known to be slow?

I took over some code that was cleaning large amounts of text data. The code ran fairly slow, so I was looking for some ways to speed it up. I found a couple functions that got run a lot that I thought might be part of the problem.

Here's the original code for cleaning a phone number:

        Dim strArray() As Char = strPhoneNum.ToCharArray
        Dim strNewPhone As String = ""
        Dim i As Integer

        For i = 0 To strArray.Length - 1
            If strArray.Length = 11 And strArray(0) = "1" And i = 0 Then
                Continue For
            End If

            If IsNumeric(strArray(i)) Then
                strNewPhone = strNewPhone & strArray(i)
            End If
        Next

        If Len(strNewPhone) = 7 Or Len(strNewPhone) = 10 Then
            Return strNewPhone
        End If

I rewrote the code to eliminate the array and looping using regex.

        Dim strNewPhone As String = ""
        strNewPhone = Regex.Replace(strPhoneNum, "\D", "")
        If strNewPhone = "" OrElse strNewPhone.Substring(0, 1) <> "1" Then
            Return strNewPhone
        Else
            strNewPhone = Mid(strNewPhone, 2)
        End If

        If Len(strNewPhone) = 7 Or Len(strNewPhone) = 10 Then
            Return strNewPhone
        End If

After running a couple tests, the new code is significantly slower than the old. Is regex in VB.net slow, did I add some other thing that is the issue, or is the original code just fine the way it was?

KevenDenen
  • 1,718
  • 13
  • 25
  • No, its just as fast as all other .NET languages. Make sure you are actually measuring correctly, and measuring the thing you think you are measuring, and that it even matters to measure the thing you are measuring. – StingyJack Aug 19 '13 at 17:24
  • More info here http://stackoverflow.com/questions/457605/how-to-measure-code-performance-in-net – StingyJack Aug 19 '13 at 17:25
  • and here... http://c2.com/cgi/wiki?PrematureOptimization – StingyJack Aug 19 '13 at 17:27
  • Ah, here is the post I was looking for. http://stackoverflow.com/q/6396056/16391 – StingyJack Aug 19 '13 at 17:36
  • This code block is the only change I made in the version I am testing against the original. I'm not sure I understand what you mean when you say that I'm not measuring correctly. If this is the only change, shouldn't the issue lie here? – KevenDenen Aug 19 '13 at 17:47
  • There are several statements in that method that are significantly different could be the problem. You are assuming its the regex, but it could be the Len()'s, the Mid(), Substring(), conditional statements, compilation in debug mode (the debugger can lie/heisenbug), first time JIT, etc. Any/all of those could be measured to see where the problem is. – StingyJack Aug 19 '13 at 20:24

3 Answers3

3

I conducted some tests with the Visual Studio Profiler and I did not get the same results you did. There was a logical error is your Regex function that caused the length check to be missed if the number didn't begin with 1. I corrected that in my tests.

  1. I realized in my tests, that whatever function went first and last would suffer a penalty. So I executed each function independently and had a priming function run before.
  2. Depending on the tests I executed the function either 10000 or 100000 times with a phone number like pattern of varying length. Each method got the same numbers.

Results

In general my method was always slightly faster.

  1. I did a cheap timer test, the Original function was twice as slow.
  2. Profiler showed the Original Method used about 60% more memory than our methods.
  3. Profiler showed the Original Method took eight times as long to work.
  4. Profiler showed the Original Method took about 40% more processor cycles.

My Conclusion

In all tests the Original method was much slower. Had it come out better in one test then I be able to explain our discrepancy. Ff you tested those methods in total isolation I think you will come up with something similar.

My best guess is something else was effecting your results and that your assessment that the Original method was better is false.

Your Revised Function

Function GetPhoneNumberRegex(strPhoneNum As String)
    Dim strNewPhone As String = ""
    strNewPhone = Regex.Replace(strPhoneNum, "\D", "")
    If strNewPhone <> "" And strNewPhone.Substring(0, 1) = "1" Then
        strNewPhone = Mid(strNewPhone, 2)
    End If

    If Len(strNewPhone) = 7 Or Len(strNewPhone) = 10 Then
        Return strNewPhone
    End If

    Return ""
End Function

My Function

Function GetPhoneNumberMine(strPhoneNum As String)
    Dim strNewPhone As String = Regex.Replace(strPhoneNum, "\D", "")
    If (strNewPhone.Length >= 7 And strNewPhone(0) = "1") Then
        strNewPhone = strNewPhone.Remove(0, 1)
    End If

    Return If(strNewPhone.Length = 7 OrElse strNewPhone.Length = 10, strNewPhone, "")
End Function
Daniel Gimenez
  • 18,530
  • 3
  • 50
  • 70
  • Welp, I'm not sure why it all timed longer the first time through. I ran 5 tests of 200,000 records and my version took between 3 and 5 minutes longer. I'll test it out again with your cleaner version and see what I come up with. – KevenDenen Aug 19 '13 at 20:53
  • Woah! 100,000 tests took about three (maybe it was 30) seconds for my cheap timer test. My processor and mobo are a year old, and were modest at the time. Also I am using .Net 4.5. – Daniel Gimenez Aug 19 '13 at 20:59
  • I wasn't just running this function in isolation. I was running it as a part of the larger tool. This function gets called over and over again throughout this tool. It's the most commonly called function, so it's where I started. – KevenDenen Aug 19 '13 at 21:20
1

Doing repeating things like this will slow you down, if its hitting this condition.

 If Len(strNewPhone) = 7 Or Len(strNewPhone) = 10 Then
     Return strNewPhone
 End If

Instead, do this...

     Dim value = Len(strNewPhone)
     If value = 7 OrElse value = 10 Then
         Return strNewPhone
     End If

But you should still be measuring the individual pieces (conditions/statements) to determine which of them is slowing you down, but only if it really matters.

StingyJack
  • 19,041
  • 10
  • 63
  • 122
1

I don't know if you're seeing a real problem, but the code you show might very well be slower, because the regex is being freshly compiled each time you use it.

See if this is any better:

Regex rx = new Regex("\D") ' do this once, use it each time

Reference at MSDN

egrunin
  • 24,650
  • 8
  • 50
  • 93
  • He isnt looping characters in the second one. – StingyJack Aug 19 '13 at 17:34
  • Yes, but the assembly translation of his looping/testing version is very simple and very fast. Some regex implementations have a surprisingly slow compilation step. – egrunin Aug 19 '13 at 17:35
  • The compilation cost will be incurred only once in the second example. I suspect that if he measures properly, the actual problem will be more apparent. – StingyJack Aug 19 '13 at 17:38
  • I understand, but I was (perhaps mistakenly) assuming he was measuring the two code examples called 1000 times each. Regardless, @Daniel Gimenez has given this the answer it needs. – egrunin Aug 19 '13 at 20:28