Why would I see ~20% speed increase using native code?

Question

Any idea why this code:

extern "C" __declspec(dllexport) void Transform(double x[], double y[], int iterations, bool forward)
{
    long n, i, i1, j, k, i2, l, l1, l2;
    double c1, c2, tx, ty, t1, t2, u1, u2, z;

    /* Calculate the number of points */
    n = (long)pow((double)2, (double)iterations);

    /* Do the bit reversal */
    i2 = n >> 1;
    j = 0;
    for (i = 0; i < n - 1; ++i)
    {
        if (i < j)
        {
            tx = x[i];
            ty = y[i];
            x[i] = x[j];
            y[i] = y[j];
            x[j] = tx;
            y[j] = ty;
        }
        k = i2;
        while (k <= j)
        {
            j -= k;
            k >>= 1;
        }
        j += k;
    }

    /* Compute the FFT */
    c1 = -1.0; 
    c2 = 0.0;
    l2 = 1;
    for (l = 0; l < iterations; ++l)
    {
        l1 = l2;
        l2 <<= 1;
        u1 = 1; 
        u2 = 0;
        for (j = 0; j < l1; j++) 
        {
            for (i = j; i < n; i += l2) 
            {
                i1 = i + l1;
                t1 = u1 * x[i1] - u2 * y[i1];
                t2 = u1 * y[i1] + u2 * x[i1];
                x[i1] = x[i] - t1; 
                y[i1] = y[i] - t2;
                x[i] += t1;
                y[i] += t2;
            }
            z = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }
        c2 = sqrt((1.0 - c1) / 2.0);
        if (forward) 
            c2 = -c2;
        c1 = sqrt((1.0 + c1) / 2.0);
    }

    /* Scaling for forward transform */
    if (forward)
    {
        for (i = 0; i < n; ++i)
        {
            x[i] /= n;
            y[i] /= n;
        }
    }
}

runs 20% faster than this code?

public static void Transform(DataSet data, Direction direction)
{
    double[] x = data.Real;
    double[] y = data.Imag;
    data.Direction = direction;
    data.ExtremeImag = 0.0;
    data.ExtremeReal = 0.0;
    data.IndexExtremeImag = 0;
    data.IndexExtremeReal = 0;

    long n, i, i1, j, k, i2, l, l1, l2;
    double c1, c2, tx, ty, t1, t2, u1, u2, z;

    /* Calculate the number of points */
    n = (long)Math.Pow(2, data.Iterations);

    /* Do the bit reversal */
    i2 = n >> 1;
    j = 0;
    for (i = 0; i < n - 1; ++i)
    {
        if (i < j)
        {
            tx = x[i];
            ty = y[i];
            x[i] = x[j];
            y[i] = y[j];
            x[j] = tx;
            y[j] = ty;
        }
        k = i2;
        while (k <= j)
        {
            j -= k;
            k >>= 1;
        }
        j += k;
    }

    /* Compute the FFT */
    c1 = -1.0; 
    c2 = 0.0;
    l2 = 1;
    for (l = 0; l < data.Iterations; ++l)
    {
        l1 = l2;
        l2 <<= 1;
        u1 = 1; 
        u2 = 0;
        for (j = 0; j < l1; j++) 
        {
            for (i = j; i < n; i += l2) 
            {
                i1 = i + l1;
                t1 = u1 * x[i1] - u2 * y[i1];
                t2 = u1 * y[i1] + u2 * x[i1];
                x[i1] = x[i] - t1; 
                y[i1] = y[i] - t2;
                x[i] += t1;
                y[i] += t2;
            }
            z = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }
        c2 = Math.Sqrt((1.0 - c1) / 2.0);
        if (direction == Direction.Forward) 
            c2 = -c2;
        c1 = Math.Sqrt((1.0 + c1) / 2.0);
    }

    /* Scaling for forward transform */
    if (direction == Direction.Forward)
    {
        for (i = 0; i < n; ++i)
        {
            x[i] /= n;
            y[i] /= n;
            if (Math.Abs(x[i]) > data.ExtremeReal)
            {
                data.ExtremeReal = x[i];
                data.IndexExtremeReal = (int)i;
            }
            if (Math.Abs(y[i]) > data.ExtremeImag)
            {
                data.ExtremeImag = y[i];
                data.IndexExtremeImag = (int)i;
            }
        }
    }
}

FFT http://www.rghware.com/fft.png

I created the drop in CPU seen in the middle of the graph by selecting the “Native DLL FFT” in my app:

http://www.rghware.com/InstrumentTuner.zip (source code)

I think this will run on most PCs. You’ll need to have DirectX installed. I had some issues using the capture settings for certain hardware. The capture settings were supposed to be configurable, but the app’s development has been sidetracked by this interesting find.

Anyway, why I’m seeing a 20% increase in speed using the native code? This seems to fly in the face of some of the assumptions I previously had.

UPDATE

After converting the function to an unsafe method and fixing the long/int issue. The new unsafe method actually runs faster than the native method (pretty cool).

Profile http://www.rghware.com/profile.png

It's obvious that the array bound checking is the cause of the 20% slow down in this FFT method. Due to it's nature, the for-loops in this method cannot be optimized.

Thanks everyone for the help.

I uploaded the source again (this time with the Class Libraries. — Robert H., May 19 '09 at 16:20
How are you doing the test for speed comparison? Are you running in release outside of a debugger (important), as well as running multiple times (to make sure you're not getting any JIT issues)? — Reed Copsey, May 19 '09 at 16:26
I'm running it in release outside the IDE. I'm using System.Diagnostics.Stopwatch to test the speed of these function. I put the results on the form so I can watch them and toggle back and forth via radio buttons. The function is performed basically continuously in half second intervals on the data coming into the sound card. I've running the test many, many times. — Robert H., May 19 '09 at 16:40
I installed one of the profilers (jet brains). While it's nice to quickly see what's eating my CPU, it didn't reveal anything I didn't already learn via playing with System.Diagnostics.Stopwatch. These this FFT function is the bottle neck. The unsafe flag erks me, but I think I'll check it out just to see what happens. — Robert H., May 19 '09 at 17:41
Can you make your speed tests more granule? Determine what part of the C# code is running slower? For example, I'd consider Math.Sqrt() as a potential bottleneck, depending how it is implemented. Also, the assignments to data.ExtremeReal and data.IndexExtremeImag are invoking property setters? Those will be method calls, which means pushing and popping the stack and jumps, unless the jitter inlines them. Maybe assign that data to arrays and have property setters that take those arrays and assigns them to the backing storage inside the DataSet, instead of a property set on each iteration. — Grant Wagner, May 19 '09 at 22:24
@Robert Hamilton: Try switching JetBrains to using the tracing profiler instead of the sampling profiler. It will run MUCH slower, but you can get a very different level of information about bottlenecks. AQTime is another tracing-only profiler (which I prefer, actually) for this type of work. Also, did my last suggestion (see my answer) help? — Reed Copsey, May 20 '09 at 15:55
Sorry about that. The site was hiding one of the comments. But, you hit it exactly with the long/int issue. I updated the post with the results. — Robert H., May 20 '09 at 18:16

score 23 · Accepted Answer · edited May 23 '17 at 12:02

23

Just looking at this code, I'd suspect from my experience a fairly significant slowdown going from C++ -> C#.

One major issue you're going to face in a naive port of a routine like this to C# is that C# is going to add bounds checking on every array check here. Since you're never looping through the arrays in a way that will get optimized (see this question for details), just about every array access is going to receive bounds checking.

In addition, this port is pretty close to a 1->1 mapping from C. If you run this through a good .NET profiler, you'll probably find some great spots that can be optimized to get this back to near C++ speed with one or two tweaks (that's nearly always been my experience in porting routines like this).

If you want to get this to be at nearly identical speed, though, you'll probably need to convert this to unsafe code and use pointer manipulation instead of directly setting the arrays. This will eliminate all of the bounds checking issues, and get your speed back.

Edit: I see one more huge difference, which may be the reason your C# unsafe code is running slower.

Check out this page about C# compared to C++, in particular:

"The long type: In C#, the long type is 64 bits, while in C++, it is 32 bits."

You should convert the C# version to use int, not long. In C#, long is a 64bit type. This may actually have a profound effect on your pointer manipulation, because I believe you are inadvertantly adding a long->int conversion (with overflow checking) on every pointer call.

Also, while you're at it, you may want to try wrapping the entire function in an unchecked block. C++ isn't doing the overflow checking you're getting in C#.

edited May 23 '17 at 12:02

Community

1
1

answered May 19 '09 at 16:17

Reed Copsey

554,122
78
1,158
1,373

Rico Mariani about bounds checking: http://blogs.msdn.com/ricom/archive/2006/07/12/663642.aspx – VVS May 19 '09 at 17:34
@David: I love Rico Mariani's blog, but in this case, it's not a fair comparison. He was comparing the speed reduction in a typical GUI application, not a pure number crunching situation. Even then, he was seeing a 3% drop overall from array bounds checking. This routine is nearly all array access, though, so the number will not be buffered by the other computations. I'd expect much higher than the 3% drop (.6% if you remove the lowest 10%) – Reed Copsey May 19 '09 at 17:55
I converted my code to an unsafe method that uses pointers to manipulate the data (see my latest edit in the body of the question.) The results were not what I expected. Is there something wrong with my unsafe implementation? I can post the new code if needed. – Robert H. May 19 '09 at 19:55
@Robert Hamilton: See my edit. I believe the "long" declaration is changing the behavior here. – Reed Copsey May 20 '09 at 01:31
@Reed Copsey: Good call! Pointer manipulation via long integer was the problem. I changed in the unsafe methods to ints and it's running really fast right now. I updated the post with the new profiler results - impressive. – Robert H. May 20 '09 at 18:11
Glad that we figured it out... this has matched my exp. in the past. :) – Reed Copsey May 20 '09 at 21:08

score 3 · Answer 2 · answered May 19 '09 at 16:16

This is most likely due to the JIT compiler generating code that's not as efficient as the code generated by the native compiler.

Profiling the code should probably be your next step if you care about the 20% decrease in performance, or you might consider using an off the shelf optimized library.

score 3 · Answer 3 · answered May 19 '09 at 17:27

3

The native compiler can do much deeper and heavier optimizations than a JIT compiler, like vectorization, interprocedural analysis, etc. And FFTs can get great speedups with vectorization.

answered May 19 '09 at 17:27

Bruno Coutinho

346
2
7

score 1 · Answer 4 · answered May 19 '09 at 16:19

1

Considering that the managed code does bounds checking on the index of every array access, which the unmanaged code doesn't do, I would say that the difference is smaller than I expected.

If you change the arrays to pointers in the managed code too (as that is what they really are in the unmanaged code), I would expect them to perform about the same.

answered May 19 '09 at 16:19

Guffa

687,336
108
737
1,005

note that there are cases where the array bounds checking can be optimized away (such as when limit is explicitly < array.Length), but not here – Lucas May 19 '09 at 17:30

score 1 · Answer 5 · answered May 20 '09 at 05:34

I just ran the code he posted with int instead of long and it did not really make a difference. I know other people have had better luck with FFT in .NET, showing that .NET can reach or exceed the proformance of C++ even with FFT math.

So my answer, either the poster's code is more optimized (for C) then the one in the link, or it is less optimized for C# than the one in the article I linked.

I performed two sets of tests on two machines with .NET 2.0. One machine had XPSP2 and had a single processor, 850MHz Pentium III, with 512Mb of RAM. The other machine had build 5321 of Vista and had a single processor, 2 GHz Mobile Pentium 4, with 1Gb of RAM. In each case I calculated the average of 100 separate FFT calculations on 217 (131072) data values. From these values I calculated the standard error from the standard deviation.

The results are shown in ms. The results for the Pentium III machine are:
  Not Optimized   Optimized For Space Optimized For Speed
Unmanaged   92.88 ± 0.09    88.23 ± 0.09    68.48 ± 0.03
Managed C++ 72.89 ± 0.03    72.26 ± 0.04    71.35 ± 0.06
C++/CLI 73.00 ± 0.05    72.32 ± 0.03    71.44 ± 0.04
C# Managed  72.21 ± 0.04    69.97 ± 0.08
The results for the Mobile Pentium 4 are:
          Not Optimized   Optimized For Space Optimized For Speed
Unmanaged   45.2 ± 0.1  30.04 ± 0.04    23.06 ± 0.04
Managed C++ 23.5 ± 0.1  23.17 ± 0.08    23.36 ± 0.07
C++/CLI 23.5 ± 0.1  23.11 ± 0.07    23.80 ± 0.05
C# Managed  23.7 ± 0.1  22.78 ± 0.03

I cannot seem to duplicate this: Average for native code: 0.007396 seconds +/- 0.000017 Average for managed C# code: 0.008422 seconds +/- 0.000027 I'm still getting a %18-19 speed increase on my machine using the native code (his not mine). I'm going put in on my laptop to see if there is any difference there. — Robert H., May 20 '09 at 14:31
Make me wonder what improvements in OS and processor designs have had on those numbers in general. — JasonRShaver, May 20 '09 at 19:09

score 0 · Answer 6 · answered May 19 '09 at 16:08

Did you use a profiler like AQTime to see where the bottle neck is? Sometimes it's a trivial thing when translating native to managed code. On the other hand, because managed code in some scenarios is slower than native code, you might want to try unsafe code instead.

score 0 · Answer 7 · answered Jun 13 '09 at 11:39

0

Because C# .NET compiler is not the best in producing efficient code. And the entire logic of the language prevents from that. BTW, F# has much better performance than C# in Math

answered Jun 13 '09 at 11:39

Rinat Abdullin

23,036
8
57
80

Why would I see ~20% speed increase using native code?

7 Answers7

Linked