33

I have three arrays that need to be combined in one three-dimension array. The following code shows slow performance in Performance Explorer. Is there a faster solution?

for (int i = 0; i < sortedIndex.Length; i++) {
    if (i < num_in_left)
    {    
        // add instance to the left child
        leftnode[i, 0] = sortedIndex[i];
        leftnode[i, 1] = sortedInstances[i];
        leftnode[i, 2] = sortedLabels[i];
    }
    else
    { 
        // add instance to the right child
        rightnode[i-num_in_left, 0] = sortedIndex[i];
        rightnode[i-num_in_left, 1] = sortedInstances[i];
        rightnode[i-num_in_left, 2] = sortedLabels[i];
    }                    
}

Update:

I'm actually trying to do the following:

//given three 1d arrays
double[] sortedIndex, sortedInstances, sortedLabels;
// copy them over to a 3d array (forget about the rightnode for now)
double[] leftnode = new double[sortedIndex.Length, 3];
// some magic happens here so that
leftnode = {sortedIndex, sortedInstances, sortedLabels};
Robert Harvey
  • 178,213
  • 47
  • 333
  • 501
Yang
  • 6,682
  • 20
  • 64
  • 96
  • 2
    I was going to suggest unsafe code, but then found this: http://stackoverflow.com/questions/85479/c-unsafe-fixed-code. `Array.Copy`, as Marlon suggests is probably the best way to go. – dariom Feb 24 '11 at 02:14

5 Answers5

76

Use Buffer.BlockCopy. Its entire purpose is to perform fast (see Buffer):

This class provides better performance for manipulating primitive types than similar methods in the System.Array class.

Admittedly, I haven't done any benchmarks, but that's the documentation. It also works on multidimensional arrays; just make sure that you're always specifying how many bytes to copy, not how many elements, and also that you're working on a primitive array.

Also, I have not tested this, but you might be able to squeeze a bit more performance out of the system if you bind a delegate to System.Buffer.memcpyimpl and call that directly. The signature is:

internal static unsafe void memcpyimpl(byte* src, byte* dest, int len)

It does require pointers, but I believe it's optimized for the highest speed possible, and so I don't think there's any way to get faster than that, even if you had assembly at hand.


Update:

Due to requests (and to satisfy my curiosity), I tested this:

using System;
using System.Diagnostics;
using System.Reflection;

unsafe delegate void MemCpyImpl(byte* src, byte* dest, int len);

static class Temp
{
    //There really should be a generic CreateDelegate<T>() method... -___-
    static MemCpyImpl memcpyimpl = (MemCpyImpl)Delegate.CreateDelegate(
        typeof(MemCpyImpl), typeof(Buffer).GetMethod("memcpyimpl",
            BindingFlags.Static | BindingFlags.NonPublic));
    const int COUNT = 32, SIZE = 32 << 20;

    //Use different buffers to help avoid CPU cache effects
    static byte[]
        aSource = new byte[SIZE], aTarget = new byte[SIZE],
        bSource = new byte[SIZE], bTarget = new byte[SIZE],
        cSource = new byte[SIZE], cTarget = new byte[SIZE];


    static unsafe void TestUnsafe()
    {
        Stopwatch sw = Stopwatch.StartNew();
        fixed (byte* pSrc = aSource)
        fixed (byte* pDest = aTarget)
            for (int i = 0; i < COUNT; i++)
                memcpyimpl(pSrc, pDest, SIZE);
        sw.Stop();
        Console.WriteLine("Buffer.memcpyimpl: {0:N0} ticks", sw.ElapsedTicks);
    }

    static void TestBlockCopy()
    {
        Stopwatch sw = Stopwatch.StartNew();
        sw.Start();
        for (int i = 0; i < COUNT; i++)
            Buffer.BlockCopy(bSource, 0, bTarget, 0, SIZE);
        sw.Stop();
        Console.WriteLine("Buffer.BlockCopy: {0:N0} ticks",
            sw.ElapsedTicks);
    }

    static void TestArrayCopy()
    {
        Stopwatch sw = Stopwatch.StartNew();
        sw.Start();
        for (int i = 0; i < COUNT; i++)
            Array.Copy(cSource, 0, cTarget, 0, SIZE);
        sw.Stop();
        Console.WriteLine("Array.Copy: {0:N0} ticks", sw.ElapsedTicks);
    }

    static void Main(string[] args)
    {
        for (int i = 0; i < 10; i++)
        {
            TestArrayCopy();
            TestBlockCopy();
            TestUnsafe();
            Console.WriteLine();
        }
    }
}

The results:

Buffer.BlockCopy: 469,151 ticks
Array.Copy: 469,972 ticks
Buffer.memcpyimpl: 496,541 ticks

Buffer.BlockCopy: 421,011 ticks
Array.Copy: 430,694 ticks
Buffer.memcpyimpl: 410,933 ticks

Buffer.BlockCopy: 425,112 ticks
Array.Copy: 420,839 ticks
Buffer.memcpyimpl: 411,520 ticks

Buffer.BlockCopy: 424,329 ticks
Array.Copy: 420,288 ticks
Buffer.memcpyimpl: 405,598 ticks

Buffer.BlockCopy: 422,410 ticks
Array.Copy: 427,826 ticks
Buffer.memcpyimpl: 414,394 ticks

Now change the order:

Array.Copy: 419,750 ticks
Buffer.memcpyimpl: 408,919 ticks
Buffer.BlockCopy: 419,774 ticks

Array.Copy: 430,529 ticks
Buffer.memcpyimpl: 412,148 ticks
Buffer.BlockCopy: 424,900 ticks

Array.Copy: 424,706 ticks
Buffer.memcpyimpl: 427,861 ticks
Buffer.BlockCopy: 421,929 ticks

Array.Copy: 420,556 ticks
Buffer.memcpyimpl: 421,541 ticks
Buffer.BlockCopy: 436,430 ticks

Array.Copy: 435,297 ticks
Buffer.memcpyimpl: 432,505 ticks
Buffer.BlockCopy: 441,493 ticks

Now change the order again:

Buffer.memcpyimpl: 430,874 ticks
Buffer.BlockCopy: 429,730 ticks
Array.Copy: 432,746 ticks

Buffer.memcpyimpl: 415,943 ticks
Buffer.BlockCopy: 423,809 ticks
Array.Copy: 428,703 ticks

Buffer.memcpyimpl: 421,270 ticks
Buffer.BlockCopy: 428,262 ticks
Array.Copy: 434,940 ticks

Buffer.memcpyimpl: 423,506 ticks
Buffer.BlockCopy: 427,220 ticks
Array.Copy: 431,606 ticks

Buffer.memcpyimpl: 422,900 ticks
Buffer.BlockCopy: 439,280 ticks
Array.Copy: 432,649 ticks

or, in other words: they're very competitive; as a general rule, memcpyimpl is fastest, but it's not necessarily worth worrying about.

user541686
  • 205,094
  • 128
  • 528
  • 886
  • 2
    C'mon, man, benchmark it! I always assumed Buffer.BlockCopy is faster, but I'm not sure anymore. `Hans Passant` (way down in the page) claims the exact same CLR code executes for both: http://social.msdn.microsoft.com/Forums/en-US/netfxbcl/thread/e3a08e63-7188-4f87-bb0a-fed6c8acf553 – MusiGenesis Feb 24 '11 at 02:26
  • I'm curious to know if your last suggestion works, and how well if so. – MusiGenesis Feb 24 '11 at 02:27
  • @MusiGenesis: I guess `memcpyimpl` is the way to go then? (Not that I've benchmarked that either, though I have used it before. I'll benchmark it right now.) – user541686 Feb 24 '11 at 02:28
  • What happens if you change the order of the runs (like do Unsafe/Array/Block etc.)? Also, this is compiled and not running in the debugger? – MusiGenesis Feb 24 '11 at 04:18
  • Oh shoot... it's not attached to the debugger, no, but it's a debug build; I forgot to change it. I'll rerun it. As for changing the order, it made a little bit of a difference, but not much; I think the ranking was the same... lemme try that again. – user541686 Feb 24 '11 at 04:20
  • Thanks for doing this, by the way. Fortunately, it looks like the unsafe method isn't faster enough to justify rewriting a bunch of array-copying code. – MusiGenesis Feb 24 '11 at 04:22
  • @MusiGenesis: Yup, I agree. It's a tiny bit faster but occasionally slower, and not worth the optimization IMHO. – user541686 Feb 24 '11 at 04:31
  • 5
    `Hans Passant`: answering StackOverflow questions before there even *was* a StackOverflow! I think that officially puts him in Jon Skeet territory. – MusiGenesis Feb 24 '11 at 04:35
  • 1
    It should be noted that the `memcpyimpl` method no longer exists in at least .NET 4.5.1, it's now called `Memcpy` and there are various overloads for it so you need to pass in the parameter types to resolve the method you want. – ChrisWue Oct 06 '14 at 20:18
  • Buffer.BlockCopy invokes the C stdlib memcpy, which in turn uses DMA (at least on x86 architecture). DMA is much faster than transferring data using the CPU. – Jonathan Dickinson Oct 02 '15 at 09:55
  • @JonathanDickinson: I believe you 100% made that up, because it's blatantly false. – user541686 Oct 02 '15 at 10:18
  • @Mehrdad - that's also 100% made up. I was told it, I didn't come up with it myself. Yes, I've just done the research and memcpy doesn't use DMA. Millions ways to tell me that, man... A *million* other ways. – Jonathan Dickinson Oct 02 '15 at 13:28
  • TestArrayCopy and TestBlockCopy methods have a redundant line of code. It is `sw.Start();`. TestUnsafe method don't have this line. – omerfarukdogan Sep 19 '16 at 17:54
  • Great answer... Thanks a lots – Eric Ouellet Jun 01 '17 at 13:29
11

You can use Array.Copy.

EDIT

Array.Copy does work for multidimensional arrays: see this topic.

AustinWBryan
  • 3,249
  • 3
  • 24
  • 42
Marlon
  • 19,924
  • 12
  • 70
  • 101
  • I looked at your link, but my situation is different. The source comes from three different 1d arrays. The dest array is an N by 3 array where each dimension contains one of the source array. – Yang Feb 24 '11 at 07:10
6

If running on .NET Core, you may consider using source.AsSpan().CopyTo(destination) (beware on Mono though).

          Method |  Job | Runtime |      Mean |     Error |    StdDev | Ratio | RatioSD |
---------------- |----- |-------- |----------:|----------:|----------:|------:|--------:|
       ArrayCopy |  Clr |     Clr |  60.08 ns | 0.8231 ns | 0.7699 ns |  1.00 |    0.00 |
        SpanCopy |  Clr |     Clr |  99.31 ns | 0.4895 ns | 0.4339 ns |  1.65 |    0.02 |
 BufferBlockCopy |  Clr |     Clr |  61.34 ns | 0.5963 ns | 0.5578 ns |  1.02 |    0.01 |
                 |      |         |           |           |           |       |         |
       ArrayCopy | Core |    Core |  63.33 ns | 0.6843 ns | 0.6066 ns |  1.00 |    0.00 |
        SpanCopy | Core |    Core |  47.41 ns | 0.5399 ns | 0.5050 ns |  0.75 |    0.01 |
 BufferBlockCopy | Core |    Core |  59.89 ns | 0.4713 ns | 0.3936 ns |  0.94 |    0.01 |
                 |      |         |           |           |           |       |         |
       ArrayCopy | Mono |    Mono | 149.82 ns | 1.6466 ns | 1.4596 ns |  1.00 |    0.00 |
        SpanCopy | Mono |    Mono | 347.87 ns | 2.0589 ns | 1.9259 ns |  2.32 |    0.02 |
 BufferBlockCopy | Mono |    Mono |  61.52 ns | 1.1691 ns | 1.0364 ns |  0.41 |    0.01 |
Honza R
  • 711
  • 7
  • 14
4

For primitive type arrays (like double) you can copy fast, even for multidimensional array with pointers.

In the code below I initialize a 2D array A[10,10] with the values 1 through 100. Then I copy these values into a 1D array B[100]

unsafe class Program
{ 
    static void Main(string[] args)
    {
        double[,] A = new double[10, 10];

        for(int i = 0; i < 10; i++)
        {
            for(int j = 0; j < 10; j++)
            {
                A[i, j] = 10 * i + j + 1;
            }
        }
        // A has { { 1 ,2 .. 10}, { 11, 12 .. 20}, .. { .. 99, 100} }
        double[] B = new double[10 * 10];

        if (A.Length == B.Length)
        {
            fixed (double* pA = A, pB = B)
            {
                for(int i = 0; i < B.Length; i++)
                {
                    pB[i] = pA[i];
                }
            }
            // B has {1, 2, 3, 4 .. 100}
        }
    }
}

How fast is it. My testing has shown it to be many times faster then native C# copy and Buffer.BlockCopy(). You try it for your case and let us know.

Edit 1 I compared copying with four methods. 1) Two Nested loops, 2) One Serial loop, 3) Pointers, 4) BlockCopy. I measured the # of copies per tick for various size arrays.

N =   10x  10 (cpy/tck) Nested = 50,  Serial = 33, Pointer =    100, Buffer =    16
N =   20x  20 (cpy/tck) Nested = 133, Serial = 40, Pointer =    400, Buffer =   400
N =   50x  50 (cpy/tck) Nested = 104, Serial = 40, Pointer =   2500, Buffer =  2500
N =  100x 100 (cpy/tck) Nested = 61,  Serial = 41, Pointer =  10000, Buffer =  3333
N =  200x 200 (cpy/tck) Nested = 84,  Serial = 41, Pointer =  40000, Buffer =  2666
N =  500x 500 (cpy/tck) Nested = 69,  Serial = 41, Pointer = 125000, Buffer =  2840
N = 1000x1000 (cpy/tck) Nested = 33,  Serial = 45, Pointer = 142857, Buffer =  1890
N = 2000x2000 (cpy/tck) Nested = 30,  Serial = 43, Pointer = 266666, Buffer =  1826
N = 5000x5000 (cpy/tck) Nested = 21,  Serial = 42, Pointer = 735294, Buffer =  1712

It is clear here who is the winner. Pointer copy is orders of magnitudes better than any other method.

Edit 2 Apparently I was unfairly taking advantage of a compiler/JIT optimization because when I moved the loops behind delegates to equalize the playing field the numbers changed dramatically.

N =   10x  10 (cpy/tck) Nested =  0, Serial =  0, Pointer =      0, Buffer =     0
N =   20x  20 (cpy/tck) Nested = 80, Serial = 14, Pointer =    100, Buffer =   133
N =   50x  50 (cpy/tck) Nested =147, Serial = 15, Pointer =    277, Buffer =  2500
N =  100x 100 (cpy/tck) Nested = 98, Serial = 15, Pointer =    285, Buffer =  3333
N =  200x 200 (cpy/tck) Nested =106, Serial = 15, Pointer =    272, Buffer =  3076
N =  500x 500 (cpy/tck) Nested =106, Serial = 15, Pointer =    276, Buffer =  3125
N = 1000x1000 (cpy/tck) Nested =101, Serial = 11, Pointer =    199, Buffer =  1396
N = 2000x2000 (cpy/tck) Nested =105, Serial =  9, Pointer =    186, Buffer =  1804
N = 5000x5000 (cpy/tck) Nested =102, Serial =  8, Pointer =    170, Buffer =  1673

The buffered copy is top here (thanks to @Mehrdad) with pointer copy second. The question now is why isn't pointer copy as fast as buffer methods?

AustinWBryan
  • 3,249
  • 3
  • 24
  • 42
John Alexiou
  • 28,472
  • 11
  • 77
  • 133
  • Buffer.BlockCopy uses internally memmove, which, I guess, copies any number of bytes in one assembler command. This should be much faster than using pointers, which copy in one instruction only 1 double and need to loop many times. – Peter Huber Jan 30 '16 at 16:27
  • Yes, it makes sense. It would make sense for the buffer to be smaller than CPU cache to keep things fast. I wonder if it adjusts according to the CPU architecture. – John Alexiou Jan 30 '16 at 16:49
  • Most likely the Buffer Copy uses data parallel SIMD/SSE CPU instructions to move up to 512-bits at a time, which moves eight doubles per instructions – DragonSpit Sep 05 '19 at 17:11
0

If a jagged array of the following form would work, then a copy could be avoided:

double[][] leftNode = new double[3][];
leftNode[0] = sortedIndex;
leftNode[1] = sortedInstances;
leftNode[2] = sortedLabels;
DragonSpit
  • 458
  • 4
  • 9