3

In C#, there are some different ways to copy the elements of an array to another. To the best of my knowledge, they are "For" loop, Array.CopyTo, Span<T>.CopyTo, T[].CopyTo and Buffer.BlockCopy.

Since looping to copy the elements is always the slowest way, I skip it and run benchmark test for the other four methods. However, it seems that the speed of them are related with the length of the array, which really confused me.

My code of benchmark test is shown below. My experiment environment is Windows 11, .NET 6, Intel 12700 CPU, 64bits, using "BenchmarkDotnet" as the benchmark test framework.

public class UnitTest1
{
    static readonly int times = 1000;
    static readonly int arrayLength = 8;
    int[] src = GetRandomArray(arrayLength);
    int[] dst = new int[arrayLength];
    public static int[] GetRandomArray(int length)
    {
        int[] array = new int[length];
        for (int i = 0; i < length; i++)
        {
            array[i] = new Random(DateTime.Now.Millisecond).Next(int.MinValue, int.MaxValue);
        }
        System.Threading.Thread.Sleep(2000);
        return array;
    }

    [Benchmark]
    public void TestArrayCopy()
    {
        for (var j = 0; j < times; j++)
        {
            src.CopyTo(dst, 0);
        }
    }
    [Benchmark]
    public void TestSingleSpanCopy()
    {
        var dstSpan = dst.AsSpan();
        for (var j = 0; j < times; j++)
        {
            src.CopyTo(dstSpan);
        }
    }
    [Benchmark]
    public void TestDoubleSpanCopy()
    {
        var srcSpan = src.AsSpan();
        var dstSpan = dst.AsSpan();
        for (var j = 0; j < times; j++)
        {
            srcSpan.CopyTo(dstSpan);
        }
    }
    [Benchmark]
    public void BufferCopy()
    {
        for (var j = 0; j < times; j++)
        {
            System.Buffer.BlockCopy(src, 0, dst, 0, sizeof(int) * src.Length);
        }
    }
}

Here are the test results.

times = 1000, arrayLength = 8

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|      TestArrayCopy | 3.061 us | 0.0370 us | 0.0543 us |
| TestSingleSpanCopy | 1.297 us | 0.0041 us | 0.0038 us |
| TestDoubleSpanCopy | 1.113 us | 0.0190 us | 0.0203 us |
|         BufferCopy | 7.162 us | 0.1250 us | 0.1044 us |

times = 1000, arrayLength = 16

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|      TestArrayCopy | 3.426 us | 0.0677 us | 0.0806 us |
| TestSingleSpanCopy | 1.609 us | 0.0264 us | 0.0206 us |
| TestDoubleSpanCopy | 1.478 us | 0.0228 us | 0.0202 us |
|         BufferCopy | 7.465 us | 0.0866 us | 0.0723 us |

times = 1000, arrayLength = 32

|             Method |      Mean |     Error |    StdDev |    Median |
|------------------- |----------:|----------:|----------:|----------:|
|      TestArrayCopy |  4.063 us | 0.0417 us | 0.0390 us |  4.076 us |
| TestSingleSpanCopy |  4.115 us | 0.3552 us | 1.0473 us |  4.334 us |
| TestDoubleSpanCopy |  3.576 us | 0.3391 us | 0.9998 us |  3.601 us |
|         BufferCopy | 12.922 us | 0.7339 us | 2.1640 us | 13.814 us |

times = 1000, arrayLength = 128

|             Method |      Mean |     Error |    StdDev |    Median |
|------------------- |----------:|----------:|----------:|----------:|
|      TestArrayCopy |  7.865 us | 0.0919 us | 0.0815 us |  7.842 us |
| TestSingleSpanCopy |  7.036 us | 0.2694 us | 0.7900 us |  7.256 us |
| TestDoubleSpanCopy |  7.351 us | 0.0914 us | 0.0855 us |  7.382 us |
|         BufferCopy | 10.955 us | 0.1157 us | 0.1083 us | 10.947 us |

times = 1000, arrayLength = 1024

|             Method |     Mean |    Error |    StdDev |   Median |
|------------------- |---------:|---------:|----------:|---------:|
|      TestArrayCopy | 45.16 us | 3.619 us | 10.670 us | 48.95 us |
| TestSingleSpanCopy | 36.85 us | 3.608 us | 10.638 us | 34.77 us |
| TestDoubleSpanCopy | 38.88 us | 3.378 us |  9.960 us | 39.91 us |
|         BufferCopy | 48.83 us | 4.352 us | 12.833 us | 53.65 us |

times = 1000, arrayLength = 16384

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|      TestArrayCopy | 1.417 ms | 0.1096 ms | 0.3233 ms |
| TestSingleSpanCopy | 1.487 ms | 0.1012 ms | 0.2983 ms |
| TestDoubleSpanCopy | 1.438 ms | 0.1115 ms | 0.3287 ms |
|         BufferCopy | 1.423 ms | 0.1147 ms | 0.3383 ms |

times = 100, arrayLength = 65536

|             Method |     Mean |    Error |    StdDev |
|------------------- |---------:|---------:|----------:|
|      TestArrayCopy | 630.9 us | 47.01 us | 138.61 us |
| TestSingleSpanCopy | 629.5 us | 46.83 us | 138.08 us |
| TestDoubleSpanCopy | 655.4 us | 47.23 us | 139.25 us |
|         BufferCopy | 419.0 us |  3.31 us |   2.93 us |

When the arrayLength is 8 or 16, the Span<T>.CopyTo() is the fastest. When the arrayLength is 32 or 128, the first three way are almost the same and all faster than Buffer.BlockCopy.Ehen the arrayLength is 1024, however, the Span<T>.CopyTo and T[].CopyTo are again faster than the other two ways. When the arrayLength is 16384, these four ways are almost the same. But when the arrayLength is 65536, the Buffer.BlockCopy is the fastest! Besides, the Span<T>.CopyTo here is a bit slower than the first two ways.

I really can't understand the results. At first I guess it's the cpu cache that matters. However, the L1 Cache of my CPU is 960KB, which is larger than the space of the array of any test case. Maybe it's the different implementation that causes this?

I will appreciate it if you are willing to explain it for me or discuss with me. I will also think about it and update the question if I get an idea.


As @Ralf mentioned, the source and destination of the array in each time are all the same, which could impact on the results. I modified my code and tried the test again, as is shown below. To avoid the time consume, I just declare a new array each time instead of randomize it manually.

using System.Buffers;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run(typeof(Program).Assembly);
        Console.WriteLine(summary);
    }
}
public class UnitTest1
{
    static readonly int times = 1000;
    static readonly int arrayLength = 8;
    public static int[] GetRandomArray(int length)
    {
        int[] array = new int[length];
        //for (int i = 0; i < length; i++)
        //{
        //    array[i] = new Random(DateTime.Now.Millisecond).Next(int.MinValue, int.MaxValue);
        //}
        return array;
    }

    [Benchmark]
    public void ArrayCopy()
    {
        for (var j = 0; j < times; j++)
        {
            int[] src = GetRandomArray(arrayLength);
            int[] dst = new int[arrayLength];
            src.CopyTo(dst, 0);
        }
    }
    [Benchmark]
    public void SingleSpanCopy()
    {
        for (var j = 0; j < times; j++)
        {
            int[] src = GetRandomArray(arrayLength);
            int[] dst = new int[arrayLength];
            src.CopyTo(dst.AsSpan());
        }
    }
    [Benchmark]
    public void DoubleSpanCopy()
    {
        for (var j = 0; j < times; j++)
        {
            int[] src = GetRandomArray(arrayLength);
            int[] dst = new int[arrayLength];
            src.AsSpan().CopyTo(dst.AsSpan());
        }
    }
    [Benchmark]
    public void BufferCopy()
    {
        for (var j = 0; j < times; j++)
        {
            int[] src = GetRandomArray(arrayLength);
            int[] dst = new int[arrayLength];
            System.Buffer.BlockCopy(src, 0, dst, 0, sizeof(int) * src.Length);
        }
    }
}

times = 1000, arrayLength = 8

|         Method |      Mean |     Error |    StdDev |    Median |
|--------------- |----------:|----------:|----------:|----------:|
|      ArrayCopy |  8.843 us | 0.1762 us | 0.3040 us |  8.843 us |
| SingleSpanCopy |  6.864 us | 0.1366 us | 0.1519 us |  6.880 us |
| DoubleSpanCopy | 10.543 us | 0.9496 us | 2.7999 us | 10.689 us |
|     BufferCopy | 21.270 us | 1.3477 us | 3.9738 us | 22.630 us |

times = 1000, arrayLength = 16

|         Method |     Mean |    Error |   StdDev |   Median |
|--------------- |---------:|---------:|---------:|---------:|
|      ArrayCopy | 16.94 us | 0.952 us | 2.808 us | 17.27 us |
| SingleSpanCopy | 12.54 us | 1.054 us | 3.109 us | 12.32 us |
| DoubleSpanCopy | 13.23 us | 0.930 us | 2.741 us | 13.25 us |
|     BufferCopy | 23.43 us | 1.218 us | 3.591 us | 24.99 us |

times = 1000, arrayLength = 32

|         Method |     Mean |    Error |   StdDev |   Median |
|--------------- |---------:|---------:|---------:|---------:|
|      ArrayCopy | 24.35 us | 1.774 us | 5.229 us | 26.23 us |
| SingleSpanCopy | 20.64 us | 1.726 us | 5.089 us | 21.09 us |
| DoubleSpanCopy | 19.97 us | 1.915 us | 5.646 us | 20.08 us |
|     BufferCopy | 26.24 us | 2.547 us | 7.511 us | 24.59 us |

times = 1000, arrayLength = 128

|         Method |     Mean |    Error |   StdDev |
|--------------- |---------:|---------:|---------:|
|      ArrayCopy | 39.11 us | 0.529 us | 0.495 us |
| SingleSpanCopy | 39.14 us | 0.782 us | 1.070 us |
| DoubleSpanCopy | 40.24 us | 0.798 us | 1.398 us |
|     BufferCopy | 42.20 us | 0.480 us | 0.426 us |

times = 1000, arrayLength = 1024

|         Method |     Mean |   Error |  StdDev |
|--------------- |---------:|--------:|--------:|
|      ArrayCopy | 254.6 us | 4.92 us | 8.87 us |
| SingleSpanCopy | 241.4 us | 2.98 us | 2.78 us |
| DoubleSpanCopy | 243.7 us | 4.75 us | 4.66 us |
|     BufferCopy | 243.0 us | 2.85 us | 2.66 us |

times = 1000, arayLength = 16384

|         Method |     Mean |     Error |    StdDev |
|--------------- |---------:|----------:|----------:|
|      ArrayCopy | 4.325 ms | 0.0268 ms | 0.0250 ms |
| SingleSpanCopy | 4.300 ms | 0.0120 ms | 0.0112 ms |
| DoubleSpanCopy | 4.307 ms | 0.0348 ms | 0.0325 ms |
|     BufferCopy | 4.293 ms | 0.0238 ms | 0.0222 ms |

times = 100, arrayLength = 65536

|         Method |     Mean |    Error |   StdDev |   Median |
|--------------- |---------:|---------:|---------:|---------:|
|      ArrayCopy | 153.6 ms |  1.46 ms |  1.29 ms | 153.1 ms |
| SingleSpanCopy | 213.4 ms |  8.78 ms | 25.87 ms | 218.2 ms |
| DoubleSpanCopy | 221.2 ms |  9.51 ms | 28.04 ms | 229.7 ms |
|     BufferCopy | 203.1 ms | 10.92 ms | 32.18 ms | 205.6 ms |

@Ralf is right, there is indeed some differences. The most significant one is that when arrayLength = 65536, Array.Copy instead of Buffer.BlockCopy is the fastest.

But still, the results are very confusing..

Rinne
  • 53
  • 4
  • Check this post [What you need to use in which cases](https://stackoverflow.com/a/29263914/4423545) – Reza Heidari Apr 13 '22 at 11:22
  • Before thinking about the hardware impact you should think about the framework impact(But still all results will only be valid for your actual setup and can't be moved to a different machine and thinking it will behave the same). What i see is that you override the destination array multiple times. So the previous now no longer referenced array runs into maintenance of the GC. And different sized objects are handled differently. So parts of the measured times might be attributed to the framework and your test setup and not the actual method under test. – Ralf Apr 13 '22 at 11:47
  • Also you shouldn't do the same thing over and over in the same context. You should at least look at the first time you call the method under test individually. Redoing things might behave differently (Caching effects for example). – Ralf Apr 13 '22 at 11:48
  • @Ralf the OP is using BenchmarkDotnet which addresses these issues. – Matthew Watson Apr 13 '22 at 11:59
  • As far as I can see from looking at [the `Span.CopyTo()` source](https://source.dot.net/#System.Private.CoreLib/Span.cs,d2517139cac388e8), it's implemented using `Buffer.memmove()`, which is also used by `Buffer.BlockCopy()`, so it's mysterious that the performance is so different. – Matthew Watson Apr 13 '22 at 12:22
  • @MatthewWatson Sure? I can't think of how a framework can remove the impact of having a loop inside the testmethod and its implication on the overwritten references and GC'ing. Or does that benchmarking framework disable the GC somehow while benchmarking a method? – Ralf Apr 13 '22 at 12:50
  • @Ralf I don't know how it works, but it does generate code and suchlike and run things multiple times - it's pretty complicated. – Matthew Watson Apr 13 '22 at 12:54
  • @Ralf But I agree, it would be good to change those benchmarks to avoid the GC (by creating a single buffer once only for all the tests) – Matthew Watson Apr 13 '22 at 13:03
  • @Rinne You maybe want to retry your test with a predefined size of the destination array. Objects of different sizes life in different heaps. And your last test presumably uses the large object heap while the other test use the normal heap (i think the boundary was 85KB but may be framework/system dependant) So if you run all test while using the same heap and your result are more in line with your expectations you at least know about one factor then. – Ralf Apr 13 '22 at 13:07
  • @Ralf You're correct I think - I tried the test with a static single array, and the times are almost identical for the tests. – Matthew Watson Apr 13 '22 at 13:20
  • You could also check this post [Array.Copy vs Buffer.BlockCopy](https://stackoverflow.com/a/33865267/18307736) – Gabriel Apr 13 '22 at 15:53
  • @Ralf The location of those arrays in memory should not affect this benchmark. Two 64k int arrays (so 512kB in total) are small enough to fit in OP's CPU's L1 cache, so this benchmark will most likely not be affected by RAM. – Petrusion Apr 13 '22 at 22:52
  • Thank you all! I will try the test again later and update the description. @Ralf Do you mean that I should create a new array as the destination every times? I understand that if I visit the array again and again during a short time, it's actually in the cpu cache instead of the memory. But I also worry that if I create a new array each time, the GC will affect the result. – Rinne Apr 14 '22 at 03:40
  • @Petrusion: The OP wrote *the L1 Cache of my CPU is 960KB*, but that only makes any sense if they were multi-threading to take advantage of all the per-core L1d caches on every core. It makes very little sense to add up the total sizes of per-core private caches across cores. Alder Lake has 48kiB L1d caches on each P core, 32k on each E core. (And 1.25MiB L2 per P core, 2M per group of 4 E-cores). So even the larger copies will get L2 hits, which is pretty good bandwidth. – Peter Cordes Apr 14 '22 at 12:14
  • The version of the benchmark that randomizes and copies is now basically benchmarking the RNG. Copying is trivially cheap compared to that, unless deallocation actually gives it back to the OS so each fresh allocation has to get new memory from the OS and do a page fault. Your benchmark times are about 1000x slower for the 64k version, vs. 10x to 4x slower for the other versions, so that might be what's happening. – Peter Cordes Apr 14 '22 at 12:21
  • @PeterCordes Ah, yes, of course, my bad, that was stupid of me. – Petrusion Apr 14 '22 at 17:01
  • @PeterCordes Yes, I think you are right, the second benchmark time is obviously sereval times longer than the first. How should I do this if I want to avoid this? If I just use the first version of the code, as Ralf mentioned, the destination of each time is the same, which may impact on the result. – Rinne Apr 15 '22 at 13:22
  • Are you trying to measure the cache-hit case or not? If not, maybe copy to different parts of the destination, so it's like a big copy but done in multiple small copies. There is no one single number that's meaningful for all cache-cold cases; real-world use-cases differ in whether you'll get a page fault or not, and in what level of cache the src and dst are hot in, or none at all. And for tiny copies, what surrounding code can overlap execution with it. So if you want to model some scenario, you have to pick a specific one. – Peter Cordes Apr 15 '22 at 13:25

1 Answers1

1

Are you sure you can repeat the same benchmark and get the same results? Perhaps it was just a one time occurence, maybe caused by heat issues or another app taking processor time. When I try it on my machine, the values I get are more in line with what you'd expect.

It says Windows 10 for some reason, I'm using 11 too.

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
11th Gen Intel Core i9-11980HK 2.60GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.201
  [Host]     : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
  DefaultJob : .NET 6.0.3 (6.0.322.12309), X64 RyuJIT


|             Method |     Mean |   Error |  StdDev |
|------------------- |---------:|--------:|--------:|
|      TestArrayCopy | 466.6 us | 0.69 us | 0.61 us |
| TestSingleSpanCopy | 444.7 us | 1.07 us | 1.00 us |
| TestDoubleSpanCopy | 443.8 us | 0.62 us | 0.52 us |
|         BufferCopy | 447.1 us | 7.28 us | 6.08 us |

Just before posting this, I realized: Your CPU, 12700, has performance and efficiency cores. What if it ran most of the benchmark on efficiency cores and just so happened to run the BufferCopy part on performance cores? Can you try disabling your efficiency cores in BIOS?

Petrusion
  • 940
  • 4
  • 11
  • 1
    Instead of disabling E-cores, another approach would be to set an affinity mask that only includes P-cores. (Or only E-cores to benchmark those.) – Peter Cordes Apr 14 '22 at 12:11