Unroll the loop 2-8 times. Measure which one is best. The .NET JIT optimizes poorly, so you have to do some of its work.
You'll probably have to add unsafe
as well because the JIT will now be unable to optimize out the array bounds checks.
You can also try to aggregate into multiple sum variables:
int sum1 = 0, sum2 = 0;
for (int i = 0; i < array.Length; i+=2) {
sum1 += array[i+0];
sum2 += array[i+1];
}
That might increase instruction-level parallelism because all add
instructions are now independent.
The i+0
is optimized to i
automatically.
I tested it and it shaved off about 30%.
The timings are stable when repeated. Code:
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
var watch = new Stopwatch();
int[] array = new int[500000000];
for (int i = 0; i < array.Length; i++)
{
array[i] = 1;
}
//warmup
{
watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
sum += array[i];
}
for (int i2 = 0; i2 < 5; i2++)
{
{
watch.Restart();
int sum = 0;
for (int i = 0; i < array.Length; i++)
sum += array[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
}
{
watch.Restart();
fixed (int* ptr = array)
{
int sum = 0;
var length = array.Length;
for (int i = 0; i < length; i++)
sum += ptr[i];
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + sum);
}
}
{
watch.Restart();
fixed (int* ptr = array)
{
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
int sum4 = 0;
var length = array.Length;
for (int i = 0; i < length; i += 4)
{
sum1 += ptr[i + 0];
sum2 += ptr[i + 1];
sum3 += ptr[i + 2];
sum4 += ptr[i + 3];
}
Console.WriteLine("for loop:" + watch.ElapsedMilliseconds + "ms, result:" + (sum1 + sum2 + sum3 + sum4));
}
}
Console.WriteLine("===");
}
Further playing around, it turns out that multiple aggregation variables do nothing. Unrolling the loop did a major improvement, though. Unsafe did nothing (except in the unrolled case where it is pretty much required). Unrolling 2 times is as good as 4.
Running this on a Core i7.