I've been investigating some performance issues in an Event Viewer application on our development site, and noticed something interesting in the algorithm. I then created a simplified test project to compare two algorithms. The program retrieves Windows event logs using the EventLog class and translates them into queryable EventLogItem entities.

This operation is performed and timed using two different loops. The first (backward) loop starts at the index of the last item in the list, translates the item and then decreases the index. The method is defined like this:

private static void TranslateLogsUsingBackwardLoop()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var originalLogs = EventLog.GetEventLogs();
    var translatedLogs = new List<EventLogItem>();

    Parallel.ForEach<EventLog>(originalLogs, currentLog =>
    {
        for (int index = currentLog.Entries.Count - 1; index >= 0; index--)
        {
            var currentEntry = currentLog.Entries[index];

            EventLogItem translatedEntry = new EventLogItem
            {
                MachineName = currentEntry.MachineName,
                LogName = currentLog.LogDisplayName,
                CreatedTime = currentEntry.TimeGenerated,
                Source = currentEntry.Source,
                Message = currentEntry.Message,
                Number = currentEntry.Index,
                Category = currentEntry.Category,
                Type = currentEntry.EntryType,
                InstanceID = currentEntry.InstanceId,
                User = currentEntry.UserName,
            };

            lock (translatedLogs)
            {
                translatedLogs.Add(translatedEntry);
            }
        }
    });

    stopwatch.Stop();

    Console.WriteLine("{0} logs were translated in {1} using backward loop.", translatedLogs.Count, stopwatch.Elapsed);
}

The second (forward) loop starts at index 0 and increments the index. This method is defined like this:

private static void TranslateLogsUsingForwardLoop()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var originalLogs = EventLog.GetEventLogs();
    var translatedLogs = new List<EventLogItem>();

    Parallel.ForEach<EventLog>(originalLogs, currentLog =>
    {
        for (int index = 0; index < currentLog.Entries.Count; index++)
        {
            var currentEntry = currentLog.Entries[index];

            EventLogItem translatedEntry = new EventLogItem
            {
                MachineName = currentEntry.MachineName,
                LogName = currentLog.LogDisplayName,
                CreatedTime = currentEntry.TimeGenerated,
                Source = currentEntry.Source,
                Message = currentEntry.Message,
                Number = currentEntry.Index,
                Category = currentEntry.Category,
                Type = currentEntry.EntryType,
                InstanceID = currentEntry.InstanceId,
                User = currentEntry.UserName,
            };

            lock (translatedLogs)
            {
                translatedLogs.Add(translatedEntry);
            }
        }
    });

    stopwatch.Stop();

    Console.WriteLine("{0} logs were translated in {1} using forward loop.", translatedLogs.Count, stopwatch.Elapsed);
}

And the main method:

static void Main(string[] args)
{
    TranslateLogsUsingForwardLoop();
    Console.WriteLine();
    Thread.Sleep(2000);
    TranslateLogsUsingBackwardLoop();
    Console.ReadLine();
}

This is what I get (I performed the test several times; the results are almost identical):

(Screenshot of the console output: the backward loop finishes noticeably faster than the forward loop.)

Note that the server I tested this on writes to the event log every second, which is why the number of translated logs differs between the two runs. So why is the backward loop faster? I initially thought it was because in the backward loop currentLog.Entries.Count is evaluated just once, whereas in the forward loop it has to be evaluated and compared against index on every iteration, but then again that doesn't seem like enough to explain the difference. Any ideas?
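For reference, the hypothesis about the loop bound can be written out in isolation. This is a minimal, self-contained sketch; it uses a plain List&lt;int&gt; in place of currentLog.Entries (an assumption for illustration), and omits the timing harness:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var items = Enumerable.Range(0, 1000).ToList();

// Forward loop: the bound expression (items.Count) is written so that it is
// evaluated on every iteration, unless the JIT can prove it never changes.
long sumForward = 0;
for (int i = 0; i < items.Count; i++)
    sumForward += items[i];

// Backward loop: items.Count is read exactly once, at initialization; every
// subsequent iteration only compares the index against 0.
long sumBackward = 0;
for (int i = items.Count - 1; i >= 0; i--)
    sumBackward += items[i];

Console.WriteLine(sumForward == sumBackward); // both directions visit the same elements
```

Both loops do identical work per element; the only structural difference is where the bound is read.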

Arian Motamedi
  • Two comments. When timing always include a burn-in loop. Do some work before the timing starts in order to get the processors warmed up (throttled correctly). And you should start the timer right before the `For` loop and not time the assignment operators. – John Alexiou Aug 24 '13 at 14:09
  • How many entries are there in your tests? Also have you tried with regular (sequential) loop? – John Alexiou Aug 24 '13 at 14:13

4 Answers

2

Old question, and this may not be the exact reason in this case, but there is a difference once the loops are compiled down to IL, assembly, or whatever your language's lower-level form happens to be. At a minimum, with the normal forward loop you fetch the count and compare your index variable against it on every iteration. In a reverse loop you fetch the count once, as your starting point, and then every comparison is against 0, which is cheaper to test and which compilers can optimize for. Your mileage may vary, and depending on the rest of the code the difference may be negligible, but if you need every clock cycle, reverse loops are awesome.
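To make the point concrete, a forward loop can hoist the bound into a local and become equivalent to the backward loop in this respect (a sketch, not taken from the question's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var items = Enumerable.Range(0, 100).ToList();

// Hoist the bound into a local so it is read once, exactly as the backward
// loop does implicitly. Each iteration then compares local against local.
int count = items.Count;
long sum = 0;
for (int i = 0; i < count; i++)
    sum += items[i];

Console.WriteLine(sum);
```

With the bound cached like this, the remaining difference from the reverse loop is only the direction of the comparison.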

1

There is substantially zero difference between forwards and backwards iteration of List<T> and T[]. However, there is substantial difference between caching and not caching the loop boundary, and backwards iteration automatically effectively caches the loop boundary (it takes unnatural effort to avoid this).

I've included a benchmark and a result. It is evident that the *BoundaryCached* tests (forwards and backwards) are equally fast, while the *BoundaryUncached* tests are markedly slower. This result is consistent for sizes of 1, 5, 10, 100, 1,000, and 800,000. My CPU has a 512 KiB L1 cache, in which 800,000 32-bit integers will not fit.

BenchmarkDotNet=v0.13.1, OS=opensuse-leap 15.3
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.202
  [Host]     : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
  DefaultJob : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
| Method | items | Mean | Error | StdDev | Ratio |
|---|---:|---:|---:|---:|---:|
| For_List_BoundaryUncached_Forwards | 10 | 37.1188 ns | 0.1784 ns | 0.1582 ns | 1.00 |
| For_List_BoundaryCached_Forwards | 10 | 21.4687 ns | 0.0858 ns | 0.0717 ns | 0.58 |
| For_List_BoundaryCached_Backwards | 10 | 21.6920 ns | 0.2395 ns | 0.2241 ns | 0.58 |
| For_Array_BoundaryUncached_Forwards | 10 | 5.6312 ns | 0.0164 ns | 0.0145 ns | 0.15 |
| For_Array_BoundaryCached_Forwards | 10 | 4.2638 ns | 0.0672 ns | 0.0629 ns | 0.11 |
| For_Array_BoundaryCached_Backwards | 10 | 4.6106 ns | 0.0338 ns | 0.0316 ns | 0.12 |
| For_List_BoundaryUncached_Forwards | 1000 | 3,602.5782 ns | 11.2947 ns | 10.5650 ns | 1.00 |
| For_List_BoundaryCached_Forwards | 1000 | 1,922.8651 ns | 14.5266 ns | 13.5882 ns | 0.53 |
| For_List_BoundaryCached_Backwards | 1000 | 1,911.4424 ns | 8.7750 ns | 8.2081 ns | 0.53 |
| For_Array_BoundaryUncached_Forwards | 1000 | 469.6596 ns | 0.8472 ns | 0.6614 ns | 0.13 |
| For_Array_BoundaryCached_Forwards | 1000 | 469.5290 ns | 0.5630 ns | 0.4991 ns | 0.13 |
| For_Array_BoundaryCached_Backwards | 1000 | 462.3722 ns | 1.1469 ns | 1.0728 ns | 0.13 |
| For_List_BoundaryUncached_Forwards | 800000 | 2,841,438.8663 ns | 19,331.9630 ns | 16,143.0646 ns | 1.00 |
| For_List_BoundaryCached_Forwards | 800000 | 1,525,126.0399 ns | 8,127.2104 ns | 7,204.5603 ns | 0.54 |
| For_List_BoundaryCached_Backwards | 800000 | 1,544,717.2824 ns | 5,226.2047 ns | 4,888.5952 ns | 0.54 |
| For_Array_BoundaryUncached_Forwards | 800000 | 373,252.2978 ns | 1,450.3420 ns | 1,356.6508 ns | 0.13 |
| For_Array_BoundaryCached_Forwards | 800000 | 373,244.7567 ns | 933.6585 ns | 827.6639 ns | 0.13 |
| For_Array_BoundaryCached_Backwards | 800000 | 373,297.0415 ns | 705.1777 ns | 659.6236 ns | 0.13 |
// MIT or Apache-2.0
using BenchmarkDotNet.Attributes;
using System.Diagnostics;

public class LoopForwardsBackwards
{
    private IList<int>? valuesList;
    private int[]? valuesArray;

    [Params(10, 1000, 800000)]
    public int items;

    [GlobalSetup]
    public void GlobalSetup()
    {
        var numbers = Enumerable.Range(0, items).ToList();

        valuesList = numbers;
        valuesArray = valuesList.ToArray();
    }

    [Benchmark(Baseline = true)]
    public int For_List_BoundaryUncached_Forwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        for (int i = 0; i < valuesList.Count; ++i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_List_BoundaryCached_Forwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        int len = valuesList.Count;
        for (int i = 0; i < len; ++i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_List_BoundaryCached_Backwards()
    {
        Debug.Assert(valuesList is not null);
        int n = 0;
        for (int i = valuesList.Count - 1; i >= 0; --i)
            n += valuesList[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryUncached_Forwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        for (int i = 0; i < valuesArray.Length; ++i)
            n += valuesArray[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryCached_Forwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        int len = valuesArray.Length;
        for (int i = 0; i < len; ++i)
            n += valuesArray[i];
        return n;
    }

    [Benchmark]
    public int For_Array_BoundaryCached_Backwards()
    {
        Debug.Assert(valuesArray is not null);
        int n = 0;
        for (int i = valuesArray.Length - 1; i >= 0; --i)
            n += valuesArray[i];
        return n;
    }
}
mkjeldsen
0

Testing against 0 versus the max index is likely to have little effect. However, performing test 1 and then test 2 shortly thereafter often has an effect due to processor caching and/or OS page caching. You might reverse the order of test 1 and test 2 to see if forwards magically becomes faster than backwards. Accurate profiling is hard on modern architectures.

OK, so backwards is still faster when executed first. Not my first guess, but since you are using Parallel and lock, there is perhaps an interaction between the locking and the difference between forward and backward looping.

Maybe the backwards loop just happens to work better with processor branch prediction (which again may interact with parallelism, the processor cache, etc.).

Lots of tight loops in multi-threaded code have weird interactions with memory management because of locking overhead. It is not even uncommon for a multi-threaded solution to be slower than the sequential one because of lock contention.

You can try running both the forward and backward loops without Parallel to see if the times become more even, but at best you would only determine whether the effect is likely or unlikely to be related to parallel interactions or lock contention. Profiling your code may be suggestive, but it may not provide a definite answer either. A definitive answer can be quite difficult to reach in this situation (I assume you are mostly in curious/learning mode).
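If lock contention is the suspect, one experiment is to take the shared lock out of the hot path entirely, using the Parallel.ForEach overload that carries thread-local state. This is a sketch against plain integers rather than event-log entries; the item * 2 step merely stands in for the translation work:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var source = Enumerable.Range(0, 10_000).ToList();
var results = new List<int>();

Parallel.ForEach(
    source,
    () => new List<int>(),          // localInit: one private list per worker
    (item, state, local) =>
    {
        local.Add(item * 2);        // "translate" each item; no lock in the hot path
        return local;
    },
    local =>
    {
        lock (results)              // localFinally: lock only once per worker
        {
            results.AddRange(local);
        }
    });

Console.WriteLine(results.Count);
```

If the forward/backward gap shrinks or vanishes with this version, that points toward the lock rather than the loop direction.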

Gary Walker
  • I don't see that much difference between the two algorithms when I use a regular loop instead of a parallel loop... Still not quite sure why it's behaving like this. – Arian Motamedi Aug 26 '13 at 17:19
  • Actually, that is the key to the answer. If the performance is basically the same forwards and backwards, it is almost sure to be either an interaction with memory-management locking or some other locking behaviour. Rooting out the exact nature of the interaction would be quite difficult, and would depend on your environment. – Gary Walker Aug 29 '13 at 17:41
0

The first loop is slower because it's first, not because it's forwards.

Caching

Modern CPUs cache data (in Level 1 and Level 2 caches). This is slow the first time the data gets accessed, then faster for subsequent accesses.

   var currentEntry = currentLog.Entries[index];

The first loop will take longer, because it's loading from slow RAM into the L2 cache.

I would expect the second loop to be faster, regardless of how it's written, because it's loading from the L2 cache.

List<T>

Lists are ever-expanding arrays. They start off small (capacity 4), then double their capacity as required. Each reallocation is very slow.

  var translatedLogs = new List<EventLogItem>();
  ...

  translatedLogs.Add(translatedEntry);

The first loop will reallocate fairly often: 4, 8, 16, 32, 64

The second loop will reallocate less often: 64, 128

So you would expect the second loop (regardless of how it's written) to be faster.
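The doubling behaviour can be observed directly through List&lt;T&gt;.Capacity, and pre-sizing the list avoids the reallocations entirely (a sketch; the exact growth sequence is an implementation detail of .NET, though current implementations start at 4 and double):

```csharp
using System;
using System.Collections.Generic;

var list = new List<int>();
var capacities = new List<int>();

for (int i = 0; i < 70; i++)
{
    list.Add(i);
    if (capacities.Count == 0 || list.Capacity != capacities[^1])
        capacities.Add(list.Capacity);      // record each reallocation point
}

// Each recorded entry is a reallocation: the backing array doubled every time.
Console.WriteLine(string.Join(", ", capacities));

// Pre-sizing avoids the reallocations entirely:
var presized = new List<int>(70);
int before = presized.Capacity;
for (int i = 0; i < 70; i++) presized.Add(i);
Console.WriteLine(presized.Capacity == before); // capacity never changed
```

In the question's code, passing an estimated count to the List&lt;EventLogItem&gt; constructor would remove this variable from the comparison.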

CPU optimizations

Weird things happen because processors are so complex. You can no longer predict code speed the way we used to in the old days :-)

Why is processing a sorted array faster than an unsorted array?

Community