0

What are the fastest possible iteration techniques in C# for the following scenario ?

Since im working on a small archetype based ECS in c#, i want to make use of cache efficient iterations for maximum performance. What could i do to make the iteration faster and get the maximum cache hits ?

       var chunks = archetype.Chunks; // Property that returns a Chunk[] array
        for (var chunkIndex = 0; chunkIndex < archetype.Size; chunkIndex++) {

            ref var chunk = ref chunks[chunkIndex];
            var transforms = chunk.GetArray<Transform>();  // Returns a Transform[] array
            var rotations = chunk.GetArray<Rotation>();    // Returns a Rotation[] array

            for (var index = 0; index < chunk.Capacity; index++) {

                ref var transform = ref transforms[index];
                ref var rotation = ref rotations[index];

                transform.x++;
                rotation.w++;
            }
        }

Details...


  public struct Transform{ float x; float y; }
  public struct Rotation{ float x; float y; float z; float w; }

  T[] (chunk).GetArray<T>(){
     return fittingTightlyPackedManagedArrayForT as T[]; // Pseudocode
  }

  int (chunk).Capcity{ get; set; }  // Just a property of how big each array is in the chunk, all having the same size

I already tested a unsafe variant to reduce the bound checks, however this increased the cache misses according to my benchmark and was only slightly faster ( not noticeable, not even for high amounts ).

What elese could i do to increase the iteration speed ? Glad for any feedback, techniques and tricks ! :)

genaray
  • 1,080
  • 1
  • 9
  • 30
  • 3
    Haunting for nanoseconds? Honestly? Usually I'd suggest to think about those micro-optimizations when you have a real performance-problem in the above code. Do you? Did you profile your application to verifiy this code is causing your bottleneck? – MakePeaceGreatAgain Oct 19 '22 at 10:39
  • https://wiki.c2.com/?PrematureOptimization but also more importantly https://ericlippert.com/2012/12/17/performance-rant/ – DavidG Oct 19 '22 at 10:40
  • 3
    @MakePeaceGreatAgain given that OP benchmarked and measured cache misses, why do you assume they didn't profile? – Good Night Nerd Pride Oct 19 '22 at 10:41
  • 1
    @GoodNightNerdPride fair enough, however not all benchmark-tests represent real-word scenarios. Of course I can show that I can make my function 10x faster by calling the code 1Mio times and measure. However that still does not mean it is neccessary, if the app just performs 10 calls instead of 1Mio. ones. – MakePeaceGreatAgain Oct 19 '22 at 10:43
  • Furthermor OP even mentions that the performance-decrease was "not noticable". So why even care? – MakePeaceGreatAgain Oct 19 '22 at 10:45
  • I benchmarked and its an ECS framework which needs to be high performant by default... so its actually nessecary. Besides that this post is also kinda educational since knowing the fastest possible iteration technique over arrays could always be usefull. – genaray Oct 19 '22 at 10:45
  • 1
    AFAIK in order to avoid cache misses one could place the data to iterate over into a continuous block in memory. For instance you could store all `Transform`s and `Rotation`s as two huge blocks of memory and then iterate over all of them at once instead of iterating them per chunk. – Good Night Nerd Pride Oct 19 '22 at 10:46
  • "Fastest possible iteration technique" is a moving target, unless you presume cache sizes, cache hierarchies, prefetching, branch prediction, etc. won't be different for some other or newer CPU family. –  Oct 19 '22 at 10:47
  • @GoodNightNerdPride Now this is what we are talking about :D Thats usefull, you could make a answer out of this. Im gonna try that, however... that means that only unmanaged structs are allowed right ? So no `struct{ string myName; }` ? – genaray Oct 19 '22 at 10:47
  • I don't know which ECS you are using, but if these properties are able to be changed on different thread, consider [Parallel](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=net-6.0) class. – shingo Oct 19 '22 at 10:55
  • That really depends on your data layout. Could you show us the definitions of `Transform`, `Rotation`, `Chunk.GetArray()`, and `Chunk.Capacity`? – Good Night Nerd Pride Oct 19 '22 at 10:59
  • @GoodNightNerdPride Done ^^ – genaray Oct 19 '22 at 11:14
  • Can different `Chunk`s have different values for `Capacity` or is it always the same? – Good Night Nerd Pride Oct 19 '22 at 11:16
  • Also `fittingTightlyPackedManagedArrayForT` kinda sounds like what I mean already. Could you elaborate on the pseudo code there? – Good Night Nerd Pride Oct 19 '22 at 11:18
  • @GoodNightNerdPride Different, thats why its a property. It basically shows how many entities are within that chunk ^^ How many times we need to iterate over it. – genaray Oct 19 '22 at 11:18
  • @GoodNightNerdPride Well each chunk contains several different arrays... like Transform[] or Rotation[]... or both or even more. Those are managed c# arrays. However i do not do the memory managed by my own and since they are managed, they also dont lie next to each other... my repo is here for further explanation :D https://github.com/genaray/Arch – genaray Oct 19 '22 at 11:20
  • One last detail: can `Chunk`s grow/shrink, i.e. can `Transform`s/`Rotation`s be added/removed from it during run time? Or will a `Chunk` always keep the `Transform`s/`Rotation`s it got when created for the duration of its lifetime? – Good Night Nerd Pride Oct 19 '22 at 11:24
  • Chunks have a fixed total capcity. Transform/Rotation can not be added/removed during runtime. Chunk keeps them, however they can be replaced/set new ^^ – genaray Oct 19 '22 at 11:28
  • 1
    Well that complicates things. I had no idea what an ECS is, so I googled and found [this](https://ajmmertens.medium.com/building-an-ecs-2-archetypes-and-vectorization-fe21690805f9). Looks exactly like what you could try (but its C++). – Good Night Nerd Pride Oct 19 '22 at 11:31

1 Answers1

1

A plain loop over an array or list is as fast as you can do iteration in c#, at least unless you have some special knowledge not available to the compiler. The compiler should recognize that you are looping over an array, and skip the bounds-check. And doing an linear iteration should allow the CPU to prefetch data before it is actually needed.

In your example I would not be certain the compiler could remove the bounds-checks, since the loop check is not against for the array length. So I would at least try changing it to two separate loops over the array instead.

I'm not sure why the unsafe version had lower cache hit rate, the cache is controlled by the CPU, not the compiler, and I would expect an unsafe version to produce very similar code to the compiler, at least with regards to memory access.

In some special cases it might be useful to manually unroll loops, but the compiler should be able to do this automatically, and this question suggest it is of little use. But compiler optimizations can be fickle, it might not always apply optimizations you expect it would, and what optimizations it applies might be different between versions, how long it is run, if you apply profile guided optimizations etc.

To get any real gains I would look at SIMD techniques, if you can process larger chunks of data you might get some very significant gains. But the gains might depend in large part on how the data is stored and accessed.

In some cases there can be major gains by using a structure of arrays (SoA) approach rather than the more common arrays of structures (AoS). In your example, if all the x and w values where stored in separate arrays you could just process the entire array in 128/256/512 bit SIMD blocks, and that would be fairly difficult to beat. This also has great cache efficiency, since you are not loading any unnecessary bytes. But using the SoA approach might have performance implications for other parts of the code.

JonasH
  • 28,608
  • 2
  • 10
  • 23
  • One quick question than... C++ and rust loop iterations similar to my c# code above are faster. I replicated this... why is this ? Is the JIT not that optimized yet ? – genaray Oct 19 '22 at 16:05
  • @genaray Any offline compiler has the freedom to spend however much time it wants on optimization, a JIT however needs to be real time. For a *simple* iteration over an array I would expect similar times, at least without auto vectorization. But I do not think your example falls into the 'simple' category, there is to much going on that could foil the c# optimizer. But they have introduced tiered compilation in c#, so that could allow the compiler to spend more time where it is needed. Assuming it is actually the iteration that takes time, and not some reflection that is not shown. – JonasH Oct 20 '22 at 07:15