17

I have some 3d interpolation code that takes up 90% of my projects runtime and cannot be precomputed.

What are some techniques that I could use to speed this up? Algorithmic or Micro Optimization?

Here is the code for those interested.

It basically takes data that was placed across the 2 3d arrays and interpolates the rest of the data.

EDIT: Also I am already spliting this into threads at a higher level for increased performance, but this doesn't help on the windows phone as they are all single core...

I will probably do something like (Single[] DensityMap = new Single[128 * 128 * 128];) to remove the multi-D array hit. I access the array in 100's of places and was hoping to not have to do that (wrapping in a function doesn't help as the windows phone won't inline the function call and it doesn't help perf then...)

float[, ,] DensityMap = new float[128, 128, 128];
float[, ,] PressureMap = new float[128, 128, 128];

unchecked
{
    for (int x = 0; x < g_CraftWorldConstants.RegionSizeX; x++)
    {
        int offsetX = (x / SAMPLE_RATE_3D_HOR) * SAMPLE_RATE_3D_HOR;
        int plusOffsetX = SAMPLE_RATE_3D_HOR + offsetX;
        int poxox = plusOffsetX - offsetX;
        double poxxpoxox = ((plusOffsetX - x) / (double)poxox);
        double xoxpoxox = ((x - offsetX) / (double)poxox);

        for (int y = 0; y < g_CraftWorldSettings.GET.RegionSizeY; y++)
        {
            int offsetY = (y / SAMPLE_RATE_3D_VERT) * SAMPLE_RATE_3D_VERT;
            int plusOffsetY = SAMPLE_RATE_3D_VERT + offsetY;
            int poyoy = plusOffsetY - offsetY;
            double poyypoyoy = ((plusOffsetY - y) / (double)poyoy);
            double yoypoyoy = ((y - offsetY) / (double)poyoy);

            for (int z = 0; z < g_CraftWorldConstants.RegionSizeZ; z++)
            {
                if (!(x % SAMPLE_RATE_3D_HOR == 0 && y % SAMPLE_RATE_3D_VERT == 0 && z % SAMPLE_RATE_3D_HOR == 0))
                {
                    int offsetZ = (z / SAMPLE_RATE_3D_HOR) * SAMPLE_RATE_3D_HOR;
                    int plusOffsetZ = SAMPLE_RATE_3D_HOR + offsetZ;
                    int pozoz = plusOffsetZ - offsetZ;
                    double pozzpozoz = ((plusOffsetZ - z) / (double)pozoz);
                    double zozpozoz = ((z - offsetZ) / (double)pozoz);

                    double x00 = poxxpoxox * in_DensityMap[offsetX, offsetY, offsetZ] + xoxpoxox * in_DensityMap[plusOffsetX, offsetY, offsetZ];
                    double x10 = poxxpoxox * in_DensityMap[offsetX, offsetY, plusOffsetZ] + xoxpoxox * in_DensityMap[plusOffsetX, offsetY, plusOffsetZ];
                    double x01 = poxxpoxox * in_DensityMap[offsetX, plusOffsetY, offsetZ] + xoxpoxox * in_DensityMap[plusOffsetX, plusOffsetY, offsetZ];
                    double x11 = poxxpoxox * in_DensityMap[offsetX, plusOffsetY, plusOffsetZ] + xoxpoxox * in_DensityMap[plusOffsetX, plusOffsetY, plusOffsetZ];

                    double r0 = poyypoyoy * x00 + yoypoyoy * x01;
                    double r1 = poyypoyoy * x10 + yoypoyoy * x11;
                    in_DensityMap[x, y, z] = (float)(pozzpozoz * r0 + zozpozoz * r1);

                    double x02 = poxxpoxox * in_CaveDensity[offsetX, offsetY, offsetZ] + xoxpoxox * in_CaveDensity[plusOffsetX, offsetY, offsetZ];
                    double x12 = poxxpoxox * in_CaveDensity[offsetX, offsetY, plusOffsetZ] + xoxpoxox * in_CaveDensity[plusOffsetX, offsetY, plusOffsetZ];
                    double x03 = poxxpoxox * in_CaveDensity[offsetX, plusOffsetY, offsetZ] + xoxpoxox * in_CaveDensity[plusOffsetX, plusOffsetY, offsetZ];
                    double x13 = poxxpoxox * in_CaveDensity[offsetX, plusOffsetY, plusOffsetZ] + xoxpoxox * in_CaveDensity[plusOffsetX, plusOffsetY, plusOffsetZ];

                    double r2 = poyypoyoy * x02 + yoypoyoy * x03;
                    double r3 = poyypoyoy * x12 + yoypoyoy * x13;
                    in_CaveDensity[x, y, z] = (float)(pozzpozoz * r2 + zozpozoz * r3);
                }
            }
        }
    }
}
Daniel Armstrong
  • 829
  • 9
  • 22
  • 1
    Maybe you could aproximate - for instance, take only 1/10 of the values on each axis, and interpolate the missing values. – Oliver May 25 '12 at 12:29
  • `Parallel.For` could speed things up a lot – cjk May 25 '12 at 12:30
  • If you go for parallelization, make sure you only parallelize at one level—probably the outer for statement. So you would have two sequential for loops insde one parallel for loop. See http://msdn.microsoft.com/en-us/library/dd997392.aspx for why. – Olly May 25 '12 at 12:44
  • You have some calculations which be simplified (I'm sure the compiler does this anyway, hence it won't help performance), e.g. `(x / SAMPLE_RATE_3D_HOR) * SAMPLE_RATE_3D_HOR` can simply become `x`, this can then eliminate some redundant variables. – Lukazoid May 25 '12 at 12:48
  • @Lukazoid-those are int calculations, so they may not be able to be simplified, i.e. (1234 / 100) * 100 = 1200. – hatchet - done with SOverflow May 25 '12 at 12:58

5 Answers5

16

It seems that you have a lot of opportunities to optimize your code. Your x loop executes 128 times, your y loop executes 128*128=16,384 times, and your z loop executes 128^3=2,097,152 times. There are a number of terms inside your z loop that are only dependent on the x, or the y iterations, but they are recalculated at every z iteration. For example,

int poxox = plusOffsetX - offsetX;

and

double poxxpoxox = ((plusOffsetX - x) / (double)poxox);

These two terms are being calculated more than 2 million times, but only need to be calculated 128 times if my cursory scan of your function is correct. Move terms to the loop level that is appropriate so you don't waste cycles recalculating the same values many multiple times.

Here is your code with basic optimizations made. I'm curious to know how this affects your run times. Several of the terms are only dependent on the iteration value, and are the same for x, y, and z. So I pulled them out entirely and precompute them once. I also have moved the outer mod operations out of the inner loop, and modified the logic to ensure short circuit of the evaluation, which should should remove the majority of mod operations that were previously being executed.

int[] offsets = new int[128];
int[] plusOffsets = new int[128];
double[] poii = new double[128];
double[] ioip = new double[128];
for (int i = 0; i < 128; i++) {
    offsets[i] = (i / SAMPLE_RATE_3D_HOR) * SAMPLE_RATE_3D_HOR;
    plusOffsets[i] = SAMPLE_RATE_3D_HOR + offsets[i];
    double poioi = (double) (plusOffsets[i] - offsets[i]);
    poii[i] = ((plusOffsets[i] - i) / poioi);
    ioip[i] = ((i - offsets[i]) / poioi);
}

float[, ,] DensityMap = new float[128, 128, 128];
float[, ,] PressureMap = new float[128, 128, 128];

for (int x = 0; x < g_CraftWorldConstants.RegionSizeX; x++)
{
    int offsetX = offsets[x];
    int plusOffsetX = plusOffsets[x];
    double poxxpoxox = poii[x];
    double xoxpoxox = ioip[x];
    bool xModNot0 = !(x % SAMPLE_RATE_3D_HOR == 0);

    for (int y = 0; y < g_CraftWorldConstants.RegionSizeY; y++)
    {
        int offsetY = offsets[y];
        int plusOffsetY = plusOffsets[y];
        double poyypoyoy = poii[y];
        double yoypoyoy = ioip[y];
        bool yModNot0 = !(y % SAMPLE_RATE_3D_VERT == 0);

        for (int z = 0; z < g_CraftWorldConstants.RegionSizeZ; z++)
        {
            //if (!(x % SAMPLE_RATE_3D_HOR == 0 && y % SAMPLE_RATE_3D_VERT == 0 && z % SAMPLE_RATE_3D_HOR == 0))
            if (xModNot0 || yModNot0 || !(z % SAMPLE_RATE_3D_HOR == 0))
            {
                int offsetZ = offsets[z];
                int plusOffsetZ = plusOffsets[z];
                double pozzpozoz = poii[z];
                double zozpozoz = ioip[z];

                double x00 = poxxpoxox * DensityMap[offsetX, offsetY, offsetZ] + xoxpoxox * DensityMap[plusOffsetX, offsetY, offsetZ];
                double x10 = poxxpoxox * DensityMap[offsetX, offsetY, plusOffsetZ] + xoxpoxox * DensityMap[plusOffsetX, offsetY, plusOffsetZ];
                double x01 = poxxpoxox * DensityMap[offsetX, plusOffsetY, offsetZ] + xoxpoxox * DensityMap[plusOffsetX, plusOffsetY, offsetZ];
                double x11 = poxxpoxox * DensityMap[offsetX, plusOffsetY, plusOffsetZ] + xoxpoxox * DensityMap[plusOffsetX, plusOffsetY, plusOffsetZ];

                double r0 = poyypoyoy * x00 + yoypoyoy * x01;
                double r1 = poyypoyoy * x10 + yoypoyoy * x11;
                DensityMap[x, y, z] = (float)(pozzpozoz * r0 + zozpozoz * r1);

                double x02 = poxxpoxox * PressureMap[offsetX, offsetY, offsetZ] + xoxpoxox * PressureMap[plusOffsetX, offsetY, offsetZ];
                double x12 = poxxpoxox * PressureMap[offsetX, offsetY, plusOffsetZ] + xoxpoxox * PressureMap[plusOffsetX, offsetY, plusOffsetZ];
                double x03 = poxxpoxox * PressureMap[offsetX, plusOffsetY, offsetZ] + xoxpoxox * PressureMap[plusOffsetX, plusOffsetY, offsetZ];
                double x13 = poxxpoxox * PressureMap[offsetX, plusOffsetY, plusOffsetZ] + xoxpoxox * PressureMap[plusOffsetX, plusOffsetY, plusOffsetZ];

                double r2 = poyypoyoy * x02 + yoypoyoy * x03;
                double r3 = poyypoyoy * x12 + yoypoyoy * x13;
                PressureMap[x, y, z] = (float)(pozzpozoz * r2 + zozpozoz * r3);
            }
        }
    } 
}
  • 1
    It actually made things about 19% faster! this is a huge improvement :) I had previously moved many of those variables up and out of functions on the inside there but I missed all the ones you found, shows just how good another pair of eyes can be. SAMPLE_RATE_3D_HOR is an int too (someone asked), also all the stupid names for variables (like poxxpoxox) are because of resharpers inline function which I used to lazyly remove the 8 inner functions. Thanks this is really helpful, can you see any other things I could do to improve it? – Daniel Armstrong May 25 '12 at 15:05
  • @DanielArmstrong - I've edited my answer to include another optimization that I noticed right after I posted the previous version, but couldn't post until now because I had to drive somewhere. This should improve the speed further. You will need to test to ensure my changes don't bugger up the results. – hatchet - done with SOverflow May 25 '12 at 15:22
  • @DanielArmstrong - I've edited the answer to include one more little optimization to remove around 4 million mod (%) operations, which I think are kind of expensive if the right hand side is not a power of 2. – hatchet - done with SOverflow May 25 '12 at 18:42
  • 1
    this looks very promising, I will impliment this tomorrow and let you know how it goes. – Daniel Armstrong May 26 '12 at 02:25
8

There are some things you can do to speed up your code:

  • Avoid using multidim.-arrays because they are slow
  • Use multiple threads
  • Store variables which going to be cast as a double in a double variable
  • Precalculate everything you can ( see hatchet's post )

Arrays

To simulate the 3D-array, you just can do it this way:

Single[] DensityMap = new Single[128 * 128 * 128];
DensityMap[z + (y * 128) + (x * 128 * 128)] = ...;
Community
  • 1
  • 1
Felix K.
  • 6,201
  • 2
  • 38
  • 71
  • How can multiple threads help, on a single core cpu? It makes things even worse, because of the context switches and the synchronization issues. – dowhilefor May 25 '12 at 12:43
  • @dowhilefor He is also talking about the XBox360 which actually has 3 cores( Where each can handle 2 threads )! What synchonization issues you are talking about? – Felix K. May 25 '12 at 12:46
  • Sry, didn't saw about the xbox. Making sure not deadlocking your threads, if you use multiple. So if he actually can use multiple cores, than dealing with the synchronization problems might worth the effort, but i thought only about one core, which would perform much worse with multiple threads. Synchronization comes at a cost aswell. – dowhilefor May 25 '12 at 12:51
  • @dowhilefor I don't think you need to sync things up here. You can split the work here with ease and you are not dealing with data between this threads. – Felix K. May 25 '12 at 12:55
  • 1
    You are right. But often its worth to consider that this can arise. Anyway good answer +1 – dowhilefor May 25 '12 at 12:58
  • @Felix K. I am already spliting this into threads at a higher level for increased performance, but this doesn't help on the windows phone as they are all single core... – Daniel Armstrong May 25 '12 at 14:29
  • @Felix K. I will probably do something like (Single[] DensityMap = new Single[128 * 128 * 128];) to remove the multi-D array hit. I access the array in 100's of places and was hoping to not have to do that (wrapping in a function doesn't help as the windows phone won't inline the function call and it doesn't help perf then...) – Daniel Armstrong May 25 '12 at 14:30
2

Use a jagged array rather than multi-dimensional, i.e. do

float[][][] DensityMap = new float[128][][];

And then create the inner arrays using for loops, or the LINQ syntax (which might be sub-optimal).

This will give MUCH better performance than using a multi-dimensional array, and equal or better performance than using a single-dimensional array and calculating the offsets yourself. That is, unless the cost of initializing the jagged array is significant; it will after all create 128^2 arrays. I'd benchmark it and only revert to a single-dimensional array if the cost really is significant.

Community
  • 1
  • 1
Asik
  • 21,506
  • 6
  • 72
  • 131
  • +1. A simple benchmark I did not that long ago indicated that initialization of jagged array takes longer than initialization of multidimensional array (I don't remember how much longer), but the access to elements of jagged array was in between 15 - 20% faster. So indeed, in this case (initializing once, but accessing elements a lot) this can help, on top of the hints in other replies. – Patryk Ćwiek May 25 '12 at 15:54
  • Why are jagged arrays faster? They should be slower as the require more memory accesses. – mafu Jul 26 '20 at 03:10
  • It depends how multi-dimensional arrays are implemented. Back then, they didn't benefit from the same optimizations as single-dimensional arrays. This might have changed though, this answer is 8 years old. – Asik Jul 27 '20 at 18:40
1

You can change your for loops since you're not doing anything for the in-between values of all of these

for (int x = 0; x < 128; x+= SAMPLE_RATE_3D_HOR) {
   for (int y = 0; y < 128; y+= SAMPLE_RATE_3D_VERT) {
      for (int z = 0; z < 128; z+= SAMPLE_RATE_3D_HOR) {

Doing these in parallel would be even better.

With this you can eliminate the 6 million mod % calculations and 60+ thousand multiplies.

--edit-- Sorry, I missed the "!" on your line with the 3 mods. You can still skip some of those calculations. See comments below.

GeekyMonkey
  • 12,478
  • 6
  • 33
  • 39
  • Actually I already do, There is another W loop out side this function :) and that is threaded on the 360 among all 6 threads. – Daniel Armstrong May 25 '12 at 15:09
  • What do you mean by "With this you can eliminate the 6 million mod % calculations and 60+ thousand multiplies." – Daniel Armstrong May 25 '12 at 15:11
  • Sorry, I misread this line if (!(x % SAMPLE_RATE_3D_HOR == 0 && y % SAMPLE_RATE_3D_VERT == 0 && z % SAMPLE_RATE_3D_HOR == 0)) I missed the "!". Still you could eliminate some of those mod calculations on the innermost loop if you skipped the Y and Z loops where (X % SAMPLE_RATE_3D_HOR == 0) and skipped the z loop where (Y % SAMPLE_RATE_3D_VERT == 0) – GeekyMonkey May 25 '12 at 15:47
0

1) Do you really need doubles? Especially you are mixing some floats, doubles and ints.

2) You should precalculate the k / SAMPLE_RATE_3D_HOR * SAMPLE_RATE_3D_HOR pattern.

int pre_calc[128];
for( int i = 0; i < 128; ++i )
    pre_calc[i] = (i / SAMPLE_RATE_3D_HOR) * SAMPLE_RATE_3D_HOR;
Christopher
  • 8,912
  • 3
  • 33
  • 38
  • I was under the impression that on the windows 7 phone doubles were faster than floats by 40-50%? This is apparently because the clr casts floats to doubles anyway. – Daniel Armstrong May 25 '12 at 15:28
  • Doubles are slower. The clr promotes floats (and ints) in operators to double, when one of the parameters is an double. – Christopher May 26 '12 at 19:08
  • I just benchmarked a program on the phone hardware, math on doubles (addition, mult, sub, div) seems to be about 35% faster. I could be doing something wrong however, it could be something todo with the cache perhaps? – Daniel Armstrong May 27 '12 at 08:33