Optimize a function in a search biggest square algorithm

Question

I'm kind of new to optimization and I'm in a friendly contest to code the fastest algorithm to find a square in a field with obstacles. I think my algorithm is pretty fast (2.07s in a 10k * 10k grid with 70% density) but I'd like to make it faster. My most used function is the one finding the max size for given coordinates in the square. Here it is :

int     max_size(int i, int j, int **tab, int offset)
{
int a;

a = offset;
if (i > 0 && j > 0)
{
    while (i + a < g_ymax && j + a < g_xmax && tab[i + a][j + a] - tab[i + a][j - 1] - tab[i - 1][j + a] + tab[i - 1][j - 1] == 0)
        a++;
    return (a);
}
else if (i > 0 && j == 0)
{
    while (i + a < g_ymax && j + a < g_xmax && tab[i + a][j + a] - tab[i - 1][j + a] == 0)
        a++;
    return (a);
}
else if (i == 0 && j > 0)
{
    while (i + a < g_ymax && j + a < g_xmax && tab[i + a][j + a] - tab[i + a][j - 1] == 0)
        a++;
    return (a);
}
else
{
    while (i + a < g_ymax && j + a < g_xmax && tab[i + a][j + a] == 0)
        a++;
    return (a);
}
}

So this is very ugly and I'd like to optimize it. Would make it an inline function help or not at all ? I tried to run some tests but I didn't see much differences.

Just inlining the function (eg by making it static) could easily double the speed. — joop, Aug 21 '18 at 10:29
I don't understand you comment, my code works fine, I'd just like to make it faster. — Abilbel, Aug 21 '18 at 12:22

score 2 · Answer 1 · answered Aug 21 '18 at 11:08

First of all, it doesn't make much sense to speak of performance or manual optimizations without a specific system in mind. Generally, in order to optimize better than the compiler for that system, you have to be an expert of that particular system.

But lets assume this is some mainstream high-end system like x86, PowerPC or ARM. What then matters most for performance by far, is how the grid is stored in memory.

You have int **tab and treat it as a 2D array, which is code smell. This most likely means that tab points at some pointer-based look-up table, probably allocated on the heap. Accessing such a look-up table will give significant performance loss on a system with data cache memory, since the segmentation of the data will block efficient data cache utilization. See Correctly allocating multi-dimensional arrays for advice how to radically improve performance in this regard.

Once you have moved the grid to contiguous memory, the next step is to ensure that you iterate over it in such a manner that the right-most dimension is the one changing the most frequently. That is, given array[i][j], make sure to iterate so that j is the iterator that changes most frequently. This is also related to cache memory use and possibly alignment as well.

You can gain a minor performance boost by reducing the number of comparisons, by for example using a truth table. The fewer comparisons, the better the branch prediction.

Inlining the function might indeed give a slight performance gain, though the compiler should be able to make a better decision there, as inlining comes at the expense of executable size and perhaps also memory use.

If 10k*10k is max, then a minor performance gain could be achieved by swapping int for uint_fast16_t, although that's mostly an optimization for 8 and 16 bit CPUs.

I'm losing 0.1s on average using your link on correctly allocating multi dimensional arrays. I'm already iterating on the columns, I'll the truth table. — Abilbel, Aug 21 '18 at 11:36
@Abilbel Then you are doing something wrong. How do you benchmark this? In my experience, 9 out 10 questions regarding performance on SO can be explained by incorrect benchmarking. — Lundin, Aug 21 '18 at 11:45
I have a random map generator, I generated 15 with the same density and compared the average for both options : 2.12s with look up tables and 2.31s without — Abilbel, Aug 21 '18 at 12:10
@Abilbel That's not what I asked - how do you do the actually clocking and which system is it? And without the code, there's just speculation. — Lundin, Aug 21 '18 at 12:26
Clocking is done using the time command. It is run on MacOs Sierra — Abilbel, Aug 21 '18 at 12:49
@Abilbel If that refers to standard C `time.h` then it may or may not be accurate enough. Anyway, trouble-shooting your benchmarking is a separate question of its own. — Lundin, Aug 21 '18 at 12:57
@Abilbel That's almost certainly not proper benchmarking then. You'll measure process creation, misc context switches and who knows what. With proper benchmarking you clock the actual algorithm, without any printing, and then run it over and over thousands of times. I know null about mac so I can't help you there. On Windows or Linux you have special benchmarking functions as part of the API. — Lundin, Aug 21 '18 at 13:03

SKi · Answer 2 · 2018-08-21T11:09:14.657

0

I am not sure about this, it might help compiler to keep those two values in registers:

int max_size(int i, int j, int **tab, int offset)
{
   int a;
   const int ymax = g_ymax;
   const int xmax = g_xmax;

   // Use ymax and xmax in the rest of code.

If g_ymax and g_xmax are constants like:

   static const int g_xmax = 1234;

then this won't help.

edited Aug 21 '18 at 11:09

answered Aug 21 '18 at 11:00

SKi

8,007
2
26
57

This won't matter at all. The compiler will put the variables in registers no matter what the programmer does. Not even the `register` keyword matters nowadays. Tricks like these are long since obsolete. – Lundin Aug 21 '18 at 11:10
@Lundin, Yes, you are right. tab is not modified in the loop. And probably those global variables are not volatiles. – SKi Aug 21 '18 at 11:51

score 0 · Answer 3 · answered Aug 21 '18 at 11:11

Use local variables in your function to simplify the conditions in your while loops. For example your first loop could be simplified as:

int   Limit;
int * C1=tab[i-1];
int   C2=C1[j - 1];

if (g_ymax-i < g_xmax-j)
{
  Limit = g_ymax-i;
}
else
{
  Limit = g_xmax-j;
}

while ((a < Limit) && (tab[i + a][j + a] - tab[i + a][j - 1] - C1[j + a] + C2 == 0) )
{
  a++;
}

return (a);

De-referencing a pointer also consumes clock cycles. That's why I've used C1, and C2 when the index of the array is constant.

score -1 · Answer 4 · answered Aug 21 '18 at 10:22

-1

we compare 'for(;;)' and 'while(1)', I read the corresponding assembly language codes:

'for(;;)', its assembly language doesn't have judgement sentence.

'while(1)', its assembly language has one judgement, to judge 1 is need to continue looping.

answered Aug 21 '18 at 10:22

asdf

221
1
10

I just modified it to get rid of the while(1) and it doesn't change anything. Updated code – Abilbel Aug 21 '18 at 10:24
Have you actually tested this? I get identical code for `while (1)` and `for (;;)`. – r3mainer Aug 21 '18 at 10:37
This is exactly the sort of micro-optimization the compiler can often do for you. – Useless Aug 21 '18 at 10:41
Sorry but the machine code for those two will be identical, unless the compiler is utter crap. And I don't see how this helps the OP? – Lundin Aug 21 '18 at 11:10

Optimize a function in a search biggest square algorithm

4 Answers4