0

I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):

int func(int s, int x, int* a, int* r) {
    int i;

    // do some stuff

    for (i=0;i<a_really_big_int;++i) {
        if (s) r[i] = x ^ i;
        else r[i] = x ^ a[i];
        // and maybe a couple other ways of computing r
        // that are equally fast individually
    }

    // do some other stuff

}

This code gets called so much that this loop is actually a speed bottleneck in the code. I am wondering a couple things:

  1. Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?

  2. If not, what is a good way to optimize this code?

====

Here is an update with a fuller example:

int func(int s,
         int start,int stop,int stride,
         double *x,double *b,
         int *a,int *flips,int *signs,int i_max,
         double *c)
{
  int i,k,st;
  for (k=start; k<stop; k += stride) {
    b[k] = 0;
    for (i=0;i<i_max;++i) {

      /* this is the code in question */
      if (s) st = k^flips[i];
      else st = a[k]^flips[i];
      /* done with code in question */

      b[k] += x[st] * (__builtin_popcount(st & signs[i])%2 ? -c[i] : c[i]);
    }
  }
}

EDIT 2:

In case anyone is curious, I ended up refactoring the code and hoisting the whole inner for loop (with i_max) outside, making the really_big_int loop be much simpler and hopefully easy to vectorize! (and also avoiding doing a bunch of extra logic a zillion times)

DenverCoder9
  • 473
  • 3
  • 9
  • 2
    If possible, do two `if` statements, each containing a loop, instead of a loop containing two `if` statements. Then you only check the condition once, not `a_really_big_int` times. This depends on the other things in your current loop being "extractable" in this way. – hnefatl Oct 06 '17 at 23:09
  • It wouldn't be too much of a stretch for a compiler to optimize this as you want. Why don't you look at the generated assembly to see if it does? – Barmar Oct 06 '17 at 23:11
  • @hnefatl Doesn't matter for a good compiler. – EOF Oct 06 '17 at 23:11
  • 1
    @hnefatl right---the reason that I wanted to avoid that is because the actual loop is several nested loops with some not-totally-trivial logic, and I don't want to have to duplicate code a bunch of times. But maybe this is the best solution – DenverCoder9 Oct 06 '17 at 23:11
  • @Barmar good idea--I'll look into it – DenverCoder9 Oct 06 '17 at 23:12
  • Do you have other stuff in the loop that isn't in the `if/else`? The compiler might not want to duplicate all that code just to optimize the `if`. – Barmar Oct 06 '17 at 23:14
  • 2
    @Barmar The example given becomes trivially vectorizable, so `gcc` *absolutely* hoists the `if()` out of the loop. – EOF Oct 06 '17 at 23:15
  • Take a look at what options your platform and compiler offer for SIMD or AVX. – jarmod Oct 06 '17 at 23:15
  • 1
    @EOF But maybe his actual code isn't exactly like the example given. He said he left out "unimportant" parts. They might be important to the GCC optimizer. – Barmar Oct 06 '17 at 23:17
  • I'll update with full example in a sec – DenverCoder9 Oct 06 '17 at 23:20
  • This example invokes an UB anyway for large enough `i` on 32 bit system – 0___________ Oct 06 '17 at 23:27
  • Is it possible to make it parallel? You could split this into 2 data sets and iterate over both in parallel. – Fantastic Mr Fox Oct 06 '17 at 23:30
  • @FantasticMrFox I am already parallelizing the crap out of it actually (see my edit)... it *still* is a bottleneck. – DenverCoder9 Oct 06 '17 at 23:31
  • @JohnKugelman Considering that the new code isn't even a compilable function, I've closevoted for lack of [MCVE]. – EOF Oct 06 '17 at 23:33
  • whoops. I renamed something `s` and forgot that was already my switch. editing... – DenverCoder9 Oct 06 '17 at 23:35
  • @G.Meyer Still not a compileable function. – EOF Oct 06 '17 at 23:37
  • @EOF ok, understood. That being said is declaring all variables, etc. really necessary for answering my question? it seems that would just add a bunch of code that aren't really relevant for what the compiler is doing. If you really want me to, I can make it compilable... – DenverCoder9 Oct 06 '17 at 23:39
  • 1
    @G.Meyer You can just make all variables be function arguments for all I care. The only point is that I can throw it at my compiler and get a result everybody else can verify (unlike when *I* have to take a hacksaw to your code to test it). – EOF Oct 06 '17 at 23:42
  • Ask the compiler to give you assembler, e.g. with `gcc -Wall -fverbose-asm -S -O2` then look into the generated assembler file `*.s` – Basile Starynkevitch Oct 06 '17 at 23:44

5 Answers5

4

One obvious way to optimize the code is pull the conditional outside the loop:

if (s)
    for (i=0;i<a_really_big_int;++i) {
        r[i] = x ^ i;
    }
else
    for (i=0;i<a_really_big_int;++i) {
        r[i] = x ^ a[i];
    }

A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.

wallyk
  • 56,922
  • 16
  • 83
  • 148
  • A "shrewd compiler" doesn't need the help, it can hoist the `if()` just fine. – EOF Oct 06 '17 at 23:45
  • 1
    @EOF: You missed my point. I am talking about what happens after the `if` is hoisted. – wallyk Oct 06 '17 at 23:46
  • 1
    ... and now it can be vectorized with `_mm*_xor_si*()`, `_mm*_set_epi32()` and unaligned load/stores (unless you can insure the inputs are aligned) , but hopefully the compiler can do it itself – technosaurus Oct 07 '17 at 16:15
2

Micro-optimizations

Usually they are not worth the time - reviewing larger issue is more effective.

Yet to micro-optimize, trying a variety of approaches and then profiling them to find the best can make for modest improvements.

In addition to @wallyk and @kabanus fine answers, some simplistic compilers benefit with a loop that ends in 0.

// for (i=0;i<a_really_big_int;++i) {
for (i=a_really_big_int; --i; ) {

[edit 2nd optimization]

OP added a more compete example. One of the issues is that the compiler can not make assumption that that the memory pointed to by b and others do not overlap. This prevents certain optimizations.

Assuming they in fact to do not overlap, use restrict on b to allow optimizations. const helps too for weaker compilers that do no deduce that. restrict on the others may also benefit, again, if the reference data does not overlap.

// int func(int s, int start, int stop, int stride, double *x,
//     double *b, int *a, int *flips,
//     int *signs, int i_max, double *c) {

int func(int s, int start, int stop, int stride, const double * restrict x,
    double * restrict b, const int * restrict a, const int * restrict flips, 
    const int * restrict signs, int i_max, double *c) {
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
1

All your commands are quick O(1) command in the loop. The if is definitely optimized, so is your for+if if all your commands are of the form r[i]=somethingquick. The question may boil down for you on how small can big int be?

A quick int main that just goes from INT_MIN to INT_MAX summing into a long variable, takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your commands may multiply this by a few, which quickly gets to a minute. Bottom line, this may be not be avoidable if you really are iterating a ton.

If r[i] are calculated independently, this would be a classic usage for threading/multi-processing.

EDIT:

I think % is optimized anyway by the compiler, but if not, take care that x & 1 is much faster for an odd/even check.

kabanus
  • 24,623
  • 6
  • 41
  • 74
  • re: the builtin, I think that depending on your architecture __builtin_popcount is a single machine instruction (from gcc) anyway. I'm not sure though. anyway I really doubt that the hamming weight of an int takes a lot of time to compute... – DenverCoder9 Oct 06 '17 at 23:41
  • @G.Meyer It is (returns the 1s or something). Didn't know for sure it was a single instruction though, thanks. – kabanus Oct 06 '17 at 23:45
1

Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If it is only running on systems with AVX2, you could use the __mm256 variants (similar for avx512*)

int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
    size_t i = 0, max = a_really_big_int / 4;
    __m128i xv =  _mm_set1_epi32(x);
    // do some stuff
    if (s) {
        __m128i iv = _mm_set_epi32(3,2,1,0); //or is it 0,1,2,3?
        __m128i four = _mm_set1_epi32(4);
        for ( ;i<max; ++i, iv=_mm_add_epi32(iv,four)) {
            r[i] = _mm_xor_si128(xv,iv);
        }
    }else{ /*not (s)*/
        for (;i<max;++i){
            r[i] = _mm_xor_si128(xv,a[i]);
        }
    }
    // do some other stuff   
}
technosaurus
  • 7,676
  • 1
  • 30
  • 52
0

Although the if statement will be optimized away on any decent compiler (unless you asked the compiler not to optimize), I would consider writing the optimization in (just in case you compile without optimizations).

In addition, although the compiler might optimize the "absolute" if statement, I would consider optimizing it manually, either using any available builtin, or using bitwise operations.

i.e.

b[k] += x[st] *
        ( ((__builtin_popcount(st & signs[I]) & 1) *
           ((int)0xFFFFFFFFFFFFFFFF)) ^c[I] );

This will take the last bit of popcount (1 == odd, 0 == even), multiply it by the const (all bits 1 if odd, all bits 0 if true) and than XOR the c[I] value (which is the same as 0-c[I] or ~(c[I]).

This will avoid instruction jumps in cases where the second absolute if statement isn't optimized.

P.S.

I used an 8 byte long value and truncated it's length by casting it to an int. This is because I have no idea how long an int might be on your system (it's 4 bytes on mine, which is 0xFFFFFFFF).

Myst
  • 18,516
  • 2
  • 45
  • 67