
The sorting algorithm of this question becomes twice as fast(!) if -fprofile-arcs is enabled in gcc (4.7.2). Here is the heavily simplified C code of that question (it turned out that I can initialize the array with all zeros; the weird performance behavior remains, but it makes the reasoning much simpler):

#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
  int a[ELEMENTS] = { 0 };
  clock_t start = clock();
  /* selection sort: find the minimum of a[i..] and swap it into place */
  for (int i = 0; i < ELEMENTS; ++i) {
    int lowerElementIndex = i;
    for (int j = i+1; j < ELEMENTS; ++j) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    int tmp = a[i];
    a[i] = a[lowerElementIndex];
    a[lowerElementIndex] = tmp;
  } 
  clock_t end = clock();
  float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
  printf("Time: %2.3f\n", timeExec);
  printf("ignore this line %d\n", a[ELEMENTS-1]);
}

After playing with the optimization flags for a long while, it turned out that -ftree-vectorize alone also yields this weird behavior, so we can take -fprofile-arcs out of the question. After profiling with perf, I found that the only relevant difference is:

Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)

    cmpl    %ecx, %esi
    cmovl   %edx, %edi
    cmovl   %esi, %ecx

As for the first snippet: given that the array only contains zeros, we always jump to .L3, so the branch can greatly benefit from branch prediction.

I guess the cmovl instructions cannot benefit from branch prediction.
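
For illustration, here is the same inner-loop update written both ways in plain C++ (a sketch of mine, not from the compiler output above; whether the ternary form actually compiles to a cmovl depends on the compiler and flags):

    // Branchy form: the update is guarded by a conditional jump that the
    // branch predictor can learn (trivially so for an all-zero array).
    int minIndexBranchy(const int* a, int n) {
      int lo = 0;
      for (int j = 1; j < n; ++j)
        if (a[j] < a[lo])
          lo = j;
      return lo;
    }

    // Branchless form: the ternary is a natural candidate for if-conversion,
    // so gcc may emit a cmovl here -- there is no branch to predict, but
    // every iteration then depends on the load of a[lo] completing.
    int minIndexBranchless(const int* a, int n) {
      int lo = 0;
      for (int j = 1; j < n; ++j)
        lo = (a[j] < a[lo]) ? j : lo;
      return lo;
    }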


Questions:

  1. Are all my guesses above correct? Is this what makes the algorithm slow?

  2. If yes, how can I prevent gcc from emitting these instructions (other than the trivial -fno-tree-vectorize workaround, of course) while still doing as much optimization as possible?

  3. What does -ftree-vectorize actually do? The documentation is quite vague; I need a little more explanation to understand what's happening.


Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.

Since it also came up: I have a Core i5 CPU.


Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I kept only the inner loop. Its only goal is to examine the effect of branch prediction: we skip the if branch in the for loop with probability p.

#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8; 
constexpr double p = 0.50;

int main() {
  printf("p = %.2f\n", p);
  int* a = new int[ELEMENTS];
  mt19937 mt(1759);
  bernoulli_distribution rnd(p);
  for (int i = 0 ; i < ELEMENTS; ++i){
    a[i] = rnd(mt)? i : -i;
  }
  auto start = high_resolution_clock::now();
  int lowerElementIndex = 0;
  for (int i=0; i<ELEMENTS; ++i) {
    if (a[i] < a[lowerElementIndex]) {
      lowerElementIndex = i;
    }
  } 
  auto finish = high_resolution_clock::now();
  printf("%ld  ms\n", duration_cast<milliseconds>(finish-start).count());
  printf("Ignore this line   %d\n", a[lowerElementIndex]);
  delete[] a;
}

The loops of interest:

This will be referred to as cmov

g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L30:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    movslq  %eax, %rdx
    cmovl   %rdx, %rbp
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L30

This will be referred to as no cmov; the -fno-if-conversion flag was pointed out by Turix in his answer.

g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L29:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    jge .L28
    movslq  %eax, %rbp
.L28:
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L29

The difference side by side

cmpl    %edx, (%rbx,%rax,4) |     cmpl  %edx, (%rbx,%rax,4)
movslq  %eax, %rdx          |     jge   .L28
cmovl   %rdx, %rbp          |     movslq    %eax, %rbp
                            | .L28:

The execution time as a function of the Bernoulli parameter p

[graph: execution time vs. p, showing the effect of branch prediction]

The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p < 0.26 or 0.81 < p, and it is at most 4.38x faster (at p = 1). Of course, the worst situation for the branch predictor is around p = 0.5, where the code is 1.58x slower than the code with the cmov instruction.
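
A rough sanity check on the shape of that curve (a back-of-the-envelope model of mine, not derived from the measurements): for i.i.d. Bernoulli(p) outcomes, even an ideal predictor that always guesses the more likely direction mispredicts a fraction min(p, 1-p) of the branches, so the no cmov version should cost roughly T(p) ≈ T0 + N · min(p, 1-p) · C, where N is the number of iterations and C is the misprediction penalty. That expression is symmetric around p = 0.5 and maximal there, which matches the measured shape reasonably well.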

  • Do you get the same results with real (random) data instead of all zeroes ? – Paul R Jan 10 '14 at 22:58
  • Just a guess, but I think the tree part of tree-vectorize means that it's working on the "tree" form of the code (gcc converts your code to a bunch of different forms after parsing it, and tree is one of those forms iirc). Vectorize of course just refers to using vector instructions (SSE, MMX, 3D-Now, etc). – CrazyCasta Jan 10 '14 at 22:59
  • @PaulR [Yes, I do.](http://stackoverflow.com/a/21052640/341970) Others reported the same. – Ali Jan 10 '14 at 22:59
  • Here's what I'm talking about btw: http://gcc.gnu.org/onlinedocs/gccint/Tree-SSA-passes.html#Tree-SSA-passes – CrazyCasta Jan 10 '14 at 23:05
  • @CrazyCasta OK, I see, thanks. Unfortunately I don't see this `cmov` instruction mentioned there. – Ali Jan 10 '14 at 23:07
  • cmov is an assembly instruction. Here's a page about cmov: http://www.rcollins.org/p6/opcodes/CMOV.html (after looking at it, I don't see what it has to do w/ vectorization, but then I don't see how vectorization is being used in the code gcc generates either). – CrazyCasta Jan 10 '14 at 23:09
  • @CrazyCasta :) And do you agree that such an instruction cannot benefit from branch prediction? – Ali Jan 10 '14 at 23:11
  • I'm not sure what effect it would have on branch prediction. I think the processor just assumes the mov will occur and wastes a cycle if it doesn't. It's not branching at all. – CrazyCasta Jan 10 '14 at 23:14
  • @CrazyCasta Yes, that's what I suspect too. It means that it cannot benefit from branch prediction. – Ali Jan 10 '14 at 23:16
  • While I can't answer all of your questions, I can say that the conditional move (`cmovl`) instruction uses a form of branch *predication*, not branch *prediction*. The point is that with the `cmovl` there will not be a branch. Instead, the instruction will be sent down the pipeline before the result of the previous `cmpl` is known and the comparison result will then be "forwarded" to enable/prevent the move prior to its writeback. So it's a different form of optimization than branch *prediction*. – Turix Jan 10 '14 at 23:16
  • @Turix And if we never take that branch, all those efforts are wasted... Pretty much what I suspected. – Ali Jan 10 '14 at 23:18
  • @Ali right, so basically because your array is all 0, every time through the loop in the slower "optimized" code, there will be two "wasted" instructions in the pipeline (the two `cmovl` instructions). This might be enough to account for the entire difference in runtimes between the two. (I should emphasize that this big difference is primarily because the array is all 0. In a "real" array, this cannot be the whole story...) – Turix Jan 10 '14 at 23:38
  • @Ali Yeah, that's what I meant about not being the whole story; in other words, not the whole explanation for the weird timings you're seeing. At this point, I can only *speculate* that there is some issue with the `mov` instruction that reads the array and the branch predication, such that the slow code is apparently having to wait on the result of the lookup every time through, whereas the fast code is only having to wait on the result of the lookup when the branch prediction is wrong. – Turix Jan 11 '14 at 00:06
  • @Ali (Also, note that things will vary by processor. For example, your Celeron is notorious for having cache issues, which might be in play with a very large array.) – Turix Jan 11 '14 at 00:06
  • @Ali OK. Although I just realized that the cache thing I said a minute ago is a red herring as the caching behaviour will be exactly the same for both snippets of assembly. (i.e., It will still affect overall performance of your algorithm in general, but is not related to the mystery of the timing differences.) – Turix Jan 11 '14 at 00:14
  • @Turix OK, please check the updated question. It's night here so I am going to bed now. – Ali Jan 11 '14 at 00:43
  • @PaulR I have updated the question with profiling information taken on random data. The sensitivity to the `-ftree-vectorize` flag still remains. – Ali Jan 11 '14 at 00:45
  • Completely random and uniform elements in a selection sort do not cause many branch prediction failures. The first element is `k_0`, uniform from 0 to N. Each element after that has a `k_0/N` chance of being smaller, causing a branch. That element is then uniformly selected from 0 to `k_0-1`. Imagine if it always picked the middle one: then after a logarithmic number of steps, you'd find the smallest one left. On average, that is the performance you get. On a `100000` array, `lg` is `16`, meaning the branch starts out going one way 6000 times more than the other. – Yakk - Adam Nevraumont Jan 11 '14 at 01:27
  • In effect, your inefficient sort means that each comparison has little entropy, so it is highly predictable, so branch prediction is nearly perfect. Now, we can arrange for pathological data. We want each comparison to have a 50-50 chance of being bigger or smaller than the current lowest. Set `array[i] = ` +/- `i`, where the sign is picked randomly, and reprofile (alternating might work, but a branch predictor that handles alternating would be easy to build in hardware, so random will be evil enough). – Yakk - Adam Nevraumont Jan 11 '14 at 01:31
  • Here is a live example http://ideone.com/UDyl2N of the slow/fast/medium difference from making things maximally unpredictable. Random numbers (medium) is as good as all zeros (fast). Pathological (slow) is much slower, as each branch has a 50-50 chance of succeeding. – Yakk - Adam Nevraumont Jan 11 '14 at 01:44
  • @Yakk Thanks for the very valuable comments! I went a step further: I both counted how many times each branch is taken and also printed to the console which branch was taken, to look for patterns. Flipping the sign with `p=0.5` favors one branch more (ca. 20% of all iterations take the if branch), and no matter how you set `p` there is a pattern: in the early phase one branch is picked, at the end the other branch is favored, and there is a gray area in the middle of the execution where neither is preferred. Setting `p=0.01` seems a reasonable compromise. See my code here http://ideone.com/9iwcMX – Ali Jan 11 '14 at 16:41
  • @Yakk Now the thing is: I don't see any pathological slow down. True, it is *slightly* slower on my machine but `-ftree-vectorize` has a much bigger effect on the speed; passing `-fno-tree-vectorize` still wins. Of course, it can happen that the gray area, where neither branch is preferred, is not large enough and branch prediction still wins. Do you really see that much slowdown on your machine with the "pathological" case? – Ali Jan 11 '14 at 16:44
  • @Yakk Ah, OK, I see what you meant. But you have forgotten about the swaps in the sorting algorithm which still fixes your evil dataset to a large extent... It seems to me, for selection sort it is very hard to create a dataset that yields lots of branch mispredictions. Please also see my previous comments. OK, I will update the question accordingly. – Ali Jan 11 '14 at 18:31
  • @Yakk I have updated the question based on your comments. Please read and please consider writing up your comments in an answer! – Ali Jan 11 '14 at 23:40

1 Answer


Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.

(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)

First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)

Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:

if (a[j] < a[lowerElementIndex])
    lowerElementIndex = j;

Apparently, this happens even if there is no vectorization.
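
In source terms, the flattening looks roughly like this (my paraphrase of the transformation, not actual compiler output):

    // Before if-conversion: a guarded assignment, i.e. a branch.
    if (a[j] < a[lowerElementIndex])
        lowerElementIndex = j;

    // After if-conversion (conceptually): the condition is always evaluated
    // and the assignment becomes an unconditional select -- this is what
    // the cmovl instructions implement.
    bool smaller = a[j] < a[lowerElementIndex];
    lowerElementIndex = smaller ? j : lowerElementIndex;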

This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).

Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.

However, in your case the attempt at optimization actually backfires because in the flattened loop the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW (read-after-write) data hazard that forces the second move (cmovl %esi, %ecx) to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)

On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be more than half the time [edited per @Yakk's comment]), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions you get a boost, whereas in incorrect predictions the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the two "wasted" cmovl instructions in the pipeline.

[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.

[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).
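
One more idea I have not tested (an assumption on my part, not something the GCC documentation promises for this particular flag): the optimize function attribute can apply per-function flags, which would at least avoid having to move the loop into a separate source file:

    // Hypothetical per-function workaround -- whether "no-if-conversion"
    // is honored when passed through the optimize attribute is an
    // assumption; check the generated assembly before relying on it.
    __attribute__((optimize("no-if-conversion")))
    int findMinIndex(const int* a, int n) {
      int lo = 0;
      for (int j = 1; j < n; ++j)
        if (a[j] < a[lo])
          lo = j;
      return lo;
    }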

  • Thanks! Now one more question: What makes you think I have a Celeron? I have a Core i5. How does that change your answer? – Ali Jan 11 '14 at 01:07
  • @Ali sorry for the bad assumption. I was going from this bit in the [other question](http://stackoverflow.com/q/21050130/341970): "I forgot to mention that I'm on an old PC with a Intel Celeron at 2.8GHz processor and linux (fedora 20 with xfce) installed." which I mistakenly thought was yours, but I now see you just edited. Sorry! And while I'm not sure, I think this *should* change my answer, as I believe your i5 might support vectorization. (I am going to research that now and will update my answer.) – Turix Jan 11 '14 at 01:34
  • No problem at all. Thanks for your efforts, I greatly appreciate it. I have tried and `-fno-if-conversion` helps but only a little (7.4s reduced to 6.4s) but `-fno-tree-vectorize` still wins (4.7s). Tested on an evil data set that does its best to break the branch prediction. Now, please check my answer to Yakk's comments. – Ali Jan 11 '14 at 17:12
  • @Ali Interesting. I'd be interested in seeing the rest of the assembly code for the s_sort function (i.e., the "outer loop"), and the full profile of the function. I'm wondering if the attempt to vectorize the "inner loop" screws up the obvious optimization of the outer loop. (I can imagine a couple ways this might happen, including for example requiring another unnecessary lookup of `a[i]`.) – Turix Jan 11 '14 at 19:05
  • @Ali also, I'm assuming you looked at the assembly code in the `-fno-if-conversion` case and it seemed sane. I was assuming it would be nearly identical to the `-fno-tree-vectorize` case, so I'm also interested in what the main differences were that remained from the attempt to vectorize. – Turix Jan 11 '14 at 19:07
  • Please give me some time, I will update the question with new results. Be careful with editing your answer though: you are 3 edits away from making it community wiki and thus losing the reputation points for this valuable answer! – Ali Jan 11 '14 at 20:05
  • I've updated the question with a lot of new data. Unfortunately I *had to* remove some of the old stuff because the post was incredibly long and cluttered with it. This, unfortunately, means that some parts of your answer also became obsolete: **I am very sorry for this.** The `-fno-if-conversion` flag does the trick, no `cmov` instructions! Since you only have 3 more edits of your post, you might consider marking it on the top "Original answer before the question updates" and submit a new answer which considers the new infos. But be more careful with the number of edits this time! – Ali Jan 11 '14 at 23:53
  • @Ali cool graph! sorry I got waylaid by work for a week. I take it the graph in combination with the characteristics of the (original) selection sort algorithm sufficiently answer all of your questions at this point, right? If so, I'll just leave my answer alone. But if there's still more to explain/understand, I'm happy to keep digging into this. (It was a fun mystery!) Cheers! – Turix Jan 19 '14 at 03:49
  • Yes, I pretty much got my answer with one exception: It seems to me that the only way to avoid generating the `cmov` instruction is to move the loop in question into a separate source file and compile only that particular file with the `-fno-if-conversion` flag. (Or do inline assembly which is probably even worse.) Well, if you have any better ideas than this, then please update your answer. Otherwise, just leave it the way it is. I will still leave this question open for a while to generate traffic (and upvotes for the both of us ;) ). – Ali Jan 19 '14 at 12:14