
I have a 2D array of chars that I need to do some operations on. In some cases, I need to check whether a character is in the range 'a' to 'h'. I used to accomplish this by checking that the character was not equal to any of the other characters (there are only 5 other characters). However, I recently had the idea that I could instead check whether the character was < 'j' to get the same result with, hopefully, fewer assembly instructions.

In some places where I used it, this did result in a small speed-up, but in others it caused a rather large slowdown. Any ideas why? What is the relative expense of != as opposed to < in if statements?

Here is an example code snippet:

if( arr[r][c] == arr[r][c+1] && arr[r][c] == arr[r][c+2]
         && arr[r][c] != 'q' && arr[r][c] != 'r' && arr[r][c] != 's' && arr[r][c] != 't')

vs

if( arr[r][c] == arr[r][c+1] && arr[r][c] == arr[r][c+2]
         && arr[r][c] < 'j')
  • Can you show us the results of your performance tests? – David G Aug 01 '13 at 01:44
  • It wasn't a huge time difference, but it was consistent. Replacing it at one point decreases the time by ~0.02 seconds, while at another point it increases the time by ~0.07 seconds – user2503981 Aug 01 '13 at 01:47
  • This is a bit of a guess, but != might be faster because it only needs to compare enough bits to find two that don't match, whereas < probably needs to look at all of the bits. Not sure if the logic circuits are actually set up to take advantage of that, though. – seaotternerd Aug 01 '13 at 02:00
  • Are you compiling with optimizations? – aaronman Aug 01 '13 at 02:02
  • Can you provide an [SSCCE](http://sscce.org/)? – Mysticial Aug 01 '13 at 02:27
  • Well, the two expressions are not equivalent -- they will pass different patterns through. What happens in the body of the if and else legs presumably would affect overall cost. – Hot Licks Aug 01 '13 at 03:21
  • @seaotternerd - Not a chance! – Hot Licks Aug 01 '13 at 03:23
  • @user2503981 if you can provide an SSCCE as Mysticial says, we might be able to get faster runtimes for you ;) – Eiyrioü von Kauyf Aug 01 '13 at 04:16
  • The reason I provided a snippet instead of an SSCCE is that it is part of a much larger program. I can tell you that the array is static and 12x6, and that there are two functions in particular that make use of the code snippet I provided; they are called 3.7 million and 180 thousand times respectively over a single run of the program. I am compiling with VS2012 with /O2 optimization. Interestingly enough, the one that is called 180k times improves using the < operator, and the one called 3.7 million times slows down. – user2503981 Aug 01 '13 at 06:35
  • It would just help us to fiddle around with it :). Also, ignore undefined; he's throwing words around. Your question is a good one, as shown by the votes – Eiyrioü von Kauyf Aug 01 '13 at 06:43
  • You might want to look at this [branch prediction explanation](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array/11227902#11227902) – mewa Aug 01 '13 at 09:08
  • While an SSCCE would not be the same as your full program, that is not the point. The point is to reproduce the problem (different, unexpected execution speed) in a simpler context, so it can be reasoned about more easily. Making the problem simpler *may fail*: maybe all the details matter. But maybe not. And you are not even trying to make a simple reproducible example. – Yakk - Adam Nevraumont Aug 01 '13 at 12:04
  • I don't know if it's optimized out or not, but if a value is used many times inside a loop, you should store it in a temporary variable to prevent the compiler from re-reading it many times (see the sketch below) – phuclv Sep 28 '13 at 01:17
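
For illustration, the hoisting phuclv describes might look like this (a sketch against the question's arr, r, and c; whether it helps depends on whether the compiler already performs the transformation at /O2):

char ch = arr[r][c];   /* read the repeated operand once into a local */
if (ch == arr[r][c+1] && ch == arr[r][c+2] && ch < 'j') {
    /* ... */
}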

3 Answers


If I understand your question correctly, you wish to check whether consecutive elements in a row of the array are identical and lie between the characters 'a' and 'h', and you want to optimize this check.

Should you happen to know some assembly language, I strongly recommend using a disassembler to find out exactly what occurs in your function during execution. All compilers and optimization levels are slightly different. However, a bare minimum of operations for a comparison of two values in memory would consist of:

• loading the two variables from memory into processor registers (several clock cycles)

• performing an equality test on the values in the two registers (1 clock cycle)

• executing a jump based on the flags register (on Intel processors) (another clock cycle)

Now this is about as simple an operation as you can get for a processor, but since you have stacked comparison operations, the time required for these checks accumulates (particularly the clock cycles needed for memory access).

Therefore, to reduce the time needed for these comparisons, one needs to reduce the number of comparisons. Remember that the characters 'a' through 'h' have ASCII values between 0x61 and 0x68 (decimal 97 to 104). You can ascertain whether a character is between 'a' and 'h' with just two comparison operations:

if (arr[r][c] >= 97 && arr[r][c] <= 104)

Check only one value and use this bit-twiddling trick to determine whether all of the elements are the same:

if (((arr[r][c] ^ arr[r][c+1]) + (arr[r][c] ^ arr[r][c+2]) + /* ...etc... */) == 0)

The "xor"('^') comparison takes a single clock cycle, as does addition, and if there are any dissimilarities between any two column entities, the operation will result in a nonzero result. This method should increase in linear time with number of column elements, and as an added bonus an optimizing compiler might be able to keep 'arr[r][c]' in one of the registers during the operation.

  • Have you actually measured those premature optimisations with full optimisations enabled during compilation? – autistic Aug 01 '13 at 04:05
  • No, and if the questioner and myself were to use different compilers or different optimization levels, it may have been a moot point to do so. However, in my own experience, I have found that excessive variable comparisons cause an execution delay that becomes nontrivial even for moderately-sized inputs (consider a string-search algorithm). I simply offered a means by which these comparisons could be minimized for the task at hand. As for portability, ASCII or EBCDIC encoding was not specified in the question, and I really cannot imagine any processor without XOR and addition operations. – J. Alfred Prufrock Aug 01 '13 at 05:18
  • Using if( !((mboard[r][c] ^ mboard[r][c+1]) + (mboard[r][c] ^ mboard[r][c+2])) && mboard[r][c]<'j') I got the same speed as my second example in the main question, which ran slower than the first example. – user2503981 Aug 01 '13 at 06:59
  • @J.AlfredPrufrock Your suggestion might make sense if it's to remove redundant operations. However, it seems like you're recommending the introduction of more comparisons or using different operations. Given that most compilers are intelligent enough to remove unused variables and code, don't you think it could replace comparisons with xor operations quite easily? Is removing the lazy evaluation of `&&` really an optimisation? – autistic Aug 01 '13 at 07:40
  • @user2503981 - I have dissassembled compilations the two `if` conditions that you had previously posted using gcc 4.4.3 (i486 target, default optimization), and have found that it takes about 7-8 processor instructions to clear the registers, move memory variables to registers, and effect a comparison. Each != comparison adds to the total number of instructions needed, and is much longer than a single < comparison. My only guess is that the particular syntax you used affects something else in the program, or that somehow the compiler bypasses making all of the != comparisons sequentially. – J. Alfred Prufrock Aug 01 '13 at 17:32
  • In the end, the way I got it running faster was using Profile-Guided Optimizations. From what I know it allows the compiler to choose the branches to predict based on actual usage rather than just a guess. – user2503981 Aug 01 '13 at 18:37

Modern compilers/CPUs use branch prediction, pre-fetching candidate outcomes and favoring some execution paths over others. Your two compilations predicted differently and thus performed differently. The results likely depend on the 2D array's contents, and the advantage may differ across compilers/CPUs. Search on branch prediction - there are some great answers out there.
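
As a rough illustration, consider the hypothetical harness below (not the questioner's program; the array size, repetition count, and character mix are arbitrary choices). The very same `data[i] < 'j'` branch typically runs much faster when the data is sorted first, because the branch becomes predictable:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static char data[N];

int main(void)
{
    long hits = 0;
    srand(42);
    for (int i = 0; i < N; i++)
        data[i] = "abcdefghqrst"[rand() % 12];  /* hard-to-predict mix */
    /* Sorting the array first (e.g. with qsort) makes the branch
       below highly predictable and typically speeds up the loop. */
    clock_t t0 = clock();
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < N; i++)
            if (data[i] < 'j')  /* the branch under test */
                hits++;
    printf("hits=%ld, %.3f s\n", hits,
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}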

chux - Reinstate Monica
  • Indeed! He might want to try with binary logic operators such as (test) & (test) & (test) instead of using the short-circuiting && operator. Don't forget the extra parentheses. That would run all the comparisons but do no branching. It might be considerably faster. – Zan Lynx Aug 01 '13 at 05:07
  • Not "AND" but "XOR": the second `if` suggestion involves only a single comparison operation; the remaining 'comparisons' are mathematical. The result of XOR between any two identical numbers is 0. Thus if all variables are identical, all XOR operations will result in 0, and thus the sum of all results will be 0. – J. Alfred Prufrock Aug 01 '13 at 05:26

Don't focus so much on speed. Write a program that solves an actual, meaningful task first. Once that's done, use a profiler to determine which parts of that program are the most significant bottlenecks. Until you have a program that solves your actual, meaningful task, focus on writing portable, well-defined code rather than code that is fast.

Your notion of speed is not in the C standard. In fact, there are no guarantees with regards to speed, here. There are fast compilers and slow compilers, and even fast and slow C interpreters. As a result, your question with regards to speed is invalid. If your C compiler doesn't produce roughly identical code (in terms of speed) in this case, then either learn how to enable full optimisation or get a new C compiler.

This doesn't look portable:

if( arr[r][c] == arr[r][c+1] && arr[r][c] == arr[r][c+2]
     && arr[r][c] < 'j')

On systems where EBCDIC is used, 'j' - 'i', which you assume to be one, is in fact 145 - 137 (eight), so your test would also pass the seven non-alphabetic code points that sit between 'i' and 'j'. I suggest using strchr("abcdefghi", arr[r][c]) until you're concerned about performance. If you're concerned about the speed of this (which you shouldn't be, since it's a tiny task in anything that solves an actual problem), you could try converting it to a jump table by using a switch:

if (arr[r][c] == arr[r][c+1] && arr[r][c] == arr[r][c+2]) {
    switch (arr[r][c]) {
        case 'a': case 'b': case 'c':
        case 'd': case 'e': case 'f':
        case 'g': case 'h': case 'i':
        /* XXX: Insert code that runs when arr[r][c] is in "abcdefghi"... */
        break;
    }
}

To measure this optimisation, you could use a profiler as suggested in the first paragraph.
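
For reference, the strchr test suggested above might be wrapped like this (a sketch; the explicit check for '\0' matters because strchr would otherwise also "find" the string's terminating null):

#include <string.h>

/* Portable membership test: nonzero when ch is one of "abcdefghi",
   regardless of the character encoding in use. */
static int is_a_through_i(char ch)
{
    return ch != '\0' && strchr("abcdefghi", ch) != NULL;
}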

autistic
  • In general compilers can get similar runtimes for many things; a discussion of speed is not trivial, e.g. a recursive Fibonacci versus an iterative Fibonacci. Please don't give @user2503981 the wrong idea. – Eiyrioü von Kauyf Aug 01 '13 at 04:15
  • @EiyrioüvonKauyf A discussion of speed for a program that doesn't solve an actual problem *is* trivial. Which problem do you hope to solve with your fibonacci? Perhaps there are more significant bottlenecks to focus on. That's why profilers exist, isn't it? I have added a paragraph to the start of this answer to address this. Is that what you were trying to say? – autistic Aug 01 '13 at 04:31
  • this: "Your notion of speed is not in the C standard. In fact, there are no guarantees with regards to speed, here. There are fast compilers and slow compilers, and even fast and slow C interpreters. As a result, your question with regards to speed is invalid. If your C compiler doesn't produce identical code in this case, then either learn how to enable full optimisation or get a new C compiler." is plain _WRONG_ and full optimizations is NOT the answer. i shudder to think what code you write. And whether it solves a problem or not does not matter - you're discounting profiling entirely – Eiyrioü von Kauyf Aug 01 '13 at 04:33
  • @EiyrioüvonKauyf Please show me which section of the C standard defines speed. Prove what you say with citations. You may also wish to review my edit to that section. – autistic Aug 01 '13 at 04:36
  • continued in chat >> Lounge – Eiyrioü von Kauyf Aug 01 '13 at 04:50
  • @EiyrioüvonKauyf C++ is irrelevant to C, by the way. – autistic Aug 01 '13 at 06:30
  • As I mentioned in a different comment, this snippet is part of a large program. It does use ASCII, and these comparisons do take place in the bottleneck function (assuming 50% of runtime is significant). – user2503981 Aug 01 '13 at 06:39
  • @EiyrioüvonKauyf Then why would you want to continue this discussion there? Are you trying to teach that C++ is a superset of C? Are you trying to teach that an actual implementation of C can't apply optimisations that the C standard says can be applied? I suggest looking up the term "actual implementation" in the standard. – autistic Aug 01 '13 at 06:41
  • @user2503981 Is it a significant bottleneck? I think not. Perhaps you might need to look at more suitable algorithms, data structures, or cache locality... – autistic Aug 01 '13 at 06:42
  • I was more interested in learning why using one < was more expensive than using 4 !=, especially since it was only more expensive in one location in my program. – user2503981 Aug 01 '13 at 07:07
  • @user2503981 Probably something implementation-defined, such as two's complement integer representation. In addition to removing the signed integer representation details, I suggest removing some of the array indexes: `unsigned char *foo = arr[r] + c; if (foo[0] == foo[1] && foo[0] == foo[2] && foo[0] < 'j') { /* ... */ }`. The removal of redundant array indexes isn't necessarily an optimisation, but it certainly improves the aesthetics of the code. – autistic Aug 01 '13 at 07:25