
Java has two ways of checking whether two booleans differ. You can compare them with !=, or with ^ (xor). Of course, these two operators produce the same result in all cases. Still, it makes sense for both of them to be included, as discussed, for example, in What's the difference between XOR and NOT-EQUAL-TO?. It even makes sense for developers to prefer one over the other depending on context - sometimes "is exactly one of these booleans true" reads better, and other times "are these two booleans different" communicates intent better. So, perhaps which one to use should be a matter of taste and style.
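Their equivalence on booleans is easy to check exhaustively; here is a tiny illustrative class (the class name is mine) that walks all four input combinations:

```java
// Exhaustively verify that ^ and != agree for booleans.
public class XorVsNotEqual {
    public static void main(String[] args) {
        for (boolean p : new boolean[] {false, true}) {
            for (boolean q : new boolean[] {false, true}) {
                boolean xor = p ^ q;   // "exactly one of these is true"
                boolean neq = p != q;  // "these two differ"
                if (xor != neq) {
                    throw new AssertionError(p + ", " + q);
                }
                System.out.println(p + " ^ " + q + "  ==  (" + p + " != " + q + ")  ==  " + xor);
            }
        }
    }
}
```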

What surprised me is that javac does not treat these identically! Consider this class:

class Test {
  public boolean xor(boolean p, boolean q) {
    return p ^ q;
  }
  public boolean inequal(boolean p, boolean q) {
    return p != q;
  }
}

Obviously, the two methods have the same visible behavior. But they have different bytecode:

$ javap -c Test
Compiled from "Test.java"
class Test {
  Test();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  public boolean xor(boolean, boolean);
    Code:
       0: iload_1
       1: iload_2
       2: ixor
       3: ireturn

  public boolean inequal(boolean, boolean);
    Code:
       0: iload_1
       1: iload_2
       2: if_icmpeq     9
       5: iconst_1
       6: goto          10
       9: iconst_0
      10: ireturn
}

If I had to guess, I'd say that xor performs better, since it just returns the result of its comparison; adding a jump and an extra load just seems like wasted work. But instead of guessing, I benchmarked a few billion calls to both methods using Clojure's "criterium" benchmarking tool. The results are close enough that, while xor looks a bit faster, I'm not good enough at statistics to say whether the difference is significant:

user=> (let [t (Test.)] (bench (.xor t true false)))
Evaluation count : 4681301040 in 60 samples of 78021684 calls.
             Execution time mean : 4.273428 ns
    Execution time std-deviation : 0.168423 ns
   Execution time lower quantile : 4.044192 ns ( 2.5%)
   Execution time upper quantile : 4.649796 ns (97.5%)
                   Overhead used : 8.723577 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 25.4745 % Variance is moderately inflated by outliers
user=> (let [t (Test.)] (bench (.inequal t true false)))
Evaluation count : 4570766220 in 60 samples of 76179437 calls.
             Execution time mean : 4.492847 ns
    Execution time std-deviation : 0.162946 ns
   Execution time lower quantile : 4.282077 ns ( 2.5%)
   Execution time upper quantile : 4.813433 ns (97.5%)
                   Overhead used : 8.723577 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 22.2554 % Variance is moderately inflated by outliers
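For anyone who wants to try this without a Clojure setup, here is a rough plain-Java sketch of the same comparison. It is naive compared to criterium (no statistics, no isolation); the warmup loop and the sink variable exist only to keep the JIT from eliding the work:

```java
// Naive timing sketch, illustrative only; criterium/JMH do this properly.
public class NaiveBench {
    static boolean xor(boolean p, boolean q)     { return p ^ q; }
    static boolean inequal(boolean p, boolean q) { return p != q; }

    public static void main(String[] args) {
        boolean sink = false;
        // Warm up so the JIT compiles both methods before we time them.
        for (int i = 0; i < 1_000_000; i++) {
            sink ^= xor(true, false);
            sink ^= inequal(true, false);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < 100_000_000; i++) sink ^= xor(true, false);
        long t1 = System.nanoTime();
        for (int i = 0; i < 100_000_000; i++) sink ^= inequal(true, false);
        long t2 = System.nanoTime();
        System.out.println("xor:     " + (t1 - t0) / 1e6 + " ms");
        System.out.println("inequal: " + (t2 - t1) / 1e6 + " ms");
        if (sink) System.out.print("");  // use sink so the loops are not dead code
    }
}
```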

Is there some reason to prefer writing one over the other, performance-wise¹? Some context in which the difference in their implementation makes one more suitable than the other? Or does anyone know why javac implements these two identical operations so differently?

¹ Of course, I will not recklessly use this information to micro-optimize. I'm just curious how this all works.

amalloy
  • Stick to your own original conclusion: "prefer one over the other depending on context" – Andreas Apr 09 '19 at 18:01
  • 3
    Introducing a test-and-branch is obviously going to have some effect on performance. How much depends on a variety of factors, not the least of which is the predictability of that branch. Plenty of prior art on this question; I'll shamelessly plug [my own answer](https://stackoverflow.com/questions/40991778/) as a starting point. I can't post an actual answer, because I'm unfamiliar with how Java bytecode is translated into machine code. Is there an optimizer situated in between? Probably yes. Either way, beware of premature micro-optimizations. Write code first to say what you mean. – Cody Gray - on strike Apr 09 '19 at 18:08
  • 2
    `p != q` suggest using a comparison instruction, while `p ^ q` suggest using `xor` instruction. That's what you see in bytecode. If it is further compiled to machine code in this natural way, then `p ^ q` would probably be somewhat faster if result is used as a number or stored to memory, but marginally slower if used as a branch condition. – zch Apr 09 '19 at 18:09
  • Which one is more readable? – Yassin Hajaj Apr 09 '19 at 18:12
  • != is more common and thus easier to read and understand for more programmers which is not irrelevant. – Joakim Danielson Apr 09 '19 at 18:13
  • 1
    Why would `p ^ q` be "marginally slower if used as a branch condition", @zch? – Cody Gray - on strike Apr 09 '19 at 18:13
  • @CodyGray, I thought about having `xor` instruction and then separate test of the result. Perhaps you are right that no production level compiler would be that dumb. – zch Apr 09 '19 at 18:17
  • 1
    @CodyGray Indeed the translation from bytecode is complicated and involves an optimizer. Often, bytecode is interpreted for a while, and only JIT-compiled to native code once it is determined to be a performance hotspot at runtime. The JIT optimizer can use runtime information to guide its optimization - I'm not an expert, but I imagine it may be able to use this to guide its branch prediction, for example. This is one reason it's important for JVM benchmarks to "warm up the JIT", as criterium does. – amalloy Apr 09 '19 at 18:21
  • 1
    @CodyGray, but if compiler uses `xor` and it's flags directly it can still damage optimization in some cases, as it mutates the register that holds `p` (or `q`). – zch Apr 09 '19 at 18:24
  • 1
    It's a complicated topic, @zch. Virtually all modern architectures set flags as a result of an arithmetic or bitwise operation, including XOR/EOR, so the branch can be done directly on the state of those flags, including "was last result zero?". If there's any optimizer in between (and amalloy confirms that there is), you're right, it certainly should not be that dumb. Regarding the clobber of one of the operands to XOR, some architectures support three-operand instructions, where the destination is kept separate. x86 doesn't, but a reg-reg copy (`mov`) is *very* cheap/free, due to renaming. – Cody Gray - on strike Apr 09 '19 at 18:31
  • I've edited the answer to include more details – Eugene May 03 '19 at 08:50

1 Answer


Well, I am going to show how the CPU translates this shortly and update the post, but in the meantime: you are looking at much too small a difference to care about.

Bytecode in Java is not an indication of how fast (or slowly) a method will execute; there are two JIT compilers that will make these methods look entirely different once they are hot enough. Also, javac is known to do very few optimizations when it compiles code; the real optimizations come from the JIT.

I've put up some tests using JMH, running with the defaults, with the C1 or C2 compiler only, with Graal as the JIT, or with no JIT at all (lots of testing code follows; you can skip it and just look at the results; this was done using jdk-12, by the way). The code uses JMH, the de facto tool for micro-benchmarks in the Java world (and micro-benchmarks are notoriously error-prone when done by hand).

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// BooleanExecutionPlan (not shown) is a JMH @State class supplying the two booleans.

@Warmup(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class BooleanCompare {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(BooleanCompare.class.getName())
            .build();

        new Runner(opt).run();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean xor(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean plain(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean xorNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean plainNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean xorC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean plainC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean xorC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean plainC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean xorGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean plainGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

}

And the results:

BooleanCompare.plain         avgt    2    3.125          ns/op
BooleanCompare.xor           avgt    2    2.976          ns/op

BooleanCompare.plainC1Only   avgt    2    3.400          ns/op
BooleanCompare.xorC1Only     avgt    2    3.379          ns/op

BooleanCompare.plainC2Only   avgt    2    2.583          ns/op
BooleanCompare.xorC2Only     avgt    2    2.685          ns/op

BooleanCompare.plainGraalVM  avgt    2    2.980          ns/op
BooleanCompare.xorGraalVM    avgt    2    3.868          ns/op

BooleanCompare.plainNoJIT    avgt    2  243.348          ns/op
BooleanCompare.xorNoJIT      avgt    2  201.342          ns/op

I am not fluent enough at reading assembler, though I sometimes like to try... Here are some interesting things. If we compile:

C1 compiler only with !=

/*
 * run many iterations of this with :
 *  java -XX:+UnlockDiagnosticVMOptions  
 *       -XX:TieredStopAtLevel=1  
 *       "-XX:CompileCommand=print,com/so/BooleanCompare.compare"  
 *       com.so.BooleanCompare
 */
public static boolean compare(boolean left, boolean right) {
    return left != right;
}

we get:

  0x000000010d1b2bc7: push   %rbp
  0x000000010d1b2bc8: sub    $0x30,%rsp  ;*iload_0 {reexecute=0 rethrow=0 return_oop=0}
                                         ; - com.so.BooleanCompare::compare@0 (line 22)

  0x000000010d1b2bcc: cmp    %edx,%esi
  0x000000010d1b2bce: mov    $0x0,%eax
  0x000000010d1b2bd3: je     0x000000010d1b2bde
  0x000000010d1b2bd9: mov    $0x1,%eax
  0x000000010d1b2bde: and    $0x1,%eax
  0x000000010d1b2be1: add    $0x30,%rsp
  0x000000010d1b2be5: pop    %rbp

To me, this code is fairly obvious: put 0 into eax; compare edx and esi; if they are not equal, put 1 into eax; then return eax & 1.
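Translated back into a Java sketch (register names reused as variable names; this is purely illustrative, not real JIT output), the C1 sequence does this:

```java
// What the C1-generated code for "left != right" computes, step by step.
class C1NotEqualSketch {
    static boolean compare(int esi, int edx) {
        int eax = 0;            // mov $0x0,%eax
        if (esi != edx) {       // cmp %edx,%esi / je <skip>
            eax = 1;            // mov $0x1,%eax
        }
        return (eax & 1) != 0;  // and $0x1,%eax ; ireturn
    }

    public static void main(String[] args) {
        System.out.println(compare(1, 0));  // true: the booleans differ
        System.out.println(compare(1, 1));  // false: they are equal
    }
}
```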

C1 compiler with ^:

public static boolean compare(boolean left, boolean right) {
     return left ^ right;
}



  # parm0:    rsi       = boolean
  # parm1:    rdx       = boolean
  #           [sp+0x40]  (sp of caller)
  0x000000011326e5c0: mov    %eax,-0x14000(%rsp)
  0x000000011326e5c7: push   %rbp
  0x000000011326e5c8: sub    $0x30,%rsp   ;*iload_0 {reexecute=0 rethrow=0 return_oop=0}
                                          ; - com.so.BooleanCompare::compare@0 (line 22)

  0x000000011326e5cc: xor    %rdx,%rsi
  0x000000011326e5cf: and    $0x1,%esi
  0x000000011326e5d2: mov    %rsi,%rax
  0x000000011326e5d5: add    $0x30,%rsp
  0x000000011326e5d9: pop    %rbp

I don't really know why `and $0x1,%esi` is needed here (presumably it masks the result down to its low bit so the return value is a canonical 0/1 boolean); otherwise this is fairly simple too, I guess.
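The C1 xor sequence maps back to Java roughly like this (register names reused as variable names; illustrative only):

```java
// What the C1-generated code for "left ^ right" computes.
class C1XorSketch {
    static boolean compare(int rsi, int rdx) {
        rsi = rsi ^ rdx;   // xor %rdx,%rsi
        rsi = rsi & 1;     // and $0x1,%esi
        return rsi != 0;   // mov %rsi,%rax ; ireturn
    }

    public static void main(String[] args) {
        System.out.println(compare(1, 0));  // true
        System.out.println(compare(1, 1));  // false
    }
}
```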

But if I enable C2 compiler, things are a lot more interesting.

/**
 * run with java
 * -XX:+UnlockDiagnosticVMOptions
 * -XX:CICompilerCount=2
 * -XX:-TieredCompilation
 * "-XX:CompileCommand=print,com/so/BooleanCompare.compare"
 * com.so.BooleanCompare
 */
public static boolean compare(boolean left, boolean right) {
    return left != right;
}



  # parm0:    rsi       = boolean
  # parm1:    rdx       = boolean
  #           [sp+0x20]  (sp of caller)
  0x000000011a2bbfa0: sub    $0x18,%rsp
  0x000000011a2bbfa7: mov    %rbp,0x10(%rsp)                

  0x000000011a2bbfac: xor    %r10d,%r10d
  0x000000011a2bbfaf: mov    $0x1,%eax
  0x000000011a2bbfb4: cmp    %edx,%esi
  0x000000011a2bbfb6: cmove  %r10d,%eax                     

  0x000000011a2bbfba: add    $0x10,%rsp
  0x000000011a2bbfbe: pop    %rbp

I don't even see the classic prologue push %rbp; mov %rsp,%rbp; sub $x,%rsp; instead there is something quite unusual (at least to me):

 sub    $0x18,%rsp
 mov    %rbp,0x10(%rsp)

 ....
 add    $0x10,%rsp
 pop    %rbp

Again, hopefully someone more fluent than me can explain. Otherwise, it's like a better version of what C1 generated:

xor    %r10d,%r10d // put zero into r10d
mov    $0x1,%eax   // put 1 into eax
cmp    %edx,%esi   // compare edx and esi
cmove  %r10d,%eax  // conditionally move the contents of r10d into eax

AFAIK, cmp/cmove is better than cmp/je because it avoids a conditional branch entirely, so there is no branch prediction to get wrong; at least, that's what I've read...
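A cmov is a branch-free select. In Java terms, the C2 sequence computes something like the following (names mirror the registers; the JIT may of course compile a ternary however it likes, so this sketch only shows the data flow, not the machine code it would produce):

```java
// What cmp/cmove computes: a conditional select with no jump in the data path.
class C2CmoveSketch {
    static boolean compare(int esi, int edx) {
        int r10d = 0;                     // xor %r10d,%r10d
        int eax = 1;                      // mov $0x1,%eax
        eax = (esi == edx) ? r10d : eax;  // cmp %edx,%esi / cmove %r10d,%eax
        return eax != 0;
    }

    public static void main(String[] args) {
        System.out.println(compare(1, 0));  // true
        System.out.println(compare(0, 0));  // false
    }
}
```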

XOR with C2 compiler:

public static boolean compare(boolean left, boolean right) {
    return left ^ right;
}



  0x000000010e6c9a20: sub    $0x18,%rsp
  0x000000010e6c9a27: mov    %rbp,0x10(%rsp)                

  0x000000010e6c9a2c: xor    %edx,%esi
  0x000000010e6c9a2e: mov    %esi,%eax
  0x000000010e6c9a30: and    $0x1,%eax
  0x000000010e6c9a33: add    $0x10,%rsp
  0x000000010e6c9a37: pop    %rbp

It sure looks like it's almost the same as what the C1 compiler generated.

Eugene
  • 1
    Your broader point is true enough---the difference will be tiny. But you need to be *very* careful with trying to prove this using benchmark numbers. Micro-benchmarking is *notoriously* difficult because all sorts of confounding factors affect your measurements, including things done by the CPU itself, such as branch prediction, caching, etc. Beyond that, the measurement tool itself is also liable to affect the results. Not to mention the very distinct possibility that, with a good enough optimizer, *all* of the code could be elided, meaning that you're testing essentially nothing. – Cody Gray - on strike Apr 09 '19 at 22:19
  • @CodyGray I've edited the answer... if you have the time to explain some on the things I don't understand... thx! – Eugene May 03 '19 at 08:50