Optimizing 128bit Sequence bitwise operations in Java

Question

In order to speed up my Java Code for a problem, I have been working specifically on a Class that does bit-wise operations with 128Bits by manipulating two longs (See implementation). I also actually only need this datastructure for 100 Bits, but i figured there was no better way to implement this.

public class BitBoard {

//Bit-Masks for all N-Bits from the RIGHT
public final static long[] GET_N_BITS_FROM_RIGHT = {0x0000000000000000L, 0x0000000000000001L, 0x0000000000000003L, 0x0000000000000007L, 0x000000000000000fL, 0x000000000000001fL, 0x000000000000003fL, 0x000000000000007fL, 0x00000000000000ffL, 0x00000000000001ffL, 0x00000000000003ffL, 0x00000000000007ffL, 0x0000000000000fffL, 0x0000000000001fffL, 0x0000000000003fffL, 0x0000000000007fffL, 0x000000000000ffffL, 0x000000000001ffffL, 0x000000000003ffffL, 0x000000000007ffffL, 0x00000000000fffffL, 0x00000000001fffffL, 0x00000000003fffffL, 0x00000000007fffffL, 0x0000000000ffffffL, 0x0000000001ffffffL, 0x0000000003ffffffL, 0x0000000007ffffffL, 0x000000000fffffffL, 0x000000001fffffffL, 0x000000003fffffffL, 0x000000007fffffffL, 0x00000000ffffffffL, 0x00000001ffffffffL, 0x00000003ffffffffL, 0x00000007ffffffffL, 0x0000000fffffffffL, 0x0000001fffffffffL, 0x0000003fffffffffL, 0x0000007fffffffffL, 0x000000ffffffffffL, 0x000001ffffffffffL, 0x000003ffffffffffL, 0x000007ffffffffffL, 0x00000fffffffffffL, 0x00001fffffffffffL, 0x00003fffffffffffL, 0x00007fffffffffffL, 0x0000ffffffffffffL, 0x0001ffffffffffffL, 0x0003ffffffffffffL, 0x0007ffffffffffffL, 0x000fffffffffffffL, 0x001fffffffffffffL, 0x003fffffffffffffL, 0x007fffffffffffffL, 0x00ffffffffffffffL, 0x01ffffffffffffffL, 0x03ffffffffffffffL, 0x07ffffffffffffffL, 0x0fffffffffffffffL, 0x1fffffffffffffffL, 0x3fffffffffffffffL, 0x7fffffffffffffffL, 0xffffffffffffffffL,};

public final static long[] GET_N_BITS_FROM_LEFT = {0x0000000000000000L, 0x8000000000000000L, 0xc000000000000000L, 0xe000000000000000L, 0xf000000000000000L, 0xf800000000000000L, 0xfc00000000000000L, 0xfe00000000000000L, 0xff00000000000000L, 0xff80000000000000L, 0xffc0000000000000L, 0xffe0000000000000L, 0xfff0000000000000L, 0xfff8000000000000L, 0xfffc000000000000L, 0xfffe000000000000L, 0xffff000000000000L, 0xffff800000000000L, 0xffffc00000000000L, 0xffffe00000000000L, 0xfffff00000000000L, 0xfffff80000000000L, 0xfffffc0000000000L, 0xfffffe0000000000L, 0xffffff0000000000L, 0xffffff8000000000L, 0xffffffc000000000L, 0xffffffe000000000L, 0xfffffff000000000L, 0xfffffff800000000L, 0xfffffffc00000000L, 0xfffffffe00000000L, 0xffffffff00000000L, 0xffffffff80000000L, 0xffffffffc0000000L, 0xffffffffe0000000L, 0xfffffffff0000000L, 0xfffffffff8000000L, 0xfffffffffc000000L, 0xfffffffffe000000L, 0xffffffffff000000L, 0xffffffffff800000L, 0xffffffffffc00000L, 0xffffffffffe00000L, 0xfffffffffff00000L, 0xfffffffffff80000L, 0xfffffffffffc0000L, 0xfffffffffffe0000L, 0xffffffffffff0000L, 0xffffffffffff8000L, 0xffffffffffffc000L, 0xffffffffffffe000L, 0xfffffffffffff000L, 0xfffffffffffff800L, 0xfffffffffffffc00L, 0xfffffffffffffe00L, 0xffffffffffffff00L, 0xffffffffffffff80L, 0xffffffffffffffc0L, 0xffffffffffffffe0L, 0xfffffffffffffff0L, 0xfffffffffffffff8L, 0xfffffffffffffffcL, 0xfffffffffffffffeL, 0xffffffffffffffffL,};

//Sequence left
public long l0;
//Sequence right
public long l1;

public BitBoard(long l0, long l1) {
    this.l0 = l0;
    this.l1 = l1;
}

public BitBoard and(BitBoard b) {
    return new BitBoard(l0 & b.l0, l1 & b.l1);
}

public void andEquals(BitBoard b) {
    l0 &= b.l0;
    l1 &= b.l1;
}

public BitBoard or(BitBoard b) {
    return new BitBoard(l0 | b.l0, l1 | b.l1);
}

public void orEquals(BitBoard b) {
    l0 |= b.l0;
    l1 |= b.l1;
}

public BitBoard not() {
    return new BitBoard(~l0, ~l1);
}

public void notEquals() {
    l0 = ~l0;
    l1 = ~l1;
}

public BitBoard rightShift(int amount) {
    if (amount <= 63) {
        return new BitBoard(l0 >>> amount, l1 >>> amount | ((l0 & GET_N_BITS_FROM_RIGHT[amount]) << (64 - amount)));
    } else {
        return new BitBoard(0, l0 >>> (amount - 64));
    }
}

public void rightShiftEquals(int amount) {
    if (amount <= 63) {
        l1 = l1 >>> amount | ((l0 & GET_N_BITS_FROM_RIGHT[amount]) << (64 - amount));
        l0 = l0 >>> amount;
    } else {
        l1 = l0 >>> (amount - 64);
        l0 = 0;
    }
}

public BitBoard leftShift(int amount) {
    if (amount <= 63) {
        return new BitBoard(l0 << amount | ((l1 & GET_N_BITS_FROM_LEFT[amount]) >>> (64 - amount)), l1 << amount);
    } else {
        return new BitBoard(l1 << (amount - 64), 0);
    }
}

public void leftShiftEquals(int amount) {
    if (amount <= 63) {
        l0 = l0 << amount | ((l1 & GET_N_BITS_FROM_LEFT[amount]) >>> (64 - amount));
        l1 = l1 << amount;
    } else {
        l0 = l1 << (amount - 64);
        l1 = 0;
    }
}

public BitBoard xOr(BitBoard b) {
    return new BitBoard(b.l0 ^ l0, b.l1 ^ l1);
}

public void xOrEquals(BitBoard b) {
    l0 ^= b.l0;
    l1 ^= b.l1;
}

public int popCount() {
    return Long.bitCount(l0) + Long.bitCount(l1);
}

public boolean equalsZero() {
    return l1 == 0 && l0 == 0;
}

public int numberOfTrailingZeros() {
    int l1Trail = Long.numberOfTrailingZeros(l1);
    if (l1Trail == 64) {
        return 64 + Long.numberOfTrailingZeros(l0);
    } else {
        return l1Trail;
    }
}

public BitBoard unsetBit(int bit) {
    if (bit <= 63) {
        return new BitBoard(l0, l1 & ~(1L << bit));
    } else {
        return new BitBoard(l0 & ~(1L << (bit - 64)), l1);
    }
}

public void unsetBitEquals(int bit) {
    if (bit <= 63) {
        l1 &= ~(1L << bit);
    } else {
        l0 &= ~(1L << (bit - 64));
    }
}}

It is to note that I have to use these operations very often and I rely entirely on their speed. However, most of the time I am unable to use the in-place methods and simple operations like add and shift will make new Objects. This results in a massive overhead of about 20% runtime which is used to initialize this data structure (see picture below).

Overhead generateded by intialization

Is there any other way to optimize this?

Also, is this code snippet

BitBoard bb;
BitBoard bb2;
BitBoard bb3;
BitBoard res = bb.and(bb2).not().xOr(bb3)

slower than

BitBoard bb;
BitBoard bb2;
BitBoard bb3;
BitBoard res=bb;
res.andEquals(bb2);
res.notEquals();
res.xOrEquals(bb3);

since it is allocating new memory for the intermediate steps?

EDIT:

I have been benchmarking my methods with JMH.

Benchmark 1 tests the method in-place:

public class MyBenchmark {

@State(Scope.Thread)
public static class Status{
    BitBoard[] arr;
    @Setup(Level.Trial)
    public void init(){
        arr= new BitBoard[1000];
        for(int i=0;i<arr.length;i++){
            arr[i]= new BitBoard((long)(Math.random()*Integer.MAX_VALUE),i);
        }
    }
}
@Benchmark @OutputTimeUnit(TimeUnit.NANOSECONDS) @BenchmarkMode(Mode.AverageTime)
public BitBoard[] testMethod(Status s) {
    BitBoard[] res= new BitBoard[s.arr.length];
    for(int i=0;i<s.arr.length;i++){
        res[i]= new BitBoard(0,0);
        for(int j=i+1;j<s.arr.length-1;j++){
            res[i].andEquals(s.arr[j]);
            res[i].andEquals(s.arr[j-1]);
            res[i].xOrEquals(s.arr[j+1]);
        }
    }
    return res;
}

}

Result: Benchmark 1 Results

The second benchmark does not use the in-place methods.

public class MyBenchmark {

@State(Scope.Thread)
public static class Status{
    BitBoard[] arr;
    @Setup(Level.Trial)
    public void init(){
        arr= new BitBoard[1000];
        for(int i=0;i<arr.length;i++){
            arr[i]= new BitBoard((long)(Math.random()*Integer.MAX_VALUE),i);
        }
    }
}
@Benchmark @OutputTimeUnit(TimeUnit.NANOSECONDS) @BenchmarkMode(Mode.AverageTime)
public BitBoard[] testMethod(Status s) {
    BitBoard[] res= new BitBoard[s.arr.length];
    for(int i=0;i<s.arr.length;i++){
        for(int j=i+1;j<s.arr.length-1;j++){
            res[i]=s.arr[j].and(s.arr[j-1]).xOr(s.arr[j+1]);
        }
    }
    return res;
}

}

Benchmark 2 results

It appears that the in-place methods do provide a speed-up!

Why bother creating new objects at all when you could modify the one calling the method(s)? — Jacob G., Dec 21 '18 at 23:10
why are your `or` `and` etc... methods return `BitBoard` instead of say `long` ? — David Soroko, Dec 21 '18 at 23:10
@JacobG. I can not always do this since I still need the old objects for future reference — Fabian v.d.W, Dec 21 '18 at 23:14
You need to test this in a real-world scenario (use your objects in some algorithm) and when you measure timing, make sure that you've warmed up the VM by having run the algorithm a few times. Hotspot is capable of interesting optimizations, it may allocate your objects on the stack rather than the heap in some cases. See for benchmarks: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java — Erwin Bolwidt, Dec 21 '18 at 23:16
@ErwinBolwidt I have already done benchmarks, see here: https://imgur.com/a/ySRA7Kd Is this what you menat? — Fabian v.d.W, Dec 21 '18 at 23:18
No, please read the post I linked. Any tool that gives you a detailed breakdown like this, instruments the application, and disrupts HotSpot optimizations which affects the performance of the application - often significantly. — Erwin Bolwidt, Dec 21 '18 at 23:44

score 2 · Accepted Answer · answered Dec 22 '18 at 02:18

What you did is profiling rather than benchmarking. For benchmarking, there's JMH which is close to perfect. I'm not sure about profilers, but most of them lie. A lot.

In case you really need to avoid allocations, you may re-use some object in tight loops. You should definitely use no pooling as for such tiny objects it has way more overhead the allocation and GC together.

How to minimize allocations

I strongly dislike your naming, so I'll use my own. You could extend the set of your operations like this

void assign(BitBoard that) {
    this.high = that.high;
    this.low = that.low;
}

void inplaceAnd(BitBoard that) {
    this.high &= that.high;
    this.low &= that.low;
}

void inplaceAndNot(BitBoard that) {
    this.high &= ~that.high;
    this.low &= ~that.low;
}

Then you can move allocations out of tight loops (at the price of making the code more ugly).

BitBoard tmp = new BitBoard(0, 0);
BitBoard result = new BitBoard(0, 0);
for (...) {
    // Let's say, you get a, b and c as inputs.
    // You should compute a&b | a&~b
    // Let's assume, none of a, b, c may be overwritten.
    tmp.assign(a);
    tmp.inplaceAnd(b);
    result.assign(a);
    result.inplaceAndNot(c);
    result.inplaceOr(tmp);    
}

Why you should not minimize allocations

All these inplace operations make the code more error-prone and much less readable than using immutables like in

BitBoard result = a.and(b).or(a.andNot(c));

Also, is this code snippet ... slower than ... since it is allocating new memory for the intermediate steps?

You have to answer your own question yourself, as all we can say is "probably yes, but usually it's negligible". In your case it may matter, but the only way to tell is to benchmark your case. Forget the profiler and let JMH compare the two versions. The JVM may optimize most of the allocations away where it matters.

Optimizing 128bit Sequence bitwise operations in Java

1 Answers1

How to minimize allocations

Why you should not minimize allocations