
I am currently making a chess game and I am storing my information on two 8 by 8 arrays. The first stores team type (1 for white, 0 for nothing, -1 for black), and the second stores piece type (1 for pawn, 2 for rook, etc...). Since I am programming an AI for the game, I was wondering if I could combine these two maps together, with one 8 by 8 array. The information would be stored in one byte like this:

00 A BC DEF

where A represents if it moved, BC represents the team:

00 A 01 DEF for white,

00 A 10 DEF for black,

00 A 00 DEF for nothing

and DEF represents the piece type in a similar manner. I would access these values by & ing a mask with the byte. So here is my question. Would a computer be able to access information faster with a smaller array, or would the bitwise function make it run too slow? Memory is not an issue for me.
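For concreteness, here is a minimal C sketch of that encoding; the mask values follow the layout above, and the helper names are just for illustration:

    #include <stdint.h>

    /* Bit layout per square: 00 A BC DEF
       A   (bit 5)    : has the piece moved?
       BC  (bits 4-3) : team  (01 = white, 10 = black, 00 = nothing)
       DEF (bits 2-0) : piece (1 = pawn, 2 = rook, ...)              */
    #define MOVED_MASK 0x20   /* 00100000 */
    #define TEAM_MASK  0x18   /* 00011000 */
    #define PIECE_MASK 0x07   /* 00000111 */

    typedef uint8_t Square;
    Square board[8][8];

    static int has_moved(Square s) { return (s & MOVED_MASK) != 0; }
    static int team(Square s)      { return (s & TEAM_MASK) >> 3; }  /* 1 = white, 2 = black, 0 = nothing */
    static int piece(Square s)     { return  s & PIECE_MASK; }

    static Square make_square(int moved, int team, int piece) {
        return (Square)((moved ? MOVED_MASK : 0) | (team << 3) | piece);
    }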

Michael Choi
  • Profile it. It will likely depend on your system, exactly how much pressure the CPU cache is under, etc. Bitwise ops are super cheap, but so are memory accesses in a small, contiguous space. – ShadowRanger Jan 27 '16 at 23:50
  • Your encoding is a bit too liberal. There are 6 different pieces of two colors, and an empty space. Obviously it fits in 4 bits: leftmost for the color (1-white, 0-black) and the rest for the name (0-nothing, 1-pawn, etc.). Since "nothing" has no color, `1000` and `0000` are empty fields (or you may choose to not use `1000`). Now, do you really need the "moved" bit? I don't know much about chess, but I think the strategy depends on the position only, regardless of which piece was last moved. There is not much difference between 128, 64 or 32 bytes; there *might* be if you have 100,000 of them. – Vlad Feinstein Jan 28 '16 at 21:45
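A minimal C sketch of the 4-bit-per-square encoding that comment describes (the helper names are illustrative):

    #include <stdint.h>

    /* 4 bits per square:
       bit 3    : color (1 = white, 0 = black)
       bits 2-0 : piece (0 = nothing, 1 = pawn, 2 = rook, ...)
       so 0000 (and optionally 1000) means an empty square.    */
    static int color(uint8_t sq)    { return (sq >> 3) & 1; }
    static int piece(uint8_t sq)    { return sq & 0x7; }
    static int is_empty(uint8_t sq) { return (sq & 0x7) == 0; }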

2 Answers


The answer here is "yes, no and maybe". It REALLY depends on how good the compiler is, which processor it runs on, and which matters more to you: space or time.

My instinct would be that storing will be slower with the bitmask approach, while retrieving is about the same whether you store one value per byte or as a packed bit pattern.

The reason is that a store involves:

  value = (value & ~bits) | new_value;

where `bits` is the mask of the bits you want to use (e.g. for the color it would be 0x18, or 00011000), and `new_value` is the new field value already shifted into position.

Checking the value involves a single AND operation:

  value & bits; 

On the other hand, if you use three [or four] separate bytes to store each value, you get a single-byte store and a single-byte load for each field. That's about as fast as anything can be on a modern machine.
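For comparison, here is a sketch of the unpacked layout this paragraph refers to, with one byte per field (the struct and field names are illustrative):

    #include <stdint.h>

    /* One byte per field: every access is a plain byte load or byte store,
       with no read-modify-write or masking needed.                         */
    typedef struct {
        uint8_t moved;   /* 0 or 1                            */
        uint8_t team;    /* 1 = white, 2 = black, 0 = nothing */
        uint8_t piece;   /* 1 = pawn, 2 = rook, ...           */
    } UnpackedSquare;

    UnpackedSquare board[8][8];

    /* e.g.  board[r][c].team = 1;  is a single byte store */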

Also affecting this is how much cache the processor has: fetching from main memory, or even doing an L2 cache fetch, is slower than two or three extra simple instructions.

But the real answer for anything like this is: write some code and measure the speed. Make sure the benchmark is REALISTIC, so that you have the right mix of other instructions and the right amount of data (working on 5 bytes is not the same as generating the possible moves 6 moves ahead, with 100,000 choices [full boards] calculated at each move level).
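As a starting point only (this is nowhere near the realistic workload described above, just the shape of a measurement), a timing harness might look something like this:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    uint8_t board[64];        /* packed board, one byte per square            */
    volatile unsigned sink;   /* stops the compiler discarding the whole loop */

    int main(void) {
        clock_t t0 = clock();
        unsigned sum = 0;
        for (long rep = 0; rep < 1000000L; rep++)   /* repeat enough to measure */
            for (int s = 0; s < 64; s++)
                sum += board[s] & 0x18;             /* extract the team bits    */
        sink = sum;
        printf("%.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }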

Mats Petersson

Your choice of DEF as the bottom 3 bits is wise. You might want to use them to index a table, so only needing to mask, not shift-and-mask, is good.

If you ever want to index a table with the BC bits, putting them as the upper 2 bits would mean you only need to shift, and not mask.

Or, if you often want to test the A bit, putting it as the MSB could save a tiny amount of code size. `test al,al` / `js` (branch on the sign bit) is slightly shorter than `test al, 0x20` / `jnz` (branch on bit 5), because of the immediate byte. The only speed difference is from the x86 code size, though, so it's very minor.

*What is the efficient way to count set bits at a position or lower?* took advantage of a similar trick, where the compiler can use an arithmetic right shift to broadcast the sign bit to the whole register. This can then be used as an AND-mask.
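To make those layout tricks concrete, here is a hypothetical C fragment; the lookup table and the alternative field positions are purely illustrative:

    #include <stdint.h>

    int piece_value[8];   /* hypothetical table indexed directly by the DEF bits */

    int value_of(uint8_t sq) {
        return piece_value[sq & 0x07];   /* mask only, no shift */
    }

    /* If the BC team bits were the top two bits instead: */
    int team_of(uint8_t sq) {
        return sq >> 6;                  /* shift only, no mask */
    }

    /* If the "has moved" bit were the MSB, a signed view lets the compiler
       broadcast it with an arithmetic right shift, giving 0 or -1 (all ones),
       which can then be used directly as an AND-mask:                        */
    int moved_mask(uint8_t sq) {
        return (int8_t)sq >> 7;
    }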


Bitwise operations are cheap. If you often need both pieces of information (piece type and color), you'll probably be better off packing your two maps together rather than keeping two separate arrays. Four 16B vector registers can hold a whole board. Loading from memory that's hot in L1 cache is extremely cheap on x86, maybe even cheaper than a couple of ALU ops in registers if there's a lot of parallelism and your code is already bottlenecked on 3 or 4 ALU operations per clock (Intel SnB / Haswell). So it might possibly be a win to store them separately, but only if everything really is hot in L1. Otherwise, go for the data density every time.
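Purely as an illustration of "four 16B vector registers can hold a whole board", assuming the packed one-byte-per-square layout and x86 SSE2 intrinsics:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    uint8_t board[64];   /* one packed byte per square = 64 bytes total */

    void load_board(__m128i v[4]) {
        /* the whole position fits in four 16-byte vector registers */
        for (int i = 0; i < 4; i++)
            v[i] = _mm_loadu_si128((const __m128i *)(board + 16 * i));
    }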


Many chess programs store board positions as a Bitboard, where each bit of a 64-bit integer represents one square of the board. So you might have one mask for "black pawns" and another for "white pawns". You can OR them together to get a bitboard of multiple piece types, and AND them to see where they intersect.

I think you can use clever bitwise tricks to check the diagonals the bishops can attack, and things like that. 32- and 64-bit immediate constants are cheap.
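A minimal bitboard sketch (the rank mask and variable names are illustrative):

    #include <stdint.h>

    /* One 64-bit mask per piece type and color; bit n corresponds to square n. */
    uint64_t white_pawns, black_pawns, white_rooks /* , ... */;

    uint64_t all_pawns(void) { return white_pawns | black_pawns; }

    /* White pawns about to promote, assuming the 7th rank occupies bits 48-55: */
    uint64_t promoting_pawns(void) {
        const uint64_t rank7 = 0x00FF000000000000ULL;
        return white_pawns & rank7;
    }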

Peter Cordes