You have two or three different things here that you're conflating.
- training file format
- in-memory storage format for the training set after parsing (if you need to keep parsed states around for future reference)
- unpacked representation of a single board state + optional? W/L/D flag
All three of these formats can be different. The training file can be text for easy editing. Of course, even if your main program reads the training set in a binary format, that binary could be "compiled" from an easily-edited text format by a separate tool.
internal representation for working with a single board position:
This needs to be fast to access and loop over. Since you're training a neural net, not writing an AI directly, you may not need to work with this representation very much. If you don't need to do more than apply each element to a neural net input, there's no point having a separate format: just unpack directly from a more compact representation to the neural net inputs.
However, if you have to loop over a single board state multiple times, you have some interesting options. As multiple people have pointed out, the win/loss/draw/undecided flag should be thought of separately from the board state. So you'd have one flag per board, not storage for the flag in every board position.
bit-board: I've read about chess engines (e.g. crafty) using a 64bit unsigned int to store where all the white pawns are, for example. You can bitwise OR bitmaps together to find where all the white pieces of any kind are.
Bitmaps (one for o
, one for x
) will record the entire state. A connect-4 board has 6*7 grid locations, so each bitmap can be 64bits, but 32b is too small. popcount(board.o)
tells you how many os are on the board. assert(o & x == 0)
will be a good sanity-check, because there can never be an o and an x in the same position.
Using two packed 42b fields in a struct would be a bad idea, since it would be slow to load/store. Even packing them into 48bit fields (so they end on byte boundaries) would lead to slower loading/storing. Remember, this is our fast format. We can use a packed format for long-term storage.
Something like board[0][0] && board[0][1] && board[0][2] && board[0][3]
(although not with that syntax) with compile-time-constant positions is very fast on a bit-board. One bitwise AND leaves just those bits possibly-set, and then you can compare against the mask to see if all the bits were set. To test ||
instead of &&
, leave out the second step. You can do these tests against the o or x bitmaps, or o|x
to check for either kind of piece. However, this isn't as efficient if you have to build the mask at run-time from variable positions.
To scan a board for wins, you might check the left column, then bit-shift your mask so it's checking the next column. Actually brute-force checking all columns like this is probably slower than checking the neighbours of markers, looking for 2-in-a-row candidates.
Some operations might be easier if the bitmap is fully 64bit, representing an 8x8 board, but you only actually use the lower-left 7x6 of it. That way, separate columns are in separate bytes of the 64bit integer. Having each column in a separate byte is probably more useful than row, because finding the highest used position in a column is a thing you might want to do. It's just a find first set bit operation on the column. Extracting 8bit chunks from a bitmap is faster (doesn't require masking). You could unpack a 42bit bitmap to separate variables for each column, though. On x86, where the first 4 registers are byte addressable for the first and second 8 bit chunks (AX (the low16 of RAX) is made up of AL and AH), you (or the compiler, but they're probably not this clever) can store 7 columns in 4 registers, and still be able to bsr
(bit-scan-reverse) any column separately.
// Sample bitboard implementation:
struct game_state {
struct board_state {
uint64_t o, x;
} board;
enum winlose { GAME_UNDECIDED=0, GAME_WIN_O, GAME_WIN_X, GAME_DRAW } victory;
};
array of 2-bit fields: don't use. A similar implementation would use a 2-bit bitfield for every position. It's impossible to use with nice board[row][col]
syntax in C, and 42*2 bits doesn't fit in a single register. Interleaving the bitboards gives no advantage, and makes some things a lot worse, esp. because the whole thing doesn't fit in 64bits. (If you want to look for unoccupied spaces in the bitboard version, you look for zero bits in o|x
. Here, you have to check every pair of 2 bits, instead of being able to use one bitwise to fit the whole problem into a register. Still, you could make a macro to shift/mask the 2 bits representing a given row/column. It wouldn't make efficient code.
array of bytes: It may be faster to use this format for looping around checking the neighbours of a given board position. In a bitboard, testing board[i][j] && board[i][j+1]
could be done by bit-shifting the board so the two bits of interest line, then bitwise AND, then bit-test that bit. At least on x86, there are addressing modes with small byte offsets, so given the address of one board position, ANDing another board position might only take a single instruction.
In each byte, one bit represents x, another represents o, and another bit should be set if the location is either x or o. This allows checking that multiple positions are all occupied by ANDing them together, and checking the occupied bit. Otherwise, you'd have to check if either the x or o bit was set in each grid.
// Sample byte-array implementation:
enum boardpos {
POS_EMPTY = 0,
POS_O = 1<<0,
POS_X = 1<<1,
POS_OCCUPIED = 1<<3
};
// maybe #define for these constants instead?
struct game_state {
struct board_state {
uint8_t pos[6][7];
} board;
enum winlose { GAME_UNDECIDED=0, GAME_WIN_O, GAME_WIN_X, GAME_DRAW } victory;
// or maybe stuff the winlose info into the high bits of board.pos[0][0]?
// Not much point, since the struct will probably be the same size after padding anyway.
};
File-format representation:
A more compact but still easy to use format would be xbbb...ooxbbw
. Then you don't have to parse lines as strings, just as a constant-size chunk of 43 characters (or 43 if each record is separated by a newline). If you have any board positions which aren't wins, losses, or draws, use another character to mark that. Space, or 'n'
.
Just leaving out the commas cuts your size nearly in half. You don't want to have to parse the input on commas and stuff. There are potential further gains from a simple run-length encoding notation, like xb2o1xb1w
. Seeing a digit means repeat the last character that many more times. And maybe x
means one x, capital X
means two xes. That's getting to the point that it's hard for a human to read. LZOP or LZ4 compression could probably do a good job of compacting things.
Binary file formats:
Obviously there's a limit to text representations. Fixed-size binary records can be very small, because there isn't much information to be stored. Using 2 bits per grid position is probably compact enough, but there's still redundancy because that can represent the impossible state of x and o in the same position. To do better, you'd need to map the state of the whole board, or a whole row or column, to a multi-bit representation. Wikipedia says there are 4,531,985,219,092 legal positions for all game boards populated with 0 to 42 pieces. That's just over 2^42. So 43 bits should be enough to represent any valid board state, including all not-yet-decided positions. IDK how to go about encoding games into 43-bit integers, at least not anything usable (i.e. faster than actually enumerating all possible games and stopping at the one that matches.)
If you use bit-boards as your internal fast representation, store them packed in your file format, so the o
and x
boards plus the w/d/d status fits in 12 bytes, or 16bytes if you like round numbers.
// do some pre-processor stuff to choose between GNU C __attribute__ ((__packed__))
// and the MSVC #pragma pack
struct __attribute__ ((__packed__)) compact_game_state {
struct __attribute__ ((__packed__)) compact_board_state {
uint64_t o:42, x:42;
} board; // sizeof = 11
uint8_t victory;
}; // sizeof = 12
struct semi_compact_game_state {
struct __attribute__ ((__packed__)) semi_compact_board_state {
uint64_t o:48, x:48;
} board; // 96 bits = 12 bytes
enum winlose victory; // another 4 bytes
};
These actually compile with g++: see on godbolt.
Do your I/O with endian-agnostic code, so it doesn't break on big-endian machines. It's a file format, so it doesn't actually matter how you access it, as long as the right bytes go in the right place. Little-endian is probably a good choice for the file format, so on little-endian machine the load/store code is a no-op. Or just be lazy and do binary I/O on an array of structs, and only use your code on machines with the same endianness as the training data set.
If you don't use bit-boards, it's probably best to use an array of 2-bit fields. Random-access can be slow, but converting this to a byte-array will probably be faster than for two separate bitfields. Mask off the low 2 bits, use that as an index into a lookup table of { POS_EMPTY, POS_O|POS_OCCUPIED, POS_X|POS_OCCUPIED }
. Then bit-shift by two to bring the next field into the low position. The board takes 84 bits, so do it in separate 32 or 64bit chunks. Doing a 128bit double-shift is not needed. The win/lose/draw info can go in a final 2-bit chunk.
Pack this into a 12 byte struct of three uint32_t, or a uint64_t and a uint32_t or something. Or just an array of uint8_t, but then it's harder to get the compiler to do one wide load. You could pack things into an 11 byte struct, but then alignment is more of a problem. If saving 1/12th memory size makes a useful difference to caching, go for it. A load that crosses a cache line only costs a couple extra cycles on x86 CPUs, and you're not loading very often. (A 64bit load of the first chunk won't be aligned to 64b anyway, with 12B structs, but it will at least split right down the middle, which is a special case that's faster than an uneven cache-line split on some CPUs.)
Decoding separate bitboards into a byte array would require shifting each bitboard separately, I think. It could still be branchless, but not as nice.
in-memory storage for the database of games
Converting between representations takes CPU time, so just go from file format to internal if it's not useful.
It's probably only useful to have a separate format for this if you keep a text file format, and use a non-compact fast format. In that case, parsing the file into packed 2-bit positions (like for that file format, see above).
If your file format is binary, just keep that around. (Or even just memory-map the file.)