
I'm currently working on a Chess AI. The idea behind this project is to create a neural network that learns how to evaluate a board state, and then to search the possible next moves with Monte Carlo tree search to find the "best" move to play (as judged by the NN's evaluation).

Code on GitHub

TL;DR

The NN gets stuck predicting the average evaluation of the dataset and never learns to predict the evaluation of an individual board state.

Implementation

Dataset

The dataset is a collection of chess games fetched from the official lichess database. Only games that have an evaluation score (which the NN is supposed to learn) are included. This reduces the dataset to about 11% of its original size.
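The filtering step itself is not shown in the post; a minimal sketch of how such a filter could look, assuming the python-chess library (the file path and generator name are illustrative, not the asker's actual code):

import chess.pgn

def games_with_evals(pgn_path: str):
    """Yield only games whose moves carry [%eval ...] annotations."""
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            first_node = game.next()
            # GameNode.eval() returns the engine score parsed from the
            # [%eval ...] comment, or None if the game was not analysed
            if first_node is not None and first_node.eval() is not None:
                yield game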

Data representation

Each move is a datapoint to train the network on. The input to the NN is 12 arrays of size 8x8 (so-called bitboards), one for each of the 6x2 piece-and-color combinations. The move evaluation is normalized to the range [-1, 1] using a scaled tanh function. Since many evaluations are very close to 0 or to -1/1, a percentage of these is dropped as well, to make the dataset less concentrated around those values.
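For reference, a minimal sketch of this encoding, again assuming python-chess (the helper name is illustrative; the actual encoding lives in the linked repository). Per the comment thread below, white pieces are encoded as +1 and black pieces as -1:

import numpy as np
import chess

def board_to_bitboards(board: chess.Board) -> np.ndarray:
    """Encode a position as 12 planes of 8x8: 6 piece types x 2 colors."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for square, piece in board.piece_map().items():
        # piece_type is 1..6 (pawn..king); white uses planes 0-5, black 6-11
        plane = (piece.piece_type - 1) + (0 if piece.color == chess.WHITE else 6)
        row, col = divmod(square, 8)
        planes[plane, row, col] = 1.0 if piece.color == chess.WHITE else -1.0
    return planes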

Without dropping some of the moves with an evaluation close to 0 or -1/1, the dataset would look like this: [graph: evaluation distribution without dropping]

With dropping some, the dataset looks like this and is a lot less concentrated at one point: [graph: evaluation distribution with dropping]

The output of the NN is a single scalar value between -1 and 1, representing the evaluation of the board state: -1 means the board is heavily favored for the black player, 1 means it is heavily favored for the white player.
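A quick worked example of the tanh scaling used below (assuming raw scores roughly in the range -100 to 100, as mentioned in the comments):

import numpy as np

raw = np.array([-50., -15., -0.5, 0.5, 15., 50.])
print(np.tanh(raw / 10.))
# -> [-0.9999 -0.9051 -0.0500  0.0500  0.9051  0.9999] (rounded)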

from typing import Tuple

import numpy as np
from pandas import DataFrame

def create_training_data(dataset: DataFrame) -> Tuple[np.ndarray, np.ndarray]:
    def drop(indices, fract):
        # randomly drop the given fraction of the rows at these indices
        drop_index = np.random.choice(
            indices,
            size=int(len(indices) * fract),
            replace=False)
        dataset.drop(drop_index, inplace=True)

    # thin out the over-represented regions: extreme evaluations that
    # saturate tanh, and evaluations very close to zero
    drop(dataset[abs(dataset[12] / 10.) > 30].index, fract=0.80)
    drop(dataset[abs(dataset[12] / 10.) < 0.1].index, fract=0.90)
    drop(dataset[abs(dataset[12] / 10.) < 0.15].index, fract=0.10)

    # the first 12 columns are the bitboards; column 12 is the evaluation
    y = dataset[12].values
    X = dataset.drop(12, axis=1)

    # move the evaluation into the range [-1, 1]
    y = y.astype(np.float32)
    y = np.tanh(y / 10.)

    return X, y
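As clarified in the comment thread below, the bitboards are stored flattened, so the input of length 768 is restored to board shape by a Reshape layer at the very beginning of the network defined in the next section:

from tensorflow.keras.layers import Reshape

# restore the flat 768-value input to 12 bitboards of 8x8 plus a channel axis
model.add(Reshape((12, 8, 8, 1), input_shape=(12 * 64,)))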

The neural network

The neural network is implemented using Keras.

A CNN is used to extract features from the board; these are then passed to a dense network that reduces them to a single evaluation. This is loosely based on the network architecture AlphaGo Zero used.

The CNN is implemented as follows:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(256, (3, 3), activation='relu', padding='same', input_shape=(12, 8, 8, 1)))

for _ in range(10):
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(BatchNormalization())

model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
# model.add(Rescaling(scale=1 / 10., offset=0))  # required? The data is already scaled in create_training_data; does the network learn that, or does doing it explicitly help?
model.add(Dense(units=1, activation='tanh'))
model.compile(
    loss='mean_squared_error',
    optimizer=Adam(learning_rate=0.01),
    # metrics=['accuracy', 'mse']  # do these influence training at all?
)

Training

The training is done using Keras. Multiple sets of 50k-500k moves are used to train the network. The network is trained for 20 epochs on each move set with a batch size of 64, and 10% of the moves are held out for validation.

Afterwards, the learning rate is adjusted to 0.001 / (chunk index + 1).

import pandas as pd

for i, chunk in enumerate(pd.read_csv("../dataset/nm_games.csv", header=None, chunksize=100000)):
    X, y = create_training_data(chunk)

    model.fit(
        X,
        y,
        epochs=20,
        batch_size=64,
        validation_split=0.1
    )

    # decay the learning rate from chunk to chunk
    model.optimizer.learning_rate = 0.001 / (i + 1)

Issues

The NN currently does not learn anything. It converges within a few epochs to the average evaluation of the dataset, and its prediction does not depend on the board state at all.

Example after 20 epochs:

Dataset Evaluation      NN Evaluation    Difference
-0.10164772719144821    0.03077016       0.13241789
 0.6967725157737732     0.03180310       0.66496944
-0.3644430935382843     0.03119821       0.39564130
 0.5291759967803955     0.03258476       0.49659124
-0.25989893078804016    0.03316733       0.29306626

The NN evaluation is stuck at about 0.03, the approximate average evaluation of the dataset, and does not improve from there.

[graph: training loss]

What I tried

  • Increased and decreased NN size
    • Added up to 20 extra Conv2D layers, since Google did that in their implementation (AlphaGo Zero) as well
    • Removed all 10 extra Conv2D layers, since I read that many NNs are too complex for their dataset
  • Trained for days at a time
    • Since the NN is stuck at 0.03 and never moves from there, that time was wasted.
  • Dense NN instead of CNN
    • Did not eliminate the point where the NN gets stuck, but trains faster (i.e. gets stuck faster :) )
      model = Sequential()
      model.add(Dense(2048, input_shape=(12 * 8 * 8,), activation='relu'))
      model.add(Dense(2048, activation='relu'))
      model.add(Dense(2048, activation='relu'))
      model.add(Dense(1, activation='tanh'))
      model.compile(
          loss='mean_squared_error',
          optimizer=Adam(learning_rate=0.001),
          # metrics=['accuracy', 'mse']
      )
    
  • Sigmoid activation instead of tanh: moves the evaluation from the range [-1, 1] to [0, 1], but otherwise changed nothing about getting stuck.
  • Epochs, batch size, and chunk size increased and decreased: none of these changes significantly affected the NN evaluation.
  • Learning rate adaptation
    • Larger learning rates (0.1) made the NN unstable, with each training run converging to either -1, 1, or 0.
    • Smaller learning rates (0.0001) made the NN converge more slowly, but it still got stuck at 0.03.

Code on GitHub

Question

What should I do? Is there something I'm missing, or is there an error in my approach?

  • what is the train loss vs crossval vs test loss? Did you randomize the data's order? Have you tried training for learning the evaluation without tanh normalization? – Juan Carlos Ramirez Nov 25 '21 at 22:31
  • - Train loss = crossval = test loss = 0.22 once the model has settled (at most 5 epochs required to get there). - With Keras model.fit the parameter shuffle=True is the default; I believe that is used to randomize the order. - I trained using sigmoid normalization and activation and didn't get a different result. I would not know how to train without normalization, since the output is in the range -100 to 100. – Bertil Braun Nov 25 '21 at 22:40

1 Answer


My two suggestions:

  • Use the full dataset and score each position based on whether that player won the game or not. I don't know this dataset, and there might be something off with the evaluations provided by others (or are they verified?). Even if you are sure about their validity, I would test this, as it can provide some more information on what the problem might be.
  • Check your data representation. You have probably already done this a couple of times, but I can tell you from experience that it is easy to introduce an error and to overlook it. Adding a test might help you in the long run. Some of my own problems were:
    • Indication of the current player's colour: do you have a player-colour plane, or do you switch the current player's pieces? (A sketch of a side-to-move plane follows this list.)
    • Incorrect translation from 1D to 3D or vice versa. (This should not prevent you from training, but catching it saves you a lot of time if you want to port to a different device.)
    • I trained a Go game engine and do not know what representation is used for chess; it took me some time to figure out a good representation for checkers.
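A minimal sketch of such a side-to-move plane, building on the encoding sketch earlier in the question (the helper name is an assumption, not the asker's actual code):

import numpy as np
import chess

def add_turn_plane(planes: np.ndarray, board: chess.Board) -> np.ndarray:
    # append a 13th 8x8 plane: all ones when White is to move, all zeros otherwise
    turn = np.full((1, 8, 8), 1.0 if board.turn == chess.WHITE else 0.0,
                   dtype=planes.dtype)
    return np.concatenate([planes, turn], axis=0)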

Not a solution, but I found that cyclic learning rates worked great for my Go engine; they might be something to look at once the rest works.
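A minimal sketch of one common cyclical schedule, the triangular policy, as a Keras callback (base_lr, max_lr, and step_size are placeholder values, not tuned for this problem):

import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

def triangular_clr(base_lr=1e-4, max_lr=1e-2, step_size=5):
    # the learning rate ramps linearly from base_lr up to max_lr and back
    # down once every 2 * step_size epochs
    def schedule(epoch, lr):
        cycle = np.floor(1 + epoch / (2 * step_size))
        x = abs(epoch / step_size - 2 * cycle + 1)
        return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
    return LearningRateScheduler(schedule)

# usage: model.fit(X, y, epochs=40, callbacks=[triangular_clr()])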

  • The evaluations are generated by the Stockfish engine [dataset](https://database.lichess.org/), therefore I'm quite confident that they are valid. The data representation is bitboards [wiki](https://en.wikipedia.org/wiki/Bitboard) with +1 for the white and -1 for the black player. They are then flattened to one array of length 768. That's the point where a translation from 3D to 1D happens, which I hope is reversed by adding a `Reshape((12, 8, 8, 1), input_shape=(12 * 64,))` layer, which should restore the 12 bitboards at the very beginning of the network. Is there an issue with that? – Bertil Braun Jan 24 '22 at 19:44
  • That sounds legit; still, I would try it just to eliminate it as a possibility. The reshape looks good. Did you validate that the bitboards are generated correctly? At a glance your method and code look good, so it's a process of elimination to figure out where the problem is. – MaMiFreak Jan 25 '22 at 10:47
  • I just realized you do not encode the player's turn in any way. A position where it is Black's turn will have a different evaluation than when it is White's. – MaMiFreak Jan 25 '22 at 11:27
  • Yes, printing the bitboards shows the correct result. And yes, it's true, the turn is missing, though that shouldn't be the reason why the evaluation almost instantly converges to some value. Do you know of any reason why it converges so quickly? It probably doesn't find any pattern in the dataset, right? – Bertil Braun Jan 25 '22 at 14:43
  • Yes, that is what it looks like to me. I never trained a model with only the current board state; I always added some sort of history or a current-player indication. It seems to me that you need it, but I never tested it, so I am not 100% sure. – MaMiFreak Jan 25 '22 at 15:49
  • Alright, I'll definitely try that out. The idea right now is to flip the evaluation on Black's turn, so that it's +1 for White and -1 for Black when it's the white player's turn, and -1 for White and +1 for Black when it's the black player's turn, rather than adding another variable indicating the turn itself. – Bertil Braun Jan 25 '22 at 19:51
  • After rewriting, fetching the dataset again, and retraining 3 times until it got completely stuck again... the same problem persists. – Bertil Braun Jan 25 '22 at 21:31
  • Alright, I continued to play around with model size and learning rate and checked the data again (I might have had flipped evaluations for the black player), and it's currently at least not getting completely stuck right from the start. I'll start training on my server first thing in the morning and let it train over the weekend. I hope to see some improvements after that :) – Bertil Braun Jan 27 '22 at 22:52
  • Great news =) Keep me posted. A question about the change you made: did you change the training target or the input? – MaMiFreak Jan 28 '22 at 09:21
  • I flipped the target, since it was not getting flipped when flipping the board. But... I made a mistake and the program crashed on Friday, so basically no training occurred. Today I tried replacing the CNN with a deep dense NN and seem to have gotten better results. The error is still quite high, though. I currently don't know whether it just requires more data/training time or whether it's a structural or parameter issue. Currently 10 layers, 50k games per run, 50 epochs per run, and still on average a delta of 0.3 (for an output range of -1 to 1) in the predictions. – Bertil Braun Jan 31 '22 at 20:12
  • Alright, we're currently not doing too badly. The loss is down to 0.07 and I've greatly reduced the model size to speed up prediction. I'm relatively confident that longer training times and filtering for higher-Elo games would most definitely improve the model. The issue at the moment is that predicting 1000 positions each separately takes over a minute, while predicting them all at once takes half a second. What's the issue there? Any clue? – Bertil Braun Feb 05 '22 at 12:22
  • Sounds good. What did you change to get this result? The target predictions only, or also something in the input? – MaMiFreak Feb 07 '22 at 08:09
  • I assume you are predicting one position at a time rather than in batches? Your GPU has enough capacity to infer multiple positions at once; if you are not batching the predictions, quite some capacity goes unused. I'd need to know the exact number of positions and the batch size to get an idea of whether your numbers seem right. – MaMiFreak Feb 07 '22 at 08:16
  • I flipped the target, started out using a lower learning rate, and let it train longer. – Bertil Braun Feb 07 '22 at 16:34
  • And yes, one at a time. I'm currently rewriting to try to batch together some of the predictions. I'll let you know if I get somewhere. – Bertil Braun Feb 07 '22 at 16:35
  • Batching definitely helps with prediction time. Well... even though that works, using Monte Carlo tree search together with about 70k evaluations still plays like sh***. Even I myself can checkmate it (and that means something...). So I don't know anymore. But definitely a cool project anyway. – Bertil Braun Feb 16 '22 at 10:17
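For reference, a minimal sketch of the batched inference discussed above, assuming `model` is the trained Keras network from the question and `positions` is an array of already-encoded boards (both names are contextual placeholders):

import numpy as np

# positions: an (N, 12, 8, 8, 1) array of encoded board states gathered by
# the tree search; one batched call amortizes the per-call overhead that
# makes N separate predict() calls so slow
evaluations = model.predict(positions, batch_size=1024).ravel()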