
I am building an Actor-Critic neural network model in PyTorch in order to train an agent to play the game of Quoridor (hopefully). For this reason, I have a neural network with two heads: one for the actor output, which applies a softmax over all possible moves, and one for the critic output, which is a single neuron (for regressing the value of the input state).

Now, in Quoridor, most of the time not all moves are legal, so I am wondering if I can exclude the output neurons on the actor's head that correspond to illegal moves for the input state, e.g. by passing a list of the indices of the neurons that correspond to legal moves. In other words, I want the denominator of the softmax to sum only over the legal moves.

Is there functionality like this in PyTorch (I cannot find any)? Should I attempt to implement such a softmax myself (I'm a bit scared to; PyTorch probably knows best, and I've also been advised to use LogSoftmax)?

Furthermore, do you think this approach of dealing with illegal moves is good? Or should I just let the agent guess illegal moves and penalize it (with a negative reward) for them, in the hope that it eventually stops picking illegal moves?

Or should I let the softmax run over all the outputs and then just set the illegal ones to zero? The rest won't sum to 1 any more, but maybe I can solve that with plain normalization (i.e. dividing by the sum of the remaining values)?
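
To make these ideas concrete, here is roughly what I'm imagining (`legal_indices` below is just a made-up example for one state):

import torch

logits = torch.randn(12)      # dummy actor-head output over all possible moves
legal_indices = [0, 3, 7]     # made-up: indices of the moves that are legal in this state

# Idea 1: softmax only over the legal entries, then scatter back into place
probs = torch.zeros_like(logits)
probs[legal_indices] = torch.softmax(logits[legal_indices], dim=0)

# Idea 2: softmax over everything, zero out the illegal moves, renormalize
full = torch.softmax(logits, dim=0)
mask = torch.zeros_like(logits)
mask[legal_indices] = 1.0
probs2 = (full * mask) / (full * mask).sum()   # renormalize so the remaining entries sum to 1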

Michael
  • I would suggest just penalizing the agent for making illegal moves. I think that is what people usually do, and it seems like the best strategy for it to fully learn the game. – Dwight Foster Apr 03 '21 at 13:19
  • That is not the correct way to go; you should always constrain the search space to valid solutions. – KonstantinosKokos Apr 03 '21 at 18:36

2 Answers


An easy solution would be to mask out illegal moves with a large negative value; this practically forces very low (log-)softmax values for them (example below).

# 3 dummy actions for a batch size of 2
>>> actions = torch.rand(2, 3)     
>>> actions
tensor([[0.9357, 0.2386, 0.3264],
        [0.0179, 0.8989, 0.9156]])
# dummy mask assigning 0 to valid actions and 1 to invalid ones
>>> mask = torch.randint(low=0, high=2, size=(2, 3))
>>> mask
tensor([[1, 0, 0],
        [0, 0, 0]])
# set actions marked as invalid to very large negative value
>>> actions = actions.masked_fill_(mask.eq(1), value=-1e10)
>>> actions
tensor([[-1.0000e+10,  2.3862e-01,  3.2636e-01],
        [ 1.7921e-02,  8.9890e-01,  9.1564e-01]])
# softmax assigns no probability mass to illegal actions
>>> actions.softmax(dim=-1)
tensor([[0.0000, 0.4781, 0.5219],
        [0.1704, 0.4113, 0.4183]])
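
If you sample actions from this policy, you can feed the masked values straight into log_softmax or torch.distributions.Categorical (continuing the example above); something like this should work:

# log-probabilities for a policy-gradient loss; illegal actions end up around -1e10
>>> log_probs = actions.log_softmax(dim=-1)
# sampling: the masked actions carry (practically) zero probability mass
>>> dist = torch.distributions.Categorical(logits=actions)
>>> action = dist.sample()
>>> logp = dist.log_prob(action)   # log-probability of the chosen action, for the actor loss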
KonstantinosKokos

I'm not qualified to say if this is a good idea, but I had the same one and ended up implementing it.

The code uses Rust's bindings for PyTorch (tch-rs), so it should be directly translatable to Python-based PyTorch.

/// As log_softmax(dim=1) on a 2d tensor, but takes a {0, 1} `filter` of the same shape as `xs`
/// and has the softmax only look at values where filter[idx] = 1.
///
/// The output is 0 where the filter is 0.
pub fn filtered_log_softmax(xs: &Tensor, filter: &Tensor) -> Tensor {
    // We are calculating `log_softmax(xs, dim=1)`, except that we only want to consider
    // the values of xs where the corresponding `filter` bit is set to 1.
    //
    // log_softmax on one element of the batch = for_each_i log(e^xs[i] / sum_j e^xs[j])
    //
    // To filter that we need to remove (zero out) elements that are being filtered both after the log is
    // taken, and before summing into the denominator. We can do this with two multiplications
    //
    // filtered_log_softmax = for_each_i filter[i] * log(e^xs[i] / sum_j filter[j] * e^xs[j])
    //
    // This is mathematically correct, but it turns out there's a numerical stability trick we need;
    // without it we were seeing NaNs. The trick is sourced from: https://stackoverflow.com/a/52132033
    //
    // Applying that trick (subtracting the per-row max before exponentiating), we come out with the following expression:
    //
    // let xs_max = max_i xs[i]
    // for_each_i filter[i] * (xs[i] - xs_max - log(sum_j filter[j] * e^(xs[j] - xs_max)))
    //
    // Keep in mind that the actual implementation below is further vectorized over an initial batch dimension.

    let (xs_max, _) = xs.max_dim(1, true);
    let xs_offset = xs - xs_max;
    // TODO: Replace with Tensor::linalg_vecdot(&filter, &xs_offset.exp(), 1).log();
    // when we update tch-rs (linalg_vecdot is new in pytorch 1.13)
    let constant_sub = (filter * &xs_offset.exp()).sum_to_size(&[xs.size()[0], 1]).log();
    filter * (&xs_offset - constant_sub)
}
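
For reference, a rough Python translation of the function above (same assumption: `filter` is a {0, 1} tensor with the same shape as `xs`) would be something like:

import torch

def filtered_log_softmax(xs: torch.Tensor, filter: torch.Tensor) -> torch.Tensor:
    # As xs.log_softmax(dim=1), but the softmax only looks at positions where filter == 1;
    # the output is 0 where the filter is 0.
    xs_max, _ = xs.max(dim=1, keepdim=True)
    xs_offset = xs - xs_max
    constant_sub = (filter * xs_offset.exp()).sum(dim=1, keepdim=True).log()
    return filter * (xs_offset - constant_sub)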
gmorenz