
My neural network has the following architecture:

input -> 128x (separate fully connected layers) -> output averaging

I am using a ModuleList to hold the list of fully connected layers. Here's how it looks at this point:

import torch
import torch.nn as nn
import torch.optim as optim

class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()

        # one small fully connected network per head
        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)

        self.cuda()
        self.optimizer = optim.Adam(self.parameters())

Then, when I need to calculate the output, I use a for ... in construct to run the forward pass through all the heads, followed by the backward pass:

q_values = torch.cat([net(observations) for net in self.networks])

# skipped code which ultimately computes the loss I need

self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

This works! But I am wondering if I couldn't do this more efficiently. I feel like by using a for...in loop, I am going through each separate FC layer one by one, whereas I'd expect this operation could be done in parallel.

MasterScrat

1 Answer


If you use Convnd in place of Linear, you can use the groups argument for "grouped convolutions" (of which "depthwise convolutions" are a special case). This lets you handle all your parallel networks simultaneously.

If you use a convolution kernel of size 1, then the convolution does nothing other than apply a Linear layer, where each channel is treated as an input dimension. So the rough structure of your network would look like this:

  1. Modify the input tensor of shape B x dim_state: add an extra dimension and replicate it nb_heads times, going from B x dim_state to B x (dim_state * nb_heads) x 1.
  2. Replace the two Linear layers with
nn.Conv1d(in_channels=dim_state * nb_heads, out_channels=hidden_size * nb_heads, kernel_size=1, groups=nb_heads)

and

nn.Conv1d(in_channels=hidden_size * nb_heads, out_channels=dim_action * nb_heads, kernel_size=1, groups=nb_heads)
  3. We now have a tensor of size B x (dim_action * nb_heads) x 1; you can now reshape it to whatever you want (e.g. B x nb_heads x dim_action). A sketch putting these steps together follows below.
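To make these steps concrete, here is a minimal sketch of a grouped-convolution version of your module (the class name MultiHeadGrouped and the repeat-based input replication are my own choices, so adapt as needed):

import torch
import torch.nn as nn

class MultiHeadGrouped(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHeadGrouped, self).__init__()
        self.nb_heads = nb_heads
        self.network = nn.Sequential(
            # each group of dim_state input channels feeds one head
            nn.Conv1d(dim_state * nb_heads, hidden_size * nb_heads,
                      kernel_size=1, groups=nb_heads),
            nn.Tanh(),
            nn.Conv1d(hidden_size * nb_heads, dim_action * nb_heads,
                      kernel_size=1, groups=nb_heads),
        )

    def forward(self, x):  # x: B x dim_state
        B = x.shape[0]
        # step 1: replicate the input nb_heads times -> B x (dim_state * nb_heads) x 1
        x = x.repeat(1, self.nb_heads).unsqueeze(-1)
        # step 2: grouped 1x1 convolutions -> B x (dim_action * nb_heads) x 1
        out = self.network(x)
        # step 3: reshape -> B x nb_heads x dim_action
        return out.view(B, self.nb_heads, -1)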

While CUDA natively supports grouped convolutions, there were some issues in PyTorch with the speed of grouped convolutions (see e.g. here), but I believe that has since been resolved.
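As a quick sanity check on the kernel-size-1 equivalence (a small self-contained example, not from the original post), you can copy a Linear layer's weights into a Conv1d and compare the outputs:

import torch
import torch.nn as nn

torch.manual_seed(0)
B, dim_state, hidden_size = 4, 8, 32

linear = nn.Linear(dim_state, hidden_size)
conv = nn.Conv1d(dim_state, hidden_size, kernel_size=1)

# a 1x1 convolution stores its weights as (out, in, 1); copy the Linear weights over
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(B, dim_state)
out_linear = linear(x)                        # B x hidden_size
out_conv = conv(x.unsqueeze(-1)).squeeze(-1)  # B x hidden_size

print(torch.allclose(out_linear, out_conv, atol=1e-6))  # True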

flawr