
A standard RNN computational graph looks as follows (in my case, for regression to a single scalar value y):

[figure: standard RNN computational graph]

I want to construct a network which accepts as input m sequences X_1...X_m (where both m and the sequence lengths vary), runs the RNN on each sequence X_i to obtain a representation vector R_i, averages the representations, and then runs a fully connected net to compute the output y_hat. The computational graph should look something like this:

[figure: desired computational graph - the RNN applied to each of the m sequences, the resulting representations averaged into R, followed by a fully connected net producing y_hat]
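Roughly, in pseudocode, the forward pass I have in mind is the following (rnn, W_hr and mlp below are just placeholders for the learned components, not actual library calls):

def forward(X):                          # X = [X_1, ..., X_m], lengths may differ
    R = [W_hr(rnn(X_i)) for X_i in X]    # one representation vector per sequence
    R_mean = sum(R) / len(R)             # average the m representations
    return mlp(R_mean)                   # fully connected net -> scalar y_hat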

Question

Can this be implemented (preferably) in Keras? Otherwise in TensorFlow? I'd very much appreciate it if someone could point me to a working implementation of this or something similar.

H.Rappeport
  • Can certainly be done, but need some clarification: what is being averaged, exactly? The last timestep's hidden states across the `m` samples? Note that unless the samples are somehow meaningfully related and aren't independent, you can get very poor results, as the resultant tensor will effectively be noise. Also, what's the expected dimensionality of the averaged quantity (i.e. along which axis are you averaging)? – OverLordGoldDragon Nov 27 '19 at 16:21
  • @OverLordGoldDragon There is a (learned) function (W_hr) from the last hidden state to a representation vector of some fixed size, and the m representation vectors are averaged to obtain the final representation vector R. The m samples for each given target are indeed expected to be related – H.Rappeport Nov 27 '19 at 16:53
  • Are you working with LSTMs? There will be extra steps compared w/ SimpleRNN or GRU – OverLordGoldDragon Nov 27 '19 at 19:10
  • Alright so, the main obstacle in using `keras` layers / models will be in circumventing the `batch_size` enforcement - that is, it expects `num_samples_out == num_samples_in`, which breaks down at the averaging step. Numerous approaches exist, many are hackish, one is valid but involves defining custom `tf` functionality (which I'm not too familiar with), and another depends on your problem definition: – OverLordGoldDragon Nov 27 '19 at 19:20
  • You could cleverly feed `m` samples along a non-sample axis, and expand the dimension along dim 0 so as to make Keras think it's only one sample. Then, gradient will be computed w.r.t. this one "sample" (i.e. no gradient averaging, and batch normalization layers will work very differently), but you still get to feed to RNNs the original sequence by squashing/squeezing the expanded input, then to output by expanding again -- I'm not entirely sure if all the gradients would work as intended, but that's up to you to determine – OverLordGoldDragon Nov 27 '19 at 19:23
  • @OverLordGoldDragon First of all thank you. GRUs will suffice if there really is a dramatic difference. I'm not sure I follow your proposal, what does "non sample axis" mean? I would like the gradients to propagate corresponding to the computational graph I attached. I don't mind "hackish" as long as it works , but on the other hand writing custom Tensorflow functionality is also fine as long as I have a clear objective of what exactly is required – H.Rappeport Nov 27 '19 at 20:18
  • Waait a minute, silly me - is each sequence/sample _univariate_ (1D vector)? If so, there's a really nice workaround – OverLordGoldDragon Nov 28 '19 at 05:35
  • Yes, they are univariate – H.Rappeport Nov 28 '19 at 06:53
  • Nevermind, no nice workarounds, but came up with a _mildly_ hackish implementation (whose gradient implications may be less mild) – OverLordGoldDragon Nov 28 '19 at 20:06

1 Answer


There isn't a straightforward Keras implementation, as Keras enforces the batch axis (the samples dimension, dimension 0) as fixed for the input & output layers (though not for all layers in between), whereas you seek to collapse it by averaging. There is, however, a workaround - see below:

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Input, Dense, GRU, Lambda
from tensorflow.keras.layers import Reshape, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.utils  import plot_model
import numpy as np

def make_model(batch_shape):
    ipt  = Input(batch_shape=batch_shape)
    x    = Lambda(lambda x: K.squeeze(x, 0))(ipt)    # (1, m, timesteps, 1) -> (m, timesteps, 1)
    x, s = GRU(16, return_state=True)(x)             # s == last returned (hidden) state, shape (m, 16)
    x    = Lambda(lambda x: K.expand_dims(x, 0))(s)  # (m, 16) -> (1, m, 16)
    x    = GlobalAveragePooling1D()(x)               # averages the m representations along axis 1
    x    = Dense(8, activation='relu')(x)
    out  = Dense(1,  activation='sigmoid')(x)

    model = Model(ipt, out)
    model.compile('adam', 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return (np.random.randn(*batch_shape),
            np.random.randint(0, 2, (batch_shape[0], 1)))

m, timesteps = 32, 100
batch_shape = (1, m, timesteps, 1)

model = make_model(batch_shape)
model.summary()  # see model structure
plot_model(model, show_shapes=True)

x, y = make_data(batch_shape)
model.train_on_batch(x, y)

The above assumes the task is binary classification, but you can easily adapt it to anything else; the main trick is fooling Keras into treating the m samples as a single sample, while the rest of the layers are free to work along the m dimension, since Keras doesn't enforce the 1 there.
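For instance, for the regression-to-a-scalar setting in the question, one would presumably just swap the output head, loss, and targets along these lines (untested sketch):

out = Dense(1, activation='linear')(x)   # scalar regression output instead of sigmoid
model.compile('adam', 'mse')             # mean squared error instead of binary_crossentropy
# targets become real-valued, e.g. y = np.random.randn(batch_shape[0], 1)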

Note, however, that I cannot guarantee this'll work as intended, for the following reasons:

  1. Keras treats all entries along the batch axis as independent, whereas your samples are stated to be dependent
  2. Per (1), the main concern is backpropagation: I'm not entirely sure how the gradient will flow with all the dimensionality shuffling going on.
  3. (1) also matters for stateful RNNs, as Keras constructs batch_size independent states; these will still likely behave as intended, since all they do is keep memory, but it's still worth understanding fully - see here

(2) is the "elephant in the room", but aside from that, the model fits your exact description. Chances are, if you've planned out forward-prop and all dims agree with the code's, it'll work as intended; otherwise, and also as a sanity check, I'd suggest opening another question to verify that gradients flow as you intend per the above code. A rough starting point for such a check is sketched below.
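A minimal sketch of such a gradient check (assuming TF2 eager execution; the cast and the max-abs reduction are just illustrative choices):

import tensorflow as tf

x, y = make_data(batch_shape)
with tf.GradientTape() as tape:
    y_pred = model(x, training=True)
    loss = tf.keras.losses.binary_crossentropy(y.astype('float32'), y_pred)
grads = tape.gradient(loss, model.trainable_weights)
for w, g in zip(model.trainable_weights, grads):
    # a None or all-zero gradient on the GRU weights would indicate broken flow
    print(w.name, None if g is None else float(tf.reduce_max(tf.abs(g))))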


model.summary():

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(1, 32, 100, 1)]         0         
_________________________________________________________________
lambda (Lambda)              (32, 100, 1)              0         
_________________________________________________________________
gru (GRU)                    [(32, 16), (32, 16)]      864       
_________________________________________________________________
lambda_1 (Lambda)            (1, 32, 16)               0         
_________________________________________________________________
global_average_pooling1d (Gl (1, 16)                   0         
_________________________________________________________________
dense (Dense)                (1, 8)                    136       
_________________________________________________________________
dense_1 (Dense)              (1, 1)                    9     

On LSTMs: an LSTM will return two last states, one for the cell state and one for the hidden state - see the source code; you should understand exactly what this means if you are to use it. If you do, you'll need concatenate:

from tensorflow.keras.layers import LSTM, concatenate
# ...
x, s1, s2 = LSTM(16, return_state=True)(x)  # s1 == hidden state, s2 == cell state; 16 units mirrors the GRU above
x = concatenate([s1, s2], axis=-1)
# ...
OverLordGoldDragon