Keras: How to shape inputs for CNN and LSTM layers?

Question

I am building a model to predict geospatial-temporal datasets.

My data has original dimensions (features, lat, lon, time), i.e. for each feature and at each lat/lon point there is a time series.

I have created a CNN-LSTM model using Keras like so (I assume the below needs to be modified; this is just a first attempt):

def define_model_cnn_lstm(features, lats, lons, times):
    """
    Create and return a model with CNN and LSTM layers. Input and output data is
    expected to have shape (lats, lons, times).

    :param lats: latitude dimension of input 3-D array 
    :param lons: longitude dimension of input 3-D array
    :param times: time dimension of input 3-D array
    :return: CNN-LSTM model appropriate to the expected input array
    """
    # define the CNN model layers, wrapping each CNN layer in a TimeDistributed layer
    model = Sequential()
    model.add(TimeDistributed(Conv2D(features, (3, 3), 
                                     activation='relu', 
                                     padding='same', 
                                     input_shape=(lats, lons, times))))
    model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
    model.add(TimeDistributed(Flatten()))

    # add the LSTM layer, and a final Dense layer
    model.add(LSTM(units=times, activation='relu', stateful=True))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mse')

    return model

My assumption is that this model will take data with shape (features, lats, lons, times), so for example if my geospatial grid is 180 x 360 and there are 100 time steps at each point, and I have 4 features per observation/sample, then the shape will be (4, 180, 360, 100).

I assume that I will want the model to take arrays with shape (features, lats, lons, times) as input and be able to predict label arrays with shape (labels, lats, lons, times) as output. I am first using a single variable as my label, but it might be interesting later to be able to have multivariate output as well (i.e. labels > 1).

How should I best shape my data for input, and/or how to structure the model layers in a way that's most appropriate for this application?

You have multiple geospatial grids, right? i.e. the whole training data looks like `(num_grids, features, lats, lons, time)`? — today, Sep 28 '18 at 20:29
No, there is a single geospatial grid (lats x lons) of values. Each lat/lon point has multiple features (4 in the example described above). — James Adams, Sep 28 '18 at 20:32
Then I am confused a bit: don't you have a timeseries of multi-channel (i.e. features) spatial maps? And what do you want to predict? The next steps of timeseries along time dimension? — today, Sep 28 '18 at 20:35
Trust me, I'm the one who's confused here. If I understand correctly I should instead look at this as a "multi-channel" dataset, i.e. each feature is a channel. What I am trying to predict is a corresponding dataset where y == f(X), and the model is being used as f(). For example at each lat/lon we have a timeseries with temperature and humidity values (the features), and the model should be able to predict a corresponding precipitation timeseries (the label). — James Adams, Sep 28 '18 at 20:44
And one more question: you mentioned you don't have multiple grids, so you mean for example you have only **a single** training data of shape `(4, 180, 360, 100)`? That would be too little data. How many timesteps are there then? Maybe the length of timeseries is too long?! — today, Sep 28 '18 at 22:08
Yes, the timeseries dimension is lengthy, it can be in the thousands. Also the lats and lons are more numerous, but the above numbers were used for simplicity. We have the ability to modify the grid size as well as the number of timesteps. — James Adams, Sep 28 '18 at 22:22

today · Accepted Answer · 2018-09-28T23:29:33.563

Well, I think it is better to reshape your data to (time, lats, lons, features), i.e. it is a timeseries of multi-channel (i.e. features) spatial maps:

import numpy as np
data = np.transpose(data, [3, 1, 2, 0])  # (features, lats, lons, time) -> (time, lats, lons, features)
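
As a quick check of the axis order, here is a minimal sketch with a small dummy array (the sizes are made up for illustration and are not the question's real grid):

```python
import numpy as np

# Hypothetical small array in the question's layout: (features, lats, lons, time)
features, lats, lons, times = 4, 6, 8, 10
data = np.arange(features * lats * lons * times).reshape(features, lats, lons, times)

# Move time to the front and features to the end: (time, lats, lons, features)
reordered = np.transpose(data, [3, 1, 2, 0])
print(reordered.shape)  # (10, 6, 8, 4)

# The value of feature f at (lat i, lon j, time t) now lives at [t, i, j, f]
assert reordered[7, 2, 5, 3] == data[3, 2, 5, 7]
```

Note that np.transpose only reorders axes; it does not change any values.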

Then you can easily wrap Conv2D and MaxPooling2D layers inside a TimeDistributed layer to process the (multi-channel) maps at each timestep:

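from keras.models import Sequential
from keras.layers import (TimeDistributed, Conv2D, MaxPooling2D, Flatten,
                          LSTM, Reshape, UpSampling2D)

num_steps = 50   # timesteps per sample (window length)
lats = 128
lons = 128
features = 4     # input channels
out_feats = 3    # labels predicted at each grid point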

model = Sequential()
model.add(TimeDistributed(Conv2D(16, (3, 3), activation='relu', padding='same'), 
                          input_shape=(num_steps, lats, lons, features)))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu', padding='same')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu', padding='same')))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))

At this point each sample is a tensor of shape (50, 16, 16, 32). Next we use a Flatten layer (wrapped in a TimeDistributed layer, of course, so as not to lose the time axis) and feed the result to one or more LSTM layers (with return_sequences=True to get the output at each timestep):

model.add(TimeDistributed(Flatten()))

# you may stack multiple LSTM layers on top of each other here
model.add(LSTM(units=64, return_sequences=True))
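
The shape arithmetic behind the encoder can be reproduced without Keras. The following is a small sketch; the helper name `pooled_size` is made up here, and it assumes MaxPooling2D with its default 'valid' padding, which floors odd sizes:

```python
def pooled_size(size, num_pools, pool=2):
    """Spatial size after `num_pools` pooling layers with 'valid' padding."""
    for _ in range(num_pools):
        size //= pool  # 'valid' pooling drops a trailing odd row/column
    return size

lats, lons, last_filters = 128, 128, 32
h = pooled_size(lats, 3)
w = pooled_size(lons, 3)
flat = h * w * last_filters  # length of each timestep's vector after Flatten
print(h, w, flat)  # 16 16 8192
```

For a dimension such as 180 (from the question's example grid), three poolings give 180 -> 90 -> 45 -> 22, so a row is lost at the last step and the size can no longer be restored by doubling alone.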

Then we need to go back up. First we reshape the LSTM output at each timestep into a 2D map (here the 64 units become an 8 x 8 map with one channel), and then we use a combination of UpSampling2D and Conv2D layers to get back to the original map's shape:

model.add(TimeDistributed(Reshape((8, 8, 1))))
model.add(TimeDistributed(UpSampling2D((2,2))))
model.add(TimeDistributed(Conv2D(32, (3,3), activation='relu', padding='same')))
model.add(TimeDistributed(UpSampling2D((2,2))))
model.add(TimeDistributed(Conv2D(32, (3,3), activation='relu', padding='same')))
model.add(TimeDistributed(UpSampling2D((2,2))))
model.add(TimeDistributed(Conv2D(16, (3,3), activation='relu', padding='same')))
model.add(TimeDistributed(UpSampling2D((2,2))))
model.add(TimeDistributed(Conv2D(out_feats, (3,3), padding='same')))
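
Two constraints are implicit in the decoder above: the Reshape target must contain exactly as many elements as the last LSTM layer has units, and the upsampling steps must scale the reshaped map to the grid size. A pure-Python sketch for checking this before building a model (the helper `check_decoder` is hypothetical, not part of Keras):

```python
def check_decoder(lstm_units, reshape_target, num_upsamples, lats, lons, factor=2):
    """Return True if the Reshape and UpSampling2D sizes match the target grid."""
    h, w, c = reshape_target
    # Reshape cannot change the number of elements per timestep
    if h * w * c != lstm_units:
        return False
    scale = factor ** num_upsamples
    return (h * scale, w * scale) == (lats, lons)

print(check_decoder(64, (8, 8, 1), 4, 128, 128))      # True: 8 * 2**4 == 128
print(check_decoder(64, (8, 8, 1), 4, 180, 360))      # False: 128 x 128 is not 180 x 360
print(check_decoder(4050, (45, 90, 1), 2, 180, 360))  # True: 45 * 4 == 180, 90 * 4 == 360
```

The last call only shows one combination that is shape-consistent for a 180 x 360 grid; whether 4050 LSTM units is a sensible choice is a separate question that needs experimenting.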

Here is the model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_132 (TimeDi (None, 50, 128, 128, 16)  592       
_________________________________________________________________
time_distributed_133 (TimeDi (None, 50, 64, 64, 16)    0         
_________________________________________________________________
time_distributed_134 (TimeDi (None, 50, 64, 64, 32)    4640      
_________________________________________________________________
time_distributed_135 (TimeDi (None, 50, 32, 32, 32)    0         
_________________________________________________________________
time_distributed_136 (TimeDi (None, 50, 32, 32, 32)    9248      
_________________________________________________________________
time_distributed_137 (TimeDi (None, 50, 16, 16, 32)    0         
_________________________________________________________________
time_distributed_138 (TimeDi (None, 50, 8192)          0         
_________________________________________________________________
lstm_13 (LSTM)               (None, 50, 64)            2113792   
_________________________________________________________________
time_distributed_139 (TimeDi (None, 50, 8, 8, 1)       0         
_________________________________________________________________
time_distributed_140 (TimeDi (None, 50, 16, 16, 1)     0         
_________________________________________________________________
time_distributed_141 (TimeDi (None, 50, 16, 16, 32)    320       
_________________________________________________________________
time_distributed_142 (TimeDi (None, 50, 32, 32, 32)    0         
_________________________________________________________________
time_distributed_143 (TimeDi (None, 50, 32, 32, 32)    9248      
_________________________________________________________________
time_distributed_144 (TimeDi (None, 50, 64, 64, 32)    0         
_________________________________________________________________
time_distributed_145 (TimeDi (None, 50, 64, 64, 16)    4624      
_________________________________________________________________
time_distributed_146 (TimeDi (None, 50, 128, 128, 16)  0         
_________________________________________________________________
time_distributed_147 (TimeDi (None, 50, 128, 128, 3)   435       
=================================================================
Total params: 2,142,899
Trainable params: 2,142,899
Non-trainable params: 0
_________________________________________________________________
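
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard counting formulas for Conv2D and LSTM layers with biases; the function names are made up for illustration:

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Kernel weights plus one bias per filter, for a square kernel."""
    return (kernel * kernel * in_ch + 1) * out_ch

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

# (in_channels, out_channels) of each 3x3 Conv2D, in model order
convs = [(4, 16), (16, 32), (32, 32), (1, 32), (32, 32), (32, 16), (16, 3)]
conv_counts = [conv2d_params(3, i, o) for i, o in convs]
lstm_count = lstm_params(16 * 16 * 32, 64)  # input is the flattened 8192-vector
total = sum(conv_counts) + lstm_count

print(conv_counts)  # [592, 4640, 9248, 320, 9248, 4624, 435]
print(lstm_count)   # 2113792
print(total)        # 2142899
```

Almost all of the parameters sit in the LSTM layer, because its input is the flattened 8192-element vector; reducing the spatial size or channel count before Flatten shrinks the model the most.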

As you can see, we have an output tensor of shape (50, 128, 128, 3), where 3 is the number of labels we want to predict for each location at each timestep.

Further notes:

As the number of layers and parameters increases (i.e. the model becomes deeper), you may need to deal with problems such as vanishing gradients and overfitting. One remedy for the former is to use a BatchNormalization layer right after each (trainable) layer, so that the data fed to the next layer is normalized. To reduce overfitting you could use Dropout layers (and/or set the dropout and recurrent_dropout arguments of the LSTM layer).
As you can see above, I have assumed that we are feeding the model a timeseries of length 50. This is a matter of the data preprocessing step, where you need to create windowed training (and test) samples from your whole (long) timeseries and feed them in batches to your model for training.
As I have commented in the code, you can stack multiple LSTM layers on top of each other to increase the representational capacity of the network. But be aware that this may increase the training time and make your model (much more) prone to overfitting. So do it only if you have a justified reason (e.g. you have experimented with one LSTM layer and have not gotten good results). Alternatively, you can use GRU layers instead, but there may be a tradeoff between representational capacity and computational cost (i.e. training time) compared to the LSTM layer.
To make the output shape of the network compatible with the shape of your data, you could use a Dense layer after the LSTM layer(s) or adjust the number of units of last LSTM layer.
Obviously, the above code is just for demonstration and you may need to tune its hyperparameters (e.g. number of layers, number of filters, kernel size, optimizer used, activation functions, etc.) and experiment (a lot!) to achieve a final working model with great accuracy.
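If you are training on a GPU, you can use the CuDNNLSTM (CuDNNGRU) layer instead of LSTM (GRU) to increase training speed, as it has been optimized for GPUs. (In TensorFlow 2.x, the standard LSTM and GRU layers use the cuDNN kernel automatically on a GPU when their arguments are compatible with it.)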
And don't forget to normalize the training data (it's very important and helps the training process a lot).
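
The windowing and normalization notes above could look roughly like the following sketch. It assumes the data is already in (time, lats, lons, channels) order; the helper `make_windows`, the random data, the 150-step training split, and the stride of 10 are all hypothetical choices for illustration:

```python
import numpy as np

def make_windows(x, y, num_steps, stride=1):
    """Slice (time, lats, lons, channels) arrays into windows of `num_steps`."""
    starts = range(0, x.shape[0] - num_steps + 1, stride)
    x_win = np.stack([x[s:s + num_steps] for s in starts])
    y_win = np.stack([y[s:s + num_steps] for s in starts])
    return x_win, y_win

# Hypothetical random data: 200 timesteps on a 16 x 16 grid, 4 features, 1 label
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16, 16, 4)).astype("float32")
y = rng.normal(size=(200, 16, 16, 1)).astype("float32")

# Normalize each feature using statistics from the training portion only
split = 150
mean = x[:split].mean(axis=(0, 1, 2), keepdims=True)
std = x[:split].std(axis=(0, 1, 2), keepdims=True)
x_norm = (x - mean) / std

x_train, y_train = make_windows(x_norm[:split], y[:split], num_steps=50, stride=10)
print(x_train.shape, y_train.shape)  # (11, 50, 16, 16, 4) (11, 50, 16, 16, 1)
```

Each window is one training sample, so a single long timeseries yields many samples. Overlapping windows (stride smaller than num_steps) give more samples but they are strongly correlated, so the train/test split is made along the time axis before windowing.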

As far as the various numeric arguments provided for the Conv2D, MaxPooling2D, LSTM, Reshape, and UpSampling2D layers: is it possible for me to use various input dimension sizes and ratios thereof for these arguments (for example the LSTM's units argument or the Conv2D's filters and kernel_size arguments) within a general purpose model definition function, or is it more usual to have these values hard-coded corresponding to known input data dimensions? If so then I wouldn't need to know the number of lats/lons/steps beforehand, allowing for a more general purpose model. Maybe too ambitious... — James Adams, Sep 29 '18 at 01:27
@JamesAdams Well, of course you can write a function that takes these parameters as its input and creates a model based on them. It might even speed up the experimenting process a bit (don't forget that you need to experiment a lot to find the final model, as I said). However, there are some commonly used values. For example, the kernel size of conv layers is usually 3 or 5, and the number of filters is usually a power of two (16, 32, 64, ...); as we go deeper the number of filters increases (and the spatial dimension decreases because of MaxPooling2D). >>> — today, Sep 29 '18 at 08:12
@JamesAdams >>> At the end there is no definite answer to give about hyperparameters and it is an active area of research. It really depends on the data you have and the problem you are trying to solve. There are some packages like [hyperas](https://github.com/maxpumperla/hyperas) that do this kind of hyperparameter tuning. — today, Sep 29 '18 at 08:14
Wouldn't it be better to use `Conv2DTranspose` instead of `UpSampling2D`? — 0x90, Oct 05 '18 at 15:26
@0x90 Well, of course that's another upsampling layer. But whether to use it or not depends on the difficulty of the problem you are solving. Sometimes even the combination of `UpSampling2D` and `Conv2D` would work for a problem. You must experiment and also keep in mind the overhead (in terms of parameters and training time) of adding different layers. — today, Oct 05 '18 at 15:34
