I am doing an image captioning task and am representing both images and captions as vectors. The caption vectors have a length/dimension of 128. The image vectors have a length/dimension of 2048.
What I want to do is to train an autoencoder, to get an encoder which is able to convert a text vector into an image vector, and a decoder which is able to convert an image vector back into a text vector.
Encoder: 128 -> 2048.
Decoder: 2048 -> 128.
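To make the intended shape flow concrete, here is a pure-numpy sketch of the two mappings (random, untrained weights; `W_enc` and `W_dec` are just placeholder names):

```python
import numpy as np

x_dim, y_dim = 128, 2048
rng = np.random.default_rng(0)

W_enc = rng.standard_normal((x_dim, y_dim))   # encoder: 128 -> 2048
W_dec = rng.standard_normal((y_dim, x_dim))   # decoder: 2048 -> 128

caption = rng.standard_normal((1, x_dim))     # one caption vector
image_like = caption @ W_enc                  # shape (1, 2048)
caption_back = image_like @ W_dec             # shape (1, 128)
```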
I followed this tutorial to implement a shallow network that does what I want, but I can't figure out how to build a deep network following the same tutorial.
from keras.layers import Input, Dense
from keras.models import Model
x_dim = 128
y_dim = 2048
x_dim_shape = Input(shape=(x_dim,))
encoded = Dense(512, activation='relu')(x_dim_shape)
encoded = Dense(1024, activation='relu')(encoded)
encoded = Dense(y_dim, activation='relu')(encoded)
decoded = Dense(1024, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(decoded)
decoded = Dense(x_dim, activation='sigmoid')(decoded)
# this model maps an input to its reconstruction
autoencoder = Model(input=x_dim_shape, output=decoded)
# this model maps an input to its encoded representation
encoder = Model(input=x_dim_shape, output=encoded)
encoded_input = Input(shape=(y_dim,))
decoder_layer1 = autoencoder.layers[-3]
decoder_layer2 = autoencoder.layers[-2]
decoder_layer3 = autoencoder.layers[-1]
# create the decoder model
decoder = Model(input=encoded_input, output=decoder_layer3(decoder_layer2(decoder_layer1(encoded_input))))
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(training_data_x, training_data_y,
                nb_epoch=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_data_x, test_data_y))
training_data_x and test_data_x are arrays of 128-dimensional vectors; training_data_y and test_data_y are arrays of 2048-dimensional vectors.
The error I receive while trying to run this is the following:
Exception: Error when checking model target: expected dense_6 to have shape (None, 128) but got array with shape (32360, 2048)
dense_6 is the final Dense layer, i.e. the last layer assigned to the decoded variable.
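If it helps, this is how I read the check that fails (just a sketch of the shapes involved, not actual Keras internals):

```python
# The final layer is Dense(x_dim=128), so the model's output shape,
# and therefore the target shape Keras expects, is (None, 128).
model_output_shape = (None, 128)   # from the last Dense(x_dim) layer
target_shape = (32360, 2048)       # shape of training_data_y

# The trailing dimensions disagree, which is exactly what the exception says.
shapes_match = model_output_shape[1] == target_shape[1]
```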