
I am trying to run a classification task with a convolutional neural network on DNA sequences. Each DNA sequence is converted to an input array encoded as one-hot vectors. For example, "ACTG" is encoded as [[1,0,0,0], [0,1,0,0], [0,0,0,1], [0,0,1,0]]. I have encoded each of the samples like that, so the dimension of the input is number_of_samples * length_of_samples * 4. I am now trying to understand how a 1D convolution would work on an input array like this one, but I cannot work out what the output of the 1D convolution would look like. I would really appreciate some help. For reference I am using this code by the Kundaje Lab, Stanford University. I cannot understand how a 1D convolution would work for an input of 3 dimensions.
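To make the encoding concrete, here is a minimal sketch of it (assuming the base order A, C, G, T, which is consistent with the example above):

import numpy as np

# Assumed base-to-index mapping, consistent with the "ACTG" example:
# A -> 0, C -> 1, G -> 2, T -> 3
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot array."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        encoded[i, BASE_INDEX[base]] = 1.0
    return encoded

print(one_hot_encode("ACTG"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]]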

2 Answers


Here is the documentation for the Keras Conv1D layer, which describes the input to the model as sequences of a fixed length with a fixed number of features per step (as given in the example there, (10, 128): sequences of 10 time steps, each step with 128 features).

1D convolution can be thought of as running a window along a single spatial or temporal dimension of 2D data. This Stack Overflow answer gives a pretty clear explanation of the various types of Conv layers.
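To make the shape arithmetic concrete, here is a small illustration of my own (using the (10, 128) shape from the docs) showing that Conv1D slides only along the length dimension while each filter spans all features:

import numpy as np
from tensorflow.keras.layers import Conv1D

# One batch of a single sequence: 10 steps, 128 features per step
x = np.random.random((1, 10, 128)).astype(np.float32)

# 32 filters, each spanning 3 steps x all 128 features
y = Conv1D(32, 3)(x)

# The window slides along the length only: 10 - 3 + 1 = 8 positions
print(y.shape)  # (1, 8, 32)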

Coming to your problem, I have made a toy program with a conv layer and random data, which I think you might find useful.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Conv1D, Flatten, Dense

# 64 random samples of shape (4, 4): length 4, with 4 channels per step
data = np.random.random((64, 4, 4))
labels = np.random.random((64, 2))
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
dataset = dataset.batch(2).repeat()

inputs = Input(shape=(4, 4))  # (steps, channels); the batch dimension is implicit

x = Conv1D(32, 3, activation='relu')(inputs)  # 32 filters, each spanning 3 steps x 4 channels
x = Flatten()(x)
x = Dense(32, activation='relu')(x)
predictions = Dense(2, activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=predictions)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
# TF 1.x-style iterator; with TF 2.x you can pass `dataset` directly to fit()
model.fit(dataset.make_one_shot_iterator(), epochs=5, steps_per_epoch=100)

Result:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 4, 4)              0         
_________________________________________________________________
conv1d (Conv1D)              (None, 2, 32)             416       
_________________________________________________________________
flatten (Flatten)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 66        
=================================================================
Total params: 2,562
Trainable params: 2,562
Non-trainable params: 0

Epoch 1/5
100/100 [==============================] - 1s 11ms/step - loss: 0.7027 - acc: 0.5450
Epoch 2/5
100/100 [==============================] - 1s 7ms/step - loss: 0.6874 - acc: 0.6000
Epoch 3/5
100/100 [==============================] - 1s 7ms/step - loss: 0.6838 - acc: 0.6200
Epoch 4/5
100/100 [==============================] - 1s 7ms/step - loss: 0.6753 - acc: 0.6100
Epoch 5/5
100/100 [==============================] - 1s 7ms/step - loss: 0.6656 - acc: 0.6300

Now you could replace the first 4 there with your sequence length, i.e. a shape of (length_of_sequence, 4), and define your own model this way. However, if you want to use something like (None, 4) for the case where your sequences have variable length, you would run into trouble with the Dense layer when using the TensorFlow backend, because it requires the last dimension of its input to be known, and after Flatten that dimension depends on the sequence length. So you should probably decide on a fixed length that matches this requirement; see the sketch below for one alternative.
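As an aside (my own addition, not part of the original answer): one common way around this, if your sequences really do vary in length, is to replace Flatten with a global pooling layer, which produces a fixed-size vector regardless of sequence length. A minimal sketch:

from tensorflow import keras
from tensorflow.keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense

# None in the length position allows variable-length sequences
inputs = Input(shape=(None, 4))
x = Conv1D(32, 3, activation='relu')(inputs)
x = GlobalMaxPooling1D()(x)  # (batch, 32) regardless of sequence length
predictions = Dense(2, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=predictions)
model.summary()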

kvish

Recently I stumbled on the same problem; in particular, I wondered how a 1D convolution treats a one-hot encoded vector with n*4 dimensions. What happens to the 4 dimensions corresponding to the nucleotide type? In my model I worked on sequences of 125 bp, and in the first Conv1D layer I had 1500 filters with kernel_size=10 and strides=1. Here is the model summary, which helps to understand what happens:

Layer (type)                 Output Shape          Param #   
=================================================================
input_1 (InputLayer)         [(None, 125, 4)]      0         
_________________________________________________________________
conv1d (Conv1D)              (None, 116, 1500)     61500     

In the input layer we correctly see the batch dimension followed by the sequence and one-hot dimensions. Then in the output of Conv1D the sequence dimension shortens to (125-10+1) = 116, a 1500-channel dimension is added, and the one-hot dimension disappears! So how is the information about the nucleotide type stored in the convolution? We can see this from the number of parameters: 61500 = 1500*(4*10 + 1), i.e. each of the 1500 filters has 4*10 weights plus 1 bias. From this equality we see that the weight matrix stores an individual weight for each nucleotide at each position of the convolutional filter, so we do not lose any information when we apply it to the sequence. And that is why it is important not to mix up the order: sequence dimension first, then the one-hot encoding.
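To double-check this, here is a small sketch of my own that reconstructs the layer described above and confirms the shapes and parameter count:

from tensorflow import keras
from tensorflow.keras.layers import Input, Conv1D

inputs = Input(shape=(125, 4))  # 125 bp, one-hot over 4 bases
x = Conv1D(1500, kernel_size=10, strides=1)(inputs)
model = keras.Model(inputs, x)
model.summary()
# conv1d output: (None, 116, 1500), since 125 - 10 + 1 = 116
# conv1d params: 1500 filters * (10 positions * 4 bases + 1 bias) = 61500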

gregoruar