1

I have a data frame like this, of DNA sequences:

Feature         Label
GCTAGATGACAGT   0
TTTTAAAACAG     1
TAGCTATACT      2    
TGGGGCAAAAAAAA  0
AATGTCG         3
AATGTCG         0
AATGTCG         1

Where there is one column with a DNA sequence, and a label that can either be 0,1,2,3 (i.e. a category of that DNA sequence). I want to develop a NN that predicts probability of classification of each sequence into the 1,2 or 3 category (not 0, i don't care about 0). Each sequence can appear multiple times in the data frame, and it is possible that each sequence appears in multiple (or all) categories. So the output should look like this:

GCTAGATGACAGT   (0.9,0.1,0.2)
TTTTAAAACAG     (0.7,0.6,0.3)
TAGCTATACT      (0.3,0.3,0.2)    
TGGGGCAAAAAAAA  (0.1,0.5,0.6)

Where the numbers in the tuple are the probability that the sequence is found in category 1,2 and 3.

This is my code:

import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from sklearn.model_selection import StratifiedKFold
from keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib
from matplotlib import pyplot
import os
from random import random
from numpy import array
from numpy import cumsum
import pandas as pd
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
os.environ['KMP_DUPLICATE_LIB_OK']='True'
%matplotlib
from sklearn.feature_extraction.text import CountVectorizer


# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

#read in the file
df = pd.read_csv('dna_sequences.txt')
X = list(df['dna_sequence'])
y = list(df['class'])


#convert the sequences to integers for learning
tokenizer = Tokenizer(num_words=5,char_level=True)
tokenizer.fit_on_texts(X)
data_encoded = tokenizer.texts_to_matrix(X,mode='count')

kf = kfold.get_n_splits(data_encoded)
cvscores = []

#for each train, test in cross validation sub-set
for train, test in kfold.split(data_encoded, y):

    X_train, X_test = data_encoded[train], data_encoded[test]
    y_train, y_test = data_encoded[train], data_encoded[test]

    #add layers to model
    model = Sequential()
    model.add(Embedding(3000, 32, input_length=5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(5, 1)))
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    model.add(Dense(5, activation='sigmoid'))

    #compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    #check the model
    print(model.summary())

    #monitor val accuracy and perform early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

    #fit the model
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64) #change values

    #evaluate the model
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)

#check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))



#predict for new set of seqs
pred_list = ['GTGTGCGCT','GGGGGTCGCTCCCCCC','AAATGTTGT','GTGTGTGGG','CCCCTATATA']

#output a probability of sequence being found in each class as described above, and plot accuracy and loss

It runs, and prints the accuracy as expected (the accuracy isn't great, 62%, but I can work on that, this is my first NN, just want to get an example running).

My question is specifically regarding predicting. Could someone show me an example of the jump from fitting the model (which I have above), to actually predicting. I think the algorithm involves:

  1. finding the best model from cross validation (i tried to incorporate this with the monitor val accuracy section)
  2. the list of sequences to predict the class for is in pred_list
  3. fit the best model from training to pred_list
  4. return probabilities as described at top of question.

I know this exist from other questions (e.g. here):

prediction = model.predict(np.array(tk.texts_to_sequences(text)))
print(prediction)

....but I don't know how to incorporate this with cross validation, and also in a way that I get my desired output (i.e. a three class probability per sequence of being assigned to class 1,2 or 3 in the training data set, where each sequence can appear in multiple classes).

Edit 1: based on the below comment, I changed the end of the code to:

(in cross validation loop, so should be indented)

#evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)

#predict for new set of seqs
pred_list = ['GTGTGCGCT','GGGGGTCGCTCCCCCC','AAATGTTGT','GTGTGTGGG','CCCCTATATA', 'GGGGGGGGGTTTTTTTT']
prediction = model.predict(np.array(tokenizer.texts_to_sequences(pred_list)))
predcvscores.append(prediction)

(out of cross validation loop)

print(predcvscores)

#check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

I get the error:

Error when checking input: expected embedding_3_input to have shape (5,) but got array with shape (1,)

I guess what this is saying is that I can't just read in a set of sequences like pred_list? Is it not possible to do this, or am I not taking the right approach? Also, I'm not sure that this method will give me, for each item in pred_list, an output that is a probability of appearing in category 1,2 or 3, but maybe I am wrong and it will.

desertnaut
  • 57,590
  • 26
  • 140
  • 166
Slowat_Kela
  • 1,377
  • 2
  • 22
  • 60
  • Side note: this is my first ever NN so if anyone has any other code suggestion changes to the above code I appreciate it, right now I'm trying to improve my code one step at a time with specific questions in places that I know need improvement. – Slowat_Kela May 13 '20 at 21:13
  • one thing i would like to add is, it is not necessary, that best model in your 5 fold cross val should give the best result at your test data. – Talha Anwar May 13 '20 at 21:19
  • simply you have to put `prediction = model.predict(np.array(tk.texts_to_sequences(text))) ` above `cvscores.append(scores[1] * 100)` and append predictions in a new list – Talha Anwar May 13 '20 at 21:22
  • the new list has 5 arrays of predictions, and your cv score has 5 accuracies, you can then figure out at which index you got the maximum accuracy and then get that index from prediction list. – Talha Anwar May 13 '20 at 21:23

1 Answers1

2

You ask too many and quite irrelevant things in a single question, and there are several issues in it. I'll try to address what I see as the most serious ones.

To start with, if you have cases of the form

Feature         Label
AATGTCG         3
AATGTCG         0
AATGTCG         1

i.e. the very same single feature can belong to either class 0, 1, or 3, without any other features present, then the message here is that supervised classification is probably not a good fit for your problem at hand; to be so, you should use additional features.

If, as you say, you are interested only in classes 1, 2, and 3, and

not 0, i don't care about 0

then the very first thing you should do already in your data preparation stage is to remove all instances of class 0 from your dataset; it is not clear if you do so here, and even if you do, it is again not clear why you leave class 0 still into the discussion.

Second (and assuming that you have indeed only 3 classes left in your classification problem), what you show as expected output of the model:

GCTAGATGACAGT   (0.9,0.1,0.2)
TTTTAAAACAG     (0.7,0.6,0.3)
TAGCTATACT      (0.3,0.3,0.2)    
TGGGGCAAAAAAAA  (0.1,0.5,0.6)

is not correct; in multi-class classification, the returned probabilities (i.e. the numbers in the parentheses here) must add up exactly to 1, which is not the case here.

Third, since you have a multi-class classification problem, your loss should be categorical_crossentropy, and not binary_crossentropy, which is used for binary classification problems only.

Fourth, and again assuming that you are left with 3 classes only, the last layer of your model should be

model.add(Dense(3, activation='softmax') # no. of units here should be equal to the no. of classes)

while your labels y should be one-hot encoded (you can easily do that using the Keras function to_categorical).

Fifth, looking closely at your data definition in the beginning of your loop:

X_train, X_test = data_encoded[train], data_encoded[test]
y_train, y_test = data_encoded[train], data_encoded[test]

you can easily see that you pass the features both as features and as labels. I can only guess that this must be a typo from your side; the labels should be:

y_train, y_test = y[train], y[test]

Regarding your prediction-time error

Error when checking input: expected embedding_3_input to have shape (5,) but got array with shape (1,)

this is due to the argument input_length=5 in your embedding layer. I'll confess here that I am not at all familiar with Keras embedding layers; you might want to check the documentation to ensure that this argument and the assigned value do indeed what you think/intend to do.

Other than that, regarding your specific question:

My question is specifically regarding predicting. Could someone show me an example of the jump from fitting the model (which I have above), to actually predicting.

you should just re-compile and re-fit the model again outside the CV loop (possibly using the "best" number of epochs found during CV) with the whole of your data, and then use it for predictions.


I guess it should be clear by now that, given the above issues, your reported accuracy of 62% does not actually mean anything; for good or bad, Keras will not "protect" you if you attempt to do things that are not meaningful from a modeling perspective (like most of the things I have mentioned above), like using binary cross entropy in a multi-class problem, or using accuracy in a regression setting...

desertnaut
  • 57,590
  • 26
  • 140
  • 166