I have a data frame of DNA sequences like this:
Feature Label
GCTAGATGACAGT 0
TTTTAAAACAG 1
TAGCTATACT 2
TGGGGCAAAAAAAA 0
AATGTCG 3
AATGTCG 0
AATGTCG 1
There is one column with a DNA sequence and one with a label that can be 0, 1, 2, or 3 (i.e. a category of that DNA sequence). I want to develop a NN that predicts the probability that each sequence is classified into category 1, 2, or 3 (not 0, I don't care about 0). Each sequence can appear multiple times in the data frame, and it is possible that a sequence appears in multiple (or all) categories. So the output should look like this:
GCTAGATGACAGT (0.9,0.1,0.2)
TTTTAAAACAG (0.7,0.6,0.3)
TAGCTATACT (0.3,0.3,0.2)
TGGGGCAAAAAAAA (0.1,0.5,0.6)
Where the numbers in the tuple are the probabilities that the sequence is found in categories 1, 2, and 3, respectively.
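Since a sequence can carry several labels at once, this is really a multi-label setup. A minimal sketch of what I mean by the targets (toy data; my real column names are dna_sequence and class): collapse the repeated rows into one multi-hot row per unique sequence, then drop the 0 column.
import pandas as pd

# toy version of the data frame above
df_toy = pd.DataFrame({'dna_sequence': ['AATGTCG', 'AATGTCG', 'AATGTCG', 'TAGCTATACT'],
                       'class': [3, 0, 1, 2]})
# one row per unique sequence, one column per label; 1 if that pair ever occurs
multi_hot = pd.crosstab(df_toy['dna_sequence'], df_toy['class']).clip(upper=1)
# drop label 0, keeping the 1/2/3 columns I actually care about
targets = multi_hot.drop(columns=0)
print(targets)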
This is my code:
import os
import numpy
import numpy as np  # both spellings are used below
import pandas as pd
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Bidirectional
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import StratifiedKFold
os.environ['KMP_DUPLICATE_LIB_OK']='True'
%matplotlib
# define 10-fold cross validation test harness
seed = 7  # fix the random seed so the folds are reproducible
numpy.random.seed(seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
#read in the file
df = pd.read_csv('dna_sequences.txt')
X = list(df['dna_sequence'])
y = numpy.array(df['class'])  # an array, so it can be indexed by the fold index arrays below
#convert the sequences to integers for learning
#char-level tokenization: the 4 DNA letters plus the reserved index 0, hence num_words=5
tokenizer = Tokenizer(num_words=5, char_level=True)
tokenizer.fit_on_texts(X)
data_encoded = tokenizer.texts_to_matrix(X, mode='count')  # shape (n_sequences, 5)
cvscores = []
#for each train, test split in the cross validation
for train, test in kfold.split(data_encoded, y):
    X_train, X_test = data_encoded[train], data_encoded[test]
    y_train, y_test = y[train], y[test]  # the labels, not the encoded features
    #add layers to model
    model = Sequential()
    model.add(Embedding(3000, 32, input_length=5))
    model.add(Dropout(0.2))
    model.add(Bidirectional(LSTM(20, return_sequences=True)))
    model.add(LSTM(100))
    model.add(Dropout(0.2))
    #four output units (labels 0-3) with softmax, so each row is a probability distribution
    model.add(Dense(4, activation='softmax'))
    #compile the model; sparse_categorical_crossentropy matches the integer labels
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    #check the model
    model.summary()
    #monitor validation loss/accuracy and checkpoint the best model seen so far
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
    #fit the model (the callbacks must be passed to fit to take effect)
    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              epochs=5, batch_size=64, callbacks=[es, mc]) #change values
    #evaluate the model
    scores = model.evaluate(X_test, y_test, verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
#check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
#predict for new set of seqs
pred_list = ['GTGTGCGCT','GGGGGTCGCTCCCCCC','AAATGTTGT','GTGTGTGGG','CCCCTATATA']
#output a probability of sequence being found in each class as described above, and plot accuracy and loss
It runs and prints the accuracy as expected (the accuracy isn't great, 62%, but I can work on that; this is my first NN and I just want to get an example running).
My question is specifically regarding predicting. Could someone show me an example of the jump from fitting the model (which I have above) to actually predicting? I think the algorithm involves the following (my rough attempt is sketched after this list):
- finding the best model from cross validation (I tried to incorporate this with the "monitor val accuracy" section)
- taking the list of sequences to predict the class for, which is in pred_list
- applying the best model from training to pred_list
- returning probabilities as described at the top of the question.
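To make that concrete, here is my rough attempt at that jump as code (just a sketch of what I think it should look like, reusing the best_model.h5 checkpoint and the fitted tokenizer from above; I haven't verified it):
from keras.models import load_model

# reload the best model checkpointed during cross validation
best_model = load_model('best_model.h5')
# encode the new sequences exactly the way the training data was encoded
pred_list = ['GTGTGCGCT','GGGGGTCGCTCCCCCC','AAATGTTGT','GTGTGTGGG','CCCCTATATA']
pred_encoded = tokenizer.texts_to_matrix(pred_list, mode='count')
# one row of class probabilities per sequence; columns 1-3 are the categories I care about
probs = best_model.predict(pred_encoded)
for seq, p in zip(pred_list, probs):
    print(seq, tuple(p[1:4]))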
I know this exists from other questions (e.g. here):
prediction = model.predict(np.array(tk.texts_to_sequences(text)))
print(prediction)
...but I don't know how to incorporate this with cross validation, or how to get my desired output (i.e. three class probabilities per sequence, for classes 1, 2, and 3 from the training data, where each sequence can appear in multiple classes).
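The only idea I have for the cross validation part is to collect the model.predict output inside the loop and average the folds afterwards. A toy sketch of just the averaging step (made-up arrays standing in for the real per-fold predictions):
import numpy as np

# pretend these are the (n_sequences, 4) probability arrays from three folds
fold1 = np.array([[0.1, 0.8, 0.05, 0.05], [0.2, 0.1, 0.6, 0.1]])
fold2 = np.array([[0.2, 0.7, 0.05, 0.05], [0.1, 0.2, 0.5, 0.2]])
fold3 = np.array([[0.0, 0.9, 0.05, 0.05], [0.3, 0.1, 0.4, 0.2]])
# element-wise mean over folds: one averaged probability row per sequence
mean_probs = np.mean([fold1, fold2, fold3], axis=0)
print(mean_probs)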
Edit 1: based on a comment below, I changed the end of the code to:
(in cross validation loop, so should be indented)
#evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)
#predict for new set of seqs
pred_list = ['GTGTGCGCT','GGGGGTCGCTCCCCCC','AAATGTTGT','GTGTGTGGG','CCCCTATATA', 'GGGGGGGGGTTTTTTTT']
prediction = model.predict(np.array(tokenizer.texts_to_sequences(pred_list)))
predcvscores.append(prediction)
(out of cross validation loop)
print(predcvscores)
#check the accuracy
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
I get the error:
Error when checking input: expected embedding_3_input to have shape (5,) but got array with shape (1,)
I guess what this is saying is that I can't just read in a set of sequences like pred_list? Is it not possible to do this, or am I not taking the right approach? Also, I'm not sure this method will give me, for each item in pred_list, an output that is the probability of appearing in category 1, 2, or 3, but maybe I am wrong and it will.
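For what it's worth, my current guess at the cause: texts_to_sequences returns one integer per character, so every sequence comes back with its own length, while the model was fitted on texts_to_matrix output, which always has width 5. A small sketch that shows the mismatch (same tokenizer settings as above):
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5, char_level=True)
tokenizer.fit_on_texts(['GCTAGATGACAGT', 'TTTTAAAACAG'])
pred_list = ['GTGTGCGCT', 'GGGGGTCGCTCCCCCC']
# ragged: one integer index per character, so lengths 9 and 16
print([len(s) for s in tokenizer.texts_to_sequences(pred_list)])
# fixed width 5, matching what the model was trained on
print(tokenizer.texts_to_matrix(pred_list, mode='count').shape)  # (2, 5)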