I'm reading about neural networks, and would also like to build my first NN at the same time (to complement my reading).
I have a data set like this:
DNA_seq Sample1Name Sample2Name ConcOfDNAInSample DNASeqFoundInProcessCat
AGGAG cat_0 cat_1 0.1 found_in_0
AGGAG cat_1 cat_2 0.4 found_in_3
ACCCC cat_1 cat_7 0.1 found_in_2
AGAGAGA cat_2 cat_10 1.9 found_in_1
ADAS cat_332 cat_103 8.9 found_in_1
Columns:
- DNA_seq -> a string of a DNA sequence (i.e. 'the sequences')
- Sample1Name -> categorical value explaining a chemical property of the solution that DNASeq is in.
- Sample2Name -> categorical value explaining a chemical property of the solution that DNASeq is in.
- ConcOfDNAInSample -> a quantitative value of DNA concentration in Sample2Name.
- DNASeqFoundInProcessCat -> This is the label that I want to predict. It is a categorical value with four categories (found_in_0 -> found_in_3). It is the output of three tests I ran on each DNA_seq to see whether, after I manipulate the original solution (which is found_in_0), the DNA_seq is still present.
My question: For an unseen set of sequences, I want the output set of labels to be a multi-class probability of 'found_in_1', 'found_in_2', 'found_in_3'.
i.e. if the above example was the output from my test set, my output would ideally look like this:
DNA_seq Sample1Name Sample2Name ConcOfDNAInSample DNASeqFoundInProcessCat
AGGAG cat_0 cat_1 0.1 (0.9,0.5,0.1)
AGGAG cat_1 cat_2 0.4 (0.8,0.7,0.3)
ACCCC cat_1 cat_7 0.1 (0.2,0.5,0.3)
AGAGAGA cat_2 cat_10 1.9 (0.7,0.2,0.9)
ADAS cat_332 cat_103 8.9 (0.6,0.8,0.7)
There are some notes:
It is possible that, because of the processes I am doing, some sequences are NOT in the original solution (found_in_0), but because bits of DNA can stick together, they CAN subsequently be in the other classes (found_in_1, found_in_2, found_in_3).
I am only interested in the output for the found_in_1, found_in_2 and found_in_3 class (i.e. I want a three class probability at the end, not a four class probability with found_in_0).
I am able to generate other features from the DNA seqs; this is just an example.
I can see from my data that the data set is imbalanced: the amount of data in found_in_3 is significantly lower than in the others (my full training data is about 80,000 rows, but only about 10,000 of those rows are found_in_3; the rest are all found_in_0, found_in_1 or found_in_2). My rough idea for handling this is sketched right after these notes.
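My current thought for the imbalance (just an assumption on my part, not something I've settled on) is to compute per-class weights from the label column and pass them to model.fit, roughly like this:
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv('data')  # the same file I read in at step 1 below
labels = df['DNASeqFoundInProcessCat']

# 'balanced' weights each class inversely proportional to its frequency
weights = compute_class_weight('balanced', classes=np.unique(labels), y=labels)
# keys assume the labels get integer-encoded 0..3 in sorted order
class_weight = dict(enumerate(weights))

# later: model.fit(X_train, Y_train, epochs=150, class_weight=class_weight)
# (I still need to check how class_weight behaves with a multi-label output)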
What I'm trying to work out is the overall algorithm, and one specific point in particular. My idea was:
1. Read in the data.
import pandas as pd
df = pd.read_csv('data')
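This is also where I can take a quick first look at the label distribution (column name as in my table above):
# quick check of how many rows fall into each found_in_* category
print(df['DNASeqFoundInProcessCat'].value_counts())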
2. Split the data set into train and test.
from sklearn.model_selection import train_test_split
# X = the feature columns, y = the DNASeqFoundInProcessCat label column (see step 3)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
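Given the imbalance, I assume I should also stratify this split so both sets keep roughly the same class proportions (again, just my assumption):
# stratify on the labels so the rare found_in_3 class appears in both splits
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)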
3. Understand the data set (i.e. this is where I saw the under-representation mentioned in the notes above). I have a series of functions for this... so let's say I end up with a standardised data set, which is the table above. A rough sketch of the feature/target preparation I'm imagining is below.
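Something along these lines (the 3-mer counts are just one option for sequence features, and all column names follow my table above; none of this is fixed):
import pandas as pd

# one-hot encode the two categorical sample columns
X_cat = pd.get_dummies(df[['Sample1Name', 'Sample2Name']])

# simple sequence features: counts of each 3-mer in DNA_seq
def kmer_counts(seq, k=3):
    return pd.Series(seq[i:i + k] for i in range(len(seq) - k + 1)).value_counts()

X_seq = df['DNA_seq'].apply(kmer_counts).fillna(0)

# numeric feature used as-is
X_num = df[['ConcOfDNAInSample']]

X = pd.concat([X_cat, X_seq, X_num], axis=1)

# three binary targets, one per class I care about (found_in_0 is dropped)
y = pd.DataFrame({
    'found_in_1': (df['DNASeqFoundInProcessCat'] == 'found_in_1').astype(int),
    'found_in_2': (df['DNASeqFoundInProcessCat'] == 'found_in_2').astype(int),
    'found_in_3': (df['DNASeqFoundInProcessCat'] == 'found_in_3').astype(int),
})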
4. Build the neural network.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
I know the general idea here would be the TensorFlow equivalent of doing this in Keras (this example is for the 'iris' data set: I initialise a model, add some layers with an activation function, compile the model, print a summary of the model, fit the model, and then predict after this (not shown)):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))  # 4 input features for iris
model.add(Dense(8, activation='relu'))               # input_dim is only needed on the first layer
model.add(Dense(3, activation='softmax'))            # 3 mutually exclusive classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(x_train, y_train, epochs=150, verbose=0)
So I understand I want to replicate a similar set of steps for my data, and I'm trying to work out how to do this. What I can't understand is: do I have to use tf.nn.sigmoid_cross_entropy_with_logits for this problem (since each input can belong to more than one label, i.e. it can be present in found_in_1, found_in_2 and found_in_3, and this can produce a probability output per class)?
Or can I just use a softmax output like in the example above?
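In other words, is the multi-label version something like this sketch (a sigmoid unit per class with binary cross-entropy; n_features here is just a placeholder for however wide my final feature matrix ends up)?
from keras.models import Sequential
from keras.layers import Dense

n_features = X_train.shape[1]  # width of my feature matrix

model = Sequential()
model.add(Dense(8, input_dim=n_features, activation='relu'))
model.add(Dense(8, activation='relu'))
# one sigmoid unit per class -> three independent probabilities
# for found_in_1, found_in_2 and found_in_3
model.add(Dense(3, activation='sigmoid'))
# binary_crossentropy is (roughly) the Keras-level counterpart of
# tf.nn.sigmoid_cross_entropy_with_logits applied to each output unit
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, Y_train, epochs=150, verbose=0)
probs = model.predict(X_test)  # shape (n_samples, 3): one probability per class per row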