I'm trying to classify a list X
of vectors, each of length 200, whose entries are integer values drawn from a dictionary vocab
of 100 features, as belonging to class 0 or 1. Here's an example of my input data:
X=[[1,22,77,54,51,...],[2,3,1,41,3,...],[12,17,31,4,12,...],...]
y=[0,1,1,...]
So, for example, np.array(X).shape=(1000,200)
and y.shape=(1000,)
. The classes are split 50-50. I did a standard train/test split into (X,y) and (X_test,y_test).
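For reference, a minimal sketch of that split using only NumPy (the seed, the 80/20 ratio, and the synthetic data are assumptions for illustration; sklearn's train_test_split does the same thing):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed for reproducibility
n_samples, seq_len, vocab_size = 1000, 200, 100

# synthetic stand-ins shaped like the real data
X = rng.integers(1, vocab_size + 1, size=(n_samples, seq_len))
y = rng.integers(0, 2, size=n_samples)

# shuffle indices, then cut 80/20 into train and test
perm = rng.permutation(n_samples)
cut = int(0.8 * n_samples)
train_idx, test_idx = perm[:cut], perm[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```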
My model is:
from keras import layers as L
model = keras.models.Sequential()
model.add(L.Embedding(input_dim=100+1,output_dim=32,\
input_length=200))
model.add(L.SimpleRNN(64))
model.add(L.Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',\
metrics=['accuracy'])
model.fit(X,y,batch_size=128,epochs=20,validation_data=(X_test,y_test))
This works fairly well when I fit it to training and testing data. However, I wanted to try skipping the embedding, since I have a "small" space of features (just 100 values). I normalized the training data by dividing by 100
and tried to build the simple RNN model as follows:
model = keras.models.Sequential()
model.add(L.InputLayer(input_shape=(200,1)))
model.add(L.SimpleRNN(64))
model.add(L.Dense(1,activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',\
metrics=['accuracy'])
model.fit(X[:,:,np.newaxis],y,batch_size=128,epochs=20,validation_data=(X_test[:,:,np.newaxis],y_test))
I have to add the np.newaxis
to get the model to compile. When I fit this to the data, I always get training and validation accuracies of 0.5, which matches my 50-50 class split. I've tried different activations, different optimizers, different numbers of units in the RNN, different batch sizes, LSTM, GRU, adding dropout, multiple layers... nothing works.
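To make the setup above concrete, here is a minimal NumPy-only sketch of the preprocessing for the non-embedding model: scale the integer token ids into [0, 1], then add the trailing feature axis that SimpleRNN expects, so the corrected model can take input_shape=(200, 1). The vocab size of 100 and the synthetic data are assumptions for illustration:

```python
import numpy as np

vocab_size = 100  # assumed size of the vocabulary described above
rng = np.random.default_rng(1)

# synthetic stand-in: 1000 sequences of 200 token ids in 1..100
X = rng.integers(1, vocab_size + 1, size=(1000, 200))

# scale ids into [0, 1] so the RNN sees small, comparable inputs
X_scaled = X / vocab_size

# SimpleRNN expects input of shape (batch, timesteps, features),
# so add a trailing feature axis of size 1
X_rnn = X_scaled[:, :, np.newaxis]
```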
My questions are:
I have vectors of fixed length (200) to classify, with a vocabulary of just 100 features. Shouldn't one be able to do this without an embedding?
Does anyone have useful advice for getting the non-embedding model to actually train?