
I'm trying to classify a list X of vectors, each of length 200, containing integer values drawn from a vocabulary of 100 features, as belonging to either class 0 or class 1. Here's an example of my input data:

X=[[1,22,77,54,51,...],[2,3,1,41,3,...],[12,17,31,4,12,...],...]
y=[0,1,1,...]

So, for example, np.array(X).shape=(1000,200) and y.shape=(1000,). The classes are split 50-50. I did a standard train_test_split into (X, y) and (X_test, y_test).

My model is:

import keras
from keras import layers as L

model = keras.models.Sequential()
model.add(L.Embedding(input_dim=100+1, output_dim=32,
                      input_length=200))
model.add(L.SimpleRNN(64))
model.add(L.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X, y, batch_size=128, epochs=20, validation_data=(X_test, y_test))

This works fairly well when I fit it to the training and testing data. However, I wanted to try skipping the embedding, since I have a "small" space of features (100). I normalized the training data by dividing by 100 and tried to build the simple RNN model as follows:

import numpy as np

model = keras.models.Sequential()
model.add(L.InputLayer(input_shape=(200, 1)))
model.add(L.SimpleRNN(64))
model.add(L.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X[:, :, np.newaxis], y, batch_size=128, epochs=20,
          validation_data=(X_test[:, :, np.newaxis], y_test))

I have to add the np.newaxis to get the model to compile. When I fit this to the data, I always get training and validation accuracies of 0.5, which matches my 50-50 class split. I've tried different activations, different optimizers, different numbers of units in the RNN, different batch sizes, LSTM, GRU, adding dropout, multiple layers... nothing works.

My questions are:

  1. I have vectors of fixed length (200) to classify, with a vocabulary of just 100 features. Shouldn't one be able to do this without an embedding?

  2. Does anyone have useful advice for getting the non-embedding model to actually train?

AstroBen

1 Answer


A recurrent layer requires inputs of shape (batch_size, timesteps, input_dim), where input_dim is the number of categories in your input data, and those categories have to be one-hot encoded, e.g. [1, 3], [0, 2] becomes [[0, 1, 0, 0], [0, 0, 0, 1]], [[1, 0, 0, 0], [0, 0, 1, 0]].

Your data, however, is of shape (batch_size, timesteps) and sparsely encoded, meaning that the position of the 1 in the encoding above is given implicitly by the category number. Just adding a new axis to the array brings it to the correct shape, so Keras won't raise any error, but the data is not encoded correctly, and thus your training doesn't work at all.

It actually works with an Embedding layer because, in contrast to a recurrent layer, the embedding layer expects input of exactly that shape and encoding (compare the input shape of the RNN with that of the Embedding).

To solve this problem, you just have to one-hot encode your data. Keras provides the very convenient to_categorical utility function for this, but you can also do it by hand.
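A minimal sketch of the fix, assuming the integer values run from 0 to 100 (so num_classes=101, matching the input_dim=100+1 used above); the names X_onehot and X_test_onehot are illustrative:

import keras
from keras import layers as L
from keras.utils import to_categorical

num_classes = 101  # assumption: vocabulary indices 0..100, as in input_dim=100+1

# (n_samples, 200) sparse integers -> (n_samples, 200, 101) one-hot vectors
X_onehot = to_categorical(X, num_classes=num_classes)
X_test_onehot = to_categorical(X_test, num_classes=num_classes)

model = keras.models.Sequential()
model.add(L.InputLayer(input_shape=(200, num_classes)))
model.add(L.SimpleRNN(64))
model.add(L.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_onehot, y, batch_size=128, epochs=20,
          validation_data=(X_test_onehot, y_test))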

randhash
  • thanks!!! Oddly enough, I knew that but just didn't think of it. *derp* – AstroBen Jul 16 '18 at 12:53
  • Glad to help. If this resolved your problem, could you please mark the answer as accepted and/or upvote it :) – randhash Jul 16 '18 at 15:33
  • Going to do both. I have another quick question, somewhat related. Using one-hot vectors for a space of 100 features is very wasteful in terms of memory...is there any reason why I couldn't create my own embedding where each value from 0-100 was converted into binary and stored as an array of length 7, i.e. `0=[0,0,0,0,0,0,0]`, `1=[1,0,0,0,0,0,0]`, ..., `100=[1,1,0,0,1,0,0]`? – AstroBen Jul 16 '18 at 18:07
  • I think there is no way to pass a sparse matrix to an RNN layer directly, so it definitely has to be one-hot encoded. Creating your own embedding for this is a bit overkill, I think. You might be better off writing a generator that does the encoding *on demand* and using it with the `fit_generator` method. It should even be sufficient to write a wrapper around `to_categorical` that encodes and yields one batch at a time, although `to_categorical` is not very efficient for this purpose for various reasons. This should reduce memory consumption drastically. – randhash Jul 17 '18 at 08:58
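A rough sketch of the on-demand encoding approach described in the last comment; the generator name, batch size, and num_classes value are illustrative assumptions, not from the original discussion:

import numpy as np
from keras.utils import to_categorical

def one_hot_batches(X, y, batch_size=128, num_classes=101):
    # Loop over the data indefinitely, one-hot encoding a single batch at a
    # time so the full (n_samples, 200, 101) array never sits in memory.
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            batch_X = to_categorical(X[start:start + batch_size],
                                     num_classes=num_classes)
            batch_y = np.asarray(y[start:start + batch_size])
            yield batch_X, batch_y

model.fit_generator(one_hot_batches(X, y, batch_size=128),
                    steps_per_epoch=int(np.ceil(len(X) / 128)),
                    epochs=20)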