
I'm having trouble with an LSTM model in Keras.

I'm trying to predict whether domain names are normal or fake.

My dataset is like this:

domain,fake
google,0
bezqcuoqzcjloc,1
...

with 50% normal and 50% fake domains

Here's my LSTM model:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation
from keras import optimizers

def build_model(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
    model.add(LSTM(64))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['acc'])

    return model
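For reference, with the values computed below (the comments mention `max_features = 39` at training time, and `maxlen = 100` is set in the preprocessing code), the model can be built and inspected like this:

model = build_model(39, 100)  # values taken from the question and comments
model.summary()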

Then I preprocess the text data to turn it into integer sequences:

"""Run train/test on logistic regression model"""
indata = data.get_data()

# Extract data and labels
X = [x[1] for x in indata]
labels = [x[0] for x in indata]

# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}

max_features = len(valid_chars) + 1
maxlen = 100

# Convert characters to int and pad
X = [[valid_chars[y] for y in x] for x in X]
X = sequence.pad_sequences(X, maxlen=maxlen)

# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels]
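As a quick illustration (toy data, not from the question) of how the mapping and padding behave; index 0 is reserved for padding, which is why the dictionary starts at 1 and `max_features` is `len(valid_chars) + 1`:

from keras.preprocessing import sequence

toy = ['google', 'abc']
# sorted() is used here only so the example is reproducible
chars = {c: i + 1 for i, c in enumerate(sorted(set(''.join(toy))))}
# -> {'a': 1, 'b': 2, 'c': 3, 'e': 4, 'g': 5, 'l': 6, 'o': 7}
encoded = [[chars[c] for c in s] for s in toy]
print(sequence.pad_sequences(encoded, maxlen=8))
# [[0 0 5 7 7 5 6 4]     rows are left-padded with zeros
#  [0 0 0 0 0 1 2 3]]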

Then I split my data into training, testing and validation sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("Build model...")
model = build_model(max_features, maxlen)

print("Train...")
X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.2)

Then I train the model on the training set, using the holdout set for validation, and evaluate on the test set:

history = model.fit(X_train, y_train, epochs=max_epoch, validation_data=(X_holdout, y_holdout), shuffle=False)

scores = model.evaluate(X_test, y_test, batch_size=batch_size)
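The two numbers reported below presumably come from unpacking `scores`; with `metrics=['acc']`, `model.evaluate` returns the loss followed by the accuracy:

print('loss =', scores[0])
print('accuracy =', scores[1])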

At the end of training I have good results:

[training history screenshot not reproduced]

And these scores when evaluating on the test dataset:

loss = 0.060554939906234596
accuracy = 0.978109902033532

However, when I predict on a sample of the dataset like this:

import pickle
from keras.models import load_model
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split

LSTM_model = load_model('LSTMmodel_64_sgd.h5')
data = pickle.load(open('traindata.pkl', 'rb'))

#### LSTM ####

"""Run train/test on logistic regression model"""

# Extract data and labels
X = [x[1] for x in data]
labels = [x[0] for x in data]

X1, _, labels1, _ = train_test_split(X, labels, test_size=0.9)

# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}

max_features = len(valid_chars) + 1
maxlen = 100

# Convert characters to int and pad
X1 = [[valid_chars[y] for y in x] for x in X1]
X1 = sequence.pad_sequences(X1, maxlen=maxlen)

# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels1]

y_pred = LSTM_model.predict(X1)

I have very poor performance:

accuracy = 0.5934741842730341
confusion matrix = [[25201 14929]
                    [17589 22271]]
F1-score = 0.5780171295094731
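For reference, metrics like these would typically be computed along these lines (a sketch assuming scikit-learn and a 0.5 decision threshold; this step is not shown in the question):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_pred_labels = (y_pred > 0.5).astype(int).ravel()  # sigmoid outputs -> hard 0/1 labels
print(accuracy_score(y, y_pred_labels))
print(confusion_matrix(y, y_pred_labels))
print(f1_score(y, y_pred_labels))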

Can someone explain why? I have tried 64 instead of 128 units in the LSTM layer, Adam and RMSprop as optimizers, and increasing the batch size, but performance remains very low.

  • Can you please share the values for `max_features` both time you calculate it (i.e. before fitting the model, and before predicting)? – desertnaut Aug 31 '18 at 15:32
  • @desertnaut I have `max_features = 39` before fitting and `max_features = 38` before predicting. – Laure D Aug 31 '18 at 15:38
  • Thanks; and roughly how many samples do you have in `X` (before splitting)? – desertnaut Aug 31 '18 at 15:55
  • What is your batch size currently? If you are getting very good accuracy in sample and poor accuracy out of sample, it may be due to overfitting, in which case you want to decrease your batch size. – Alerra Aug 31 '18 at 16:19
  • @desertnaut before splitting I have ~800,000 samples. – Laure D Aug 31 '18 at 18:00
  • @Alerra batch_size is 640. I had 128 before, with the same kind of results. I do suspect overfitting, but I don't know how to stop it. – Laure D Aug 31 '18 at 18:00
  • Try lowering your batch size to 32. Although it may seem counter-intuitive, lowering batch_size can actually increase the accuracy of your model for test data at the expense of accuracy of your training data. Doing this, your accuracy on the train data will probably go down, but your accuracy on the test data (which is the one that shows that learning has taken place) should go up (again, this assumes that your code is accurate.) – Alerra Aug 31 '18 at 18:04
  • This question explains this well: https://stats.stackexchange.com/questions/185911/why-are-bias-nodes-used-in-neural-networks – Alerra Aug 31 '18 at 18:16
  • @Alerra I tried with `batch_size = 32` and I still get poor performance. – Laure D Sep 03 '18 at 09:43

1 Answer


OK, so I have found the answer.

It is this line:

valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}

In Python 3, iterating over a `set` of strings can produce a different order every time a new Python 3 console is opened (string hashing is randomized between interpreter runs), so the character-to-index mapping built here at prediction time does not match the one the model was trained with.

Running the code in Python 2 resolved my issue!
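A more robust fix than switching interpreters (see also the comment below) is to make the mapping deterministic and reuse the training-time vocabulary at prediction time. A minimal sketch, assuming a hypothetical `valid_chars.pkl` file:

import pickle

# At training time: build a deterministic mapping and persist it.
valid_chars = {c: i + 1 for i, c in enumerate(sorted(set(''.join(X))))}
with open('valid_chars.pkl', 'wb') as f:
    pickle.dump(valid_chars, f)

# At prediction time: load the same mapping instead of rebuilding it
# from a subsample, so characters always map to the same integers.
with open('valid_chars.pkl', 'rb') as f:
    valid_chars = pickle.load(f)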

  • A [`set`](https://docs.python.org/2/library/sets.html) is an _unordered_ collection of unique elements, so yes, that would make sense. Running it in Py2 and relying on what is most likely an implementation detail, however, does not. You could do e.g. [this](https://stackoverflow.com/a/13902835/4316405) instead. – Nelewout Sep 14 '18 at 09:21