
After training my Keras model for the toxic comment challenge, its predictions are bad. I'm not sure whether I'm doing something wrong, because the accuracy during training was pretty good (~0.98).

How I trained

import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train = pd.read_csv('train.csv')


list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

inp = Input(shape=(maxlen, ))

embed_size = 128
x = Embedding(max_features, embed_size)(inp)
x = LSTM(60, return_sequences=True,name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

batch_size = 32
epochs = 2
print(X_t[0])
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.save("m.hdf5")

This is how I predict

from keras.models import load_model

model = load_model('m.hdf5')

list_sentences_train = np.array(["I love you Stackoverflow"])

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(X_t)

print(model.predict(X_t))

Output

[[ 1.97086316e-02 9.36032447e-05 3.93966911e-03 5.16672269e-04 3.67353857e-03 1.28102733e-03]]

today
BilalReffas
  • Can a single sample have multiple labels (i.e. is it a multi-label classification task?), for example both "toxic" and "threat"? – today May 26 '19 at 12:18
  • No not really @today – BilalReffas May 26 '19 at 12:22
  • Then you should not use `sigmoid` as the activation function of last layer and `binary_crossentropy` as the loss function. Instead use `softmax` and `categorical_crossentropy`. See [this answer](https://stackoverflow.com/a/51892084/2099607). – today May 26 '19 at 12:25
  • Thanks, but it's still strange. I always get values around this: [[ 0.68699586 0.00641587 0.13240167 0.00581519 0.15096234 0.01740919]] @today – BilalReffas May 26 '19 at 13:13
  • And what's strange about that exactly? Did you mean for **totally different** samples you get the **same prediction**? – today May 26 '19 at 13:34
  • It doesn't matter what the input is. The output is always around these example values. This is btw where I have the code from https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras/data @today – BilalReffas May 26 '19 at 13:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193948/discussion-between-bilalreffas-and-today). – BilalReffas May 26 '19 at 14:03
  • Oh, good that you provided the link to the original problem. This is a multi-label problem per the definition provided in the challenge. And therefore, your original solution is correct. And also there is no problem with the prediction you get for the `I love you Stackoverflow` input with your original model: since all the values (i.e. probabilities) are near zero, then it means none of the labels are present in the given input, which is absolutely expected and correct. – today May 26 '19 at 14:05
  • Yeah, that's true, but changing it to "I will kill you" gives me the following result: [[ 1.24361124e-02 1.38680343e-05 1.83900306e-03 1.23052538e-04 1.84579729e-03 6.02254237e-04]]. So I don't understand what I'm doing wrong @today – BilalReffas May 26 '19 at 14:07
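
The sigmoid versus softmax point raised in the comments above can be illustrated with a small sketch. The logits below are made-up values, not outputs of the model in the question; they only show that sigmoid scores each label independently (so all six can be near zero), while softmax forces the six values to sum to 1, so one of them has to look large even when no label fits.

```python
import math

def sigmoid(logits):
    # Independent per-label probabilities: suited to multi-label problems
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    # Probabilities that compete with each other and sum to 1
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-activation values for the six labels (assumed for illustration)
logits = [-4.0, -9.0, -5.5, -7.5, -5.6, -6.7]

sig = sigmoid(logits)
soft = softmax(logits)

print("sigmoid:", [round(p, 4) for p in sig])
print("softmax:", [round(p, 4) for p in soft])
print("softmax sum:", round(sum(soft), 6))
```

With sigmoid, "none of the labels apply" is a valid answer; with softmax it cannot be expressed, which is one plausible reason the softmax variant always reported a dominant first class.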

1 Answer


In the inference (i.e. prediction) phase, you must apply the same pre-processing steps you used when training the model. So you should not create a new Tokenizer instance and fit it on your test data: a freshly fitted tokenizer assigns word indices based only on the new text, and those indices no longer match the ones the Embedding layer was trained on. If you want to predict later with the same model, then besides the model you must also save everything you derived from the training data, such as the vocabulary held in the Tokenizer instance. It would look like this:

import pickle

# building and training of the model as you have done ...

# store everything we need later: the model and the fitted tokenizer
model.save("m.hdf5")
with open('tokenizer.pkl', 'wb') as handler:
    pickle.dump(tokenizer, handler)

And now in prediction phase:

import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')
with open('tokenizer.pkl', 'rb') as handler:
    tokenizer = pickle.load(handler)

list_sentences_train = ["I love you Stackoverflow"]

# use the same tokenizer instance you used in the training phase
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(model.predict(X_t))
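
To see why refitting matters, here is a minimal sketch that needs no Keras at all. `fit_word_index` and `texts_to_sequences` are hypothetical, simplified stand-ins for `Tokenizer.fit_on_texts` and `Tokenizer.texts_to_sequences` (they only lowercase and split on whitespace, and rank words by frequency), and the training sentences are made up. The point is only that the same sentence maps to different integers depending on which texts the vocabulary was fitted on.

```python
from collections import Counter

def fit_word_index(texts):
    # Simplified stand-in for Tokenizer.fit_on_texts:
    # more frequent words get smaller indices, starting at 1
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(word_index, texts):
    # Words missing from the vocabulary are silently dropped,
    # which mirrors Keras' default behaviour when no oov_token is set
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Made-up "training" corpus (assumption, for illustration only)
train_texts = ["you are great", "i will kill you", "you are awful", "i love cake"]
train_index = fit_word_index(train_texts)

test_texts = ["i will kill you"]
refit_index = fit_word_index(test_texts)  # the mistake: fitting on test data

seq_train_vocab = texts_to_sequences(train_index, test_texts)
seq_refit_vocab = texts_to_sequences(refit_index, test_texts)

print("training vocabulary:", seq_train_vocab)
print("refitted vocabulary:", seq_refit_vocab)
```

The two printed sequences differ. The model's Embedding layer only knows the training indices, so the refitted sequence effectively feeds it a different, unrelated sentence, which would explain predictions that barely react to the input.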
today