
I am working on a way to classify mail using Keras. I read the mails that have already been classified and tokenize them to build a dictionary that is linked to a folder.

So I created a dataframe with pandas:

data = pd.DataFrame(list(zip(lst, lst2)), columns=['text', 'folder'])

The text column holds all the words present in an email, and the folder column is the class (the path) that the email belongs to.

With that I built my model, which gives me these results:

3018/3018 [==============================] - 0s 74us/step - loss: 0.0325 - acc: 0.9950 - val_loss: 0.0317 - val_acc: 0.9950

over 100 epochs.

The evaluation of my model:

755/755 [==============================] - 0s 28us/step Test score: 0.0316697002592071 Test accuracy: 0.995000006268356

So the last thing I need to do is predict the class of a random mail, but the model.predict_classes(numpy.array) call gives me a 2D array full of integers, and I still don't know which "folder/class" the mail belongs to.

Here is my code:

# assumed imports for this snippet (Keras 2.x standalone API, scikit-learn, pandas)
import pandas as pd
import unidecode
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from sklearn import preprocessing

# lst contains all the words in the mails
# lst2 contains the class/path for each entry of lst
data = pd.DataFrame(list(zip(lst, lst2)), columns=['text', 'folder'])

train_size = int(len(data) * .8)
train_posts = data['text'][:train_size]
train_tags = data['folder'][:train_size]

test_posts = data['text'][train_size:]
test_tags = data['folder'][train_size:]

num_labels = 200  # the total number of classes

#the way I tokenize and encode my data
tokenizer = Tokenizer(num_words=len(lst))
tokenizer.fit_on_texts(pd.concat([train_posts, test_posts], axis = 1))

x_train = tokenizer.texts_to_matrix(train_posts, mode=TOKENISER_MODE)
x_test = tokenizer.texts_to_matrix(test_posts, mode=TOKENISER_MODE)

encoder = preprocessing.LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)

# my model; vocab_size = len(lst) = the number of words present in the mails
model = Sequential()
model.add(Dense(16, input_shape=(vocab_size,)))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(32))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(16))
model.add(Activation('elu'))
model.add(Dropout(0.2))
model.add(Dense(num_labels))
model.add(Activation('sigmoid'))
model.summary()

#compile training and evaluate
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=100, verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

#read the random file
sentences = read_files("mail.eml")
sentences = ' '.join(sentences)
sentences = sentences.lower()
salut = unidecode.unidecode(sentences)

#predict
pred = model.predict_classes(salut, batch_size=batch_size, verbose=1)
print(pred)

The actual output of pred:

[125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125
 ... (the same value repeated for every entry) ...
 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125]

I don't know why, but every time I run it the output is full of the same number. The output I am looking for is:

['medecine/AIDS/', 'help/', 'project/classification/']

sorted by the probability of being the right one. The read_files call is just a function that reads the mail and returns a list of all the words present in it.

Is there a way to obtain the class of the mail with model.predict_classes(), or do I need to use something else?

  • An array full of `125` certainly signifies that something is *very* wrong - how can you take out anything meaningful from such an output? How many classes do you have (i.e. what is `num_labels`)? Is this a standard multi-class classification problem (i.e. samples can belong to one class only)? – desertnaut Aug 07 '19 at 12:32
  • num_labels, which is equal to 200, is the number of classes that I have – Dimitri Felix Aug 07 '19 at 12:37
  • Please clarify (as already asked above) if you are in a multi-class setting (a sample can belong to one class only) or a multi-label one (a sample can belong to more than one class simultaneously). – desertnaut Aug 07 '19 at 12:52
  • to one class only – Dimitri Felix Aug 07 '19 at 13:07
  • So you are **not** in a multi-label setting; my answer then holds... – desertnaut Aug 07 '19 at 13:08

2 Answers


Is there a way to obtain the class of the mail with model.predict_classes() or do I need to use something else?

Arguably, there are much more severe issues with your code, as should already be apparent from the output pred.

For starters, and assuming that you are in a multi-class setting (each sample can belong to one class only), and not in a multi-label one (where each sample can belong to more than one class simultaneously):

  1. Since you are in a multi-class setting, the activation of your last layer should be softmax, and not sigmoid; so, you should change it to

    model.add(Activation('softmax'))
    
  2. Similarly, in your model compilation, you should ask for loss='categorical_crossentropy', and not binary_crossentropy (which is for binary classification problems); so

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    

Most probably, the high accuracies you get during training are flukes, and they do not reflect reality. This is known behavior in Keras when one erroneously uses the binary_crossentropy loss with multi-class data - for details, see my answer in Keras binary_crossentropy vs categorical_crossentropy performance?.
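
A quick way to sanity-check the reported accuracy after retraining with the changes above (a minimal sketch, assuming the x_test, y_test, and model variables from your own code) is to compare the argmax of the predicted probabilities with the argmax of the one-hot encoded labels:

import numpy as np

# probability distribution over the num_labels classes, shape (n_samples, num_labels)
y_prob = model.predict(x_test)

# "real" multi-class accuracy: fraction of samples whose most probable
# class matches the true (one-hot encoded) class
real_acc = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_test, axis=1))
print('Real test accuracy:', real_acc)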

After you have done the above, please open a new question with the new situation if you still have issues - as already implied, not being able to get the classes is the least of your problems right now.
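
For completeness, once the above is fixed, mapping the predictions back to folder names is straightforward, because LabelBinarizer stores the class names in its classes_ attribute. A minimal sketch (assuming the encoder from your code, and a new mail already vectorized with the same tokenizer used for training into a hypothetical x_new of shape (1, vocab_size)):

import numpy as np

probs = model.predict(x_new)[0]   # shape (num_labels,)

# single most probable folder
print(encoder.classes_[np.argmax(probs)])

# top-3 folders sorted by probability, e.g. ['medecine/AIDS/', 'help/', 'project/classification/']
top3 = np.argsort(probs)[::-1][:3]
print([encoder.classes_[i] for i in top3])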

desertnaut
  • Oh, I thought the OP is doing multi-label classification?!!! I wrote my answer with that assumption! – today Aug 07 '19 at 12:50
  • @today not quite sure myself (although I asked in the comments) - stby, I'm asking again to clarify... – desertnaut Aug 07 '19 at 12:51
  • I want to know all the labels that may correspond to the mail. I tried changing to softmax and categorical_crossentropy, and it seems my old values were just garbage; I barely have 1% now – Dimitri Felix Aug 07 '19 at 12:58
  • @DimitriFelix not unexpected, but please answer the **specific** question in my last comment above – desertnaut Aug 07 '19 at 12:59
  • If you are talking about the fact that I want to do multi-label classification, the answer is yes; I'm very new to this kind of thing so I don't know the specific vocabulary – Dimitri Felix Aug 07 '19 at 13:02
  • So if I wanted to have multiple labels on my samples, I should have used 'softmax' and 'categorical_crossentropy', that's what you mean – Dimitri Felix Aug 07 '19 at 13:11
  • No, you should do that if you **don't** have multiple labels (as you do here) – desertnaut Aug 07 '19 at 13:13

Note for future readers experiencing the same problem: Read @desertnaut's answer if each sample can belong to only one of the classes (e.g. either "cat" or "dog", not both). Otherwise (i.e. it is multi-label classification), read my answer.


The predict_classes method is used for classification models where there is only one true class for each sample. However, it seems your model is a multi-label classification model (i.e. each sample may belong to zero, one or multiple classes). Therefore, to find the predicted labels you need to threshold the predicted values at 0.5 (the output values are probabilities, and if one is high enough, i.e. > 0.5, we can consider the corresponding class to be present in the input sample):

# this gives probability values, an array of shape (n_samples, n_labels)
preds_prob = model.predict(salut)

# this gives the name of classes with prob > 0.5
preds_cls = encoder.inverse_transform(preds_prob, 0.5)
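
Note that model.predict expects the same bag-of-words matrix the model was trained on, not the raw string salut. A minimal sketch of the missing preprocessing step, reusing the tokenizer and the OP's own TOKENISER_MODE constant from the question (x_new is a hypothetical name):

# turn the raw mail text into the same bag-of-words encoding used for training;
# texts_to_matrix expects a list of texts, hence the [salut]
x_new = tokenizer.texts_to_matrix([salut], mode=TOKENISER_MODE)

preds_prob = model.predict(x_new)                       # shape (1, n_labels)
preds_cls = encoder.inverse_transform(preds_prob, 0.5)  # class names with prob > 0.5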
today