
I'm trying to train and test a multinomial Naive Bayes classifier on a dataset that has been split into train and test sets. After processing the data I have an array of messages and an array of labels. I'm trying to use .fit() and .predict() with this data, but it isn't working.

My data looks like:

emails = ['example mail', 'another example mail', ..]
labels = ['ham', 'spam', ..]

This is what I'm currently trying:

bayes = sklearn.naive_bayes.MultinomialNB().fit(emails, labels)

3 Answers


You need to do some more processing on your data before training your model; it won't work directly on raw strings. You can use any NLP library (I recommend spaCy, or NLTK with the Stanford tools) to process your data (e.g. tokenization, lemmatization, extracting the general idea of the paragraph). You could then encode the result, for example with one-hot encoding via pandas' get_dummies().
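For example, a minimal sketch of that idea using scikit-learn's CountVectorizer, which handles the tokenizing and counting in one step (the data here is just the toy data from the question):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ['example mail', 'another example mail']
labels = ['ham', 'spam']

# Map each string to a vector of per-word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB().fit(X, labels)

# Test data must go through the same fitted vectorizer
print(model.predict(vectorizer.transform(['another test mail'])))

CountVectorizer builds its vocabulary during fit_transform, which is why the same fitted instance must be reused at prediction time.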

Nadhem Maaloul

Please post complete info, e.g. the actual error

First off, sklearn's .fit() takes two arguments: 1) a 2D array of training data, and 2) a 1D array of labels/targets. So you need to reformat your input variable into a list of lists (even though you only have one predictor variable):

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
emails = [['example mail'], ['another example mail']]  # 2D: one row per email
labels = ['ham', 'spam']
model.fit(emails, labels)  # still fails: the feature is a string, not a number

Once this runs you will get another error:

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'int'

This is because you are passing words into a mathematical equation. As per the implementation detail linked in the documentation (https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html), you should be building numeric features out of the input texts (e.g. word counts, sentiment scores, binarized present/absent features, etc.).

If you properly format your data, e.g. generating a character count and a word count for each email, you may have something workable:

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
emails = [['example mail'], ['another example mail']]
# Replace each email with two numeric features: character count and word count
emails = [[len(x[0]), len(x[0].split())] for x in emails]
labels = ['ham', 'spam']
model.fit(emails, labels)

Then, if you pass similarly formatted test data, you will get an output:

emails = [['test mail'], ['another test mail']]
emails = [[len(x[0]), len(x[0].split())] for x in emails]
model.predict(emails)

Out[17]: array(['ham', 'spam'], dtype='<U4')

Since each test mail matches one of the training mails in word count (and is close in character count), you get the expected ham/spam output.

blakebjorn

No machine learning algorithm can process raw text as input or output; both must be numerical. You need to find a way to map both your inputs (emails) and outputs (labels) to numerical values in a way that preserves information.

About your labels: what you need to do is probably to map each label to a distinct integer and later use those to train a classifier:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded_labels = le.fit_transform(labels)  # e.g. ['ham', 'spam'] -> [0, 1]

Sklearn models can do this work themselves, though, so this step may not be necessary.

About your inputs, it can get a lot trickier. This is where your real problem is.

You could use one-hot encoding, but it would be a very bad idea: two mails could then only be either identical or different (and most of the time they will be different), no matter how similar they are, and your input would also be very high-dimensional.

word2vec (https://radimrehurek.com/gensim/models/word2vec.html) is a way to map words to vectors by analyzing how words are used together in your sentences. Unlike one-hot encoding, similar mails will share some words or sentence structures and will therefore be close in the latent space even if they are not exactly equal. This means that information can be preserved.

Since your sentences are your inputs, you then need a way to map each sentence to a vector, based on the word vectors. You could, for example, concatenate the word vectors of the words in the sentence and use zero-padding so that all sentences, even with different numbers of words, are mapped to vectors of the same length; a simpler common alternative is to average the word vectors, as sketched below.
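A rough sketch of the averaging variant, using gensim's Word2Vec on the toy mails from the question (the parameters here are illustrative only):

from gensim.models import Word2Vec
import numpy as np

emails = ['example mail', 'another example mail']
tokenized = [mail.split() for mail in emails]

# Train a small word2vec model (vector_size is named size in gensim < 4.0)
w2v = Word2Vec(sentences=tokenized, vector_size=20, min_count=1, seed=0)

# Map each mail to the average of its word vectors
features = np.array([np.mean([w2v.wv[word] for word in mail], axis=0)
                     for mail in tokenized])
print(features.shape)  # (2, 20)

Note that such vectors contain negative values, which MultinomialNB rejects; a classifier like GaussianNB (or rescaled features) would be needed downstream.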

But it may not work well, and it is not obvious how to map your sentences to vectors in a way that preserves information, even once you have mapped your words with word2vec. See this post: How to calculate the sentence similarity using word2vec model of gensim with python.

Another way would be to try the doc2vec implementation instead of word2vec. It is designed to map whole documents (not just words) to vectors, which is exactly what you want to do.
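A minimal sketch with gensim's Doc2Vec, again on the toy mails (parameters are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

emails = ['example mail', 'another example mail']
docs = [TaggedDocument(words=mail.split(), tags=[i]) for i, mail in enumerate(emails)]

# Train a tiny doc2vec model (vector_size is named size in gensim < 4.0)
model = Doc2Vec(documents=docs, vector_size=20, min_count=1, epochs=40, seed=0)

# One vector per training document (model.docvecs in gensim < 4.0)
train_vecs = [model.dv[i] for i in range(len(emails))]

# Unseen mails are embedded into the same space with infer_vector()
new_vec = model.infer_vector('another test mail'.split())

The resulting vectors can then be fed to a classifier in place of the raw strings.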

Ashargin