I'm trying to do semi-supervised learning for sentiment analysis using naive Bayes in MATLAB. The data I'm using is a set of IMDb reviews, each labelled as either positive or negative according to the sentiment of the review. I have loaded the data, cleaned it, and separated it into training and test sets. I'm trying to use the fitcnb function to train the multiclass naive Bayes model. However, I keep getting errors because the variables I'm passing to the function seem to be wrong. My code is below.
% Load the 'IMBD_reviews.csv' dataset into a data table.
data = readtable('IMBD_reviews.csv');
% Extract the movie reviews from the data table.
reviews = data{:, 'review'};
sentimentLabels = categorical(data{:, 'sentiment'}, {'negative', 'positive'});
% Clean up the data by removing unwanted characters, punctuation, and stop words.
cleanTextData = lower(reviews); % Convert the movie reviews to lowercase.
documents = tokenizedDocument(cleanTextData); % Tokenize the movie reviews.
documents = erasePunctuation(documents); % Erase punctuation.
documents = removeStopWords(documents); % Remove stop words.
% Separate data into training and test sets.
cv = cvpartition(size(documents, 1), 'HoldOut', 0.2); % 20% of the data will be used for testing.
idxTrain = training(cv); % Indices for the training set.
idxTest = test(cv); % Indices for the testing set.
documentsTrain = documents(idxTrain);
sentimentLabelsTrain = sentimentLabels(idxTrain);
documentsTest = documents(idxTest);
sentimentLabelsTest = sentimentLabels(idxTest);
% Convert to bag of words.
bag = bagOfWords(documentsTrain);
count = 3;
bag = removeInfrequentWords(bag, count);
% Train the Naive Bayes classifier.
nb = fitcnb(bag, sentimentLabelsTrain);
% Predict sentiment of test data using trained classifier.
predictedLabels = predict(nb, documentsTest);
% Evaluate performance of classifier using the testing set.
accuracy = sum(predictedLabels == sentimentLabelsTest) / numel(sentimentLabelsTest);
% To do:
% - Use the testing set to evaluate the performance of the classifier.
% - Combine the labelled and predicted data and use them to train a new classifier (bootstrapping).
% - Repeat the steps above a couple of times.
% - Evaluate the final classifier's performance on the testing set.
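The to-do steps above describe a self-training (bootstrapping) loop. Here is a minimal sketch of how that loop might look, under several assumptions: the matrices XLabeled and XUnlabeled are hypothetical toy word-count matrices rather than the real review data, and numRounds and threshold are illustrative values that are not taken from the code above.

```matlab
% Toy stand-ins (hypothetical): rows are documents, columns are word counts.
rng(0);
XLabeled   = [randi([1 5], 20, 4), randi([0 1], 20, 4);   % "negative" documents
              randi([0 1], 20, 4), randi([1 5], 20, 4)];  % "positive" documents
YLabeled   = categorical([repmat("negative", 20, 1); repmat("positive", 20, 1)]);
XUnlabeled = [randi([1 5], 30, 4), randi([0 1], 30, 4);
              randi([0 1], 30, 4), randi([1 5], 30, 4)];

numRounds = 3;    % assumed number of self-training rounds
threshold = 0.9;  % assumed posterior cutoff for accepting a pseudo-label

for r = 1:numRounds
    if isempty(XUnlabeled), break; end

    % Train on everything labelled so far (true labels plus pseudo-labels).
    mdl = fitcnb(XLabeled, YLabeled, 'DistributionNames', 'mn');

    % Predict the unlabeled documents and keep only confident predictions.
    [pseudoLabels, posterior] = predict(mdl, XUnlabeled);
    confident = max(posterior, [], 2) >= threshold;
    if ~any(confident), break; end

    % Move the confidently predicted documents into the labelled pool.
    XLabeled = [XLabeled; XUnlabeled(confident, :)];
    YLabeled = [YLabeled; pseudoLabels(confident)];
    XUnlabeled(confident, :) = [];
end

% Final classifier trained on the enlarged labelled pool.
finalMdl = fitcnb(XLabeled, YLabeled, 'DistributionNames', 'mn');
```

Keeping only predictions above a confidence cutoff is one common design choice, since it limits how many wrong pseudo-labels get fed back into training. The final model would then be evaluated on the held-out test set exactly as the first classifier is.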
The error usually says that the X variable needs to be a numeric matrix; however, I thought the bag-of-words method already produces one.
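One possible reading of the error: bagOfWords returns a model object, not a matrix, and the word counts themselves live in its Counts property as a sparse documents-by-words matrix. The sketch below shows that idea on a tiny made-up corpus. The toy documents and labels are hypothetical, the 'mn' (multinomial) distribution is an assumed choice for count data, and the full() conversion assumes fitcnb wants a non-sparse matrix; it can be dropped if sparse input turns out to be accepted.

```matlab
% Toy corpus (hypothetical) standing in for the tokenized reviews.
documentsTrain = tokenizedDocument([
    "great movie loved acting"
    "wonderful film great story"
    "terrible movie hated acting"
    "awful film boring story"]);
labelsTrain = categorical(["positive"; "positive"; "negative"; "negative"]);
documentsTest = tokenizedDocument([
    "loved wonderful story"
    "boring terrible film"]);

% The bag-of-words model is an object; its Counts property is the numeric matrix.
bag = bagOfWords(documentsTrain);
XTrain = full(bag.Counts);   % documents-by-vocabulary count matrix

% Train on the count matrix rather than on the bagOfWords object.
nb = fitcnb(XTrain, labelsTrain, 'DistributionNames', 'mn');

% Encode the test documents against the same vocabulary before predicting.
XTest = full(encode(bag, documentsTest));
predictedLabels = predict(nb, XTest);
```

The same pattern would apply to the prediction step: predict expects a numeric matrix with the same columns as the training data, so the test documents need to be encoded with the training bag first. On the full review dataset the full() conversion could use a lot of memory, so trimming the vocabulary beforehand (as removeInfrequentWords already does) may matter.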