I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to build a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.

I'm attempting this with Quanteda and have the following code:

library(quanteda)

bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)

It seems to work smoothly until predict(), which gives:

Error in newdata %*% log.lik : 
  requires numeric/complex matrix/vector arguments

Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!

Here is a link to the dataset.

Matt
  • You should provide enough data to make your example [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It likely has something to do with your data but since we can't see that it's impossible to say for sure. – MrFlick May 02 '16 at 03:57
  • @MrFlick I've edited the post to include a direct link to the .csv file. Is there any additional information I should be providing? New to this! – Matt May 02 '16 at 04:03
  • `newdata`, the second argument to `predict()`, cannot be a factor, which `testclass` is; it needs to be a dfm. See `??predict.textmodel_NB_fitted`. With `predict(bbcNb)` as your final line it should work, but it doesn't. Apparently there is a bug in the predict method when *k* > 2. Please file an issue at https://github.com/kbenoit/quanteda/issues. – Ken Benoit May 02 '16 at 23:38
  • Thanks @KenBenoit! If I wanted to keep the `newdata` argument for `predict()`, what would be the proper way to convert `testclass`? Would it be `testclass_dfm <- dfm(as.matrix(testclass))`? Doing so gives the following error using `predict()`: "Error in newdata %*% log.lik : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90" – Matt May 03 '16 at 00:59

1 Answer

As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:

library("quanteda")

text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)

all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
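
The effect of the `replace()` call is easier to see on a toy vector. Here is a minimal sketch in base R, using made-up labels and a hypothetical `n_train` cutoff standing in for the real 1780:

```r
# Toy labels standing in for docvars(bbc_corpus)$category (made-up data)
all_classes <- c("business", "sport", "tech", "sport", "business")
n_train <- 4  # hypothetical cutoff; the question's split uses 1780

# Keep the first n_train labels and mask the remaining (test) rows as NA
trainclass <- factor(replace(all_classes, (n_train + 1):length(all_classes), NA))

print(trainclass)
print(which(is.na(trainclass)))  # indices of the held-out rows
```

The rows left as `NA` are the ones the model does not see a label for during training.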

You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:

bbc_pred <- predict(bbcNb)
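
The error in the question mentions `newdata %*% log.lik`, which suggests that prediction multiplies a document-feature matrix by a matrix of log-likelihoods. That would explain why a factor cannot be passed as `newdata`. The following is a rough base R sketch of multinomial Naive Bayes on made-up counts; it is not quanteda's implementation, and all names in it (`counts`, `log_lik`, `scores`, and so on) are illustrative assumptions:

```r
# Toy document-feature counts (rows = documents, columns = features)
counts <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 3, 1,
    0, 2, 2,
    1, 0, 0,   # unlabeled
    0, 1, 1),  # unlabeled
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("profit", "goal", "match"))
)
labels <- factor(c("business", "business", "sport", "sport", NA, NA))
train <- !is.na(labels)

# Per-class feature totals with add-one (Laplace) smoothing
class_totals <- rowsum(counts[train, ], group = labels[train]) + 1
# Log-likelihood of each feature given each class (features x classes)
log_lik <- t(log(class_totals / rowSums(class_totals)))

# Log prior of each class, estimated from the labeled documents
class_counts <- table(labels[train])
log_prior <- log(as.numeric(class_counts) / sum(class_counts))

# Scoring is a matrix product, so the new data must be a numeric matrix
scores <- sweep(counts %*% log_lik, 2, log_prior, "+")
predicted <- colnames(scores)[max.col(scores, ties.method = "first")]
print(data.frame(doc = rownames(counts), predicted))
```

Because the score is computed for every row of `counts`, the unlabeled documents get a prediction in the same pass as the training documents, which mirrors calling `predict()` without a second argument.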

Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:

library(caret)

confusionMatrix(
    bbc_pred$docs$predicted[1781:2225],
    all_classes[1781:2225]
)
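
If you would rather not depend on caret, a cross-tabulation in base R gives the same raw counts. This is a small sketch on made-up vectors; `predicted` and `actual` are placeholders for the test-set predictions and true labels:

```r
# Made-up predictions and true labels for five test documents
predicted <- c("sport", "business", "sport", "tech", "business")
actual    <- c("sport", "business", "tech",  "tech", "business")

# Use a shared set of levels so the table is square
lv <- sort(unique(c(predicted, actual)))
cm <- table(
  predicted = factor(predicted, levels = lv),
  actual    = factor(actual, levels = lv)
)
print(cm)

# Overall accuracy: correctly classified documents over all documents
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

Forcing both vectors onto the same levels matters when a class never appears among the predictions; otherwise the table is not square and `diag()` no longer lines up with the correct classifications.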

However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:

docvars(bbc_corpus)$category <- factor(
    ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)

(note that this must be done before you extract all_classes from bbc_corpus above).

Ken Benoit
Adam Obeng