
I recently built a bag-of-words model: I created a document-term matrix (removing terms with more than 96% sparsity) and trained a decision tree on it to predict whether a sentence is important or not. The model performed really well on the test dataset, but when I apply it to an out-of-sample dataset, it is not able to predict; instead it gives an error.

Here's the model I built in R:

library(caTools)
library(tm)
library(rpart)
library(rpart.plot)
library(ROCR)

data <- read.csv('comments.csv', stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(data$Word))

# Pre-process data
# tolower() is not a tm transformation, so wrap it in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument)

# Create matrix
dtm <- DocumentTermMatrix(corpus)

# Remove sparse terms (keep only terms with at most 96% sparsity)
dtm <- removeSparseTerms(dtm, 0.96)

# Create data frame
labeledTerms <- as.data.frame(as.matrix(dtm))

# Add in the outcome variable
labeledTerms$IsImp = data$IsImp 

#Splitting into train and test data using caTools

set.seed(144)

spl <- sample.split(labeledTerms$IsImp, 0.60)

train <- subset(labeledTerms, spl == TRUE)
test <- subset(labeledTerms, spl == FALSE)

#Build CART Model
CART <- rpart(IsImp ~ ., data = train, method = "class")

This works totally fine on the testing dataset, with around 83% accuracy. However, when I use this CART model to predict on an out-of-sample dataset, it gives me an error.

head(train)
terms A B C D E F..............(n terms)
Freqs 0 1 2 1 3 0..............(n terms)

head(test)
terms A B C D E F..............(n terms)
Freqs 0 0 1 1 1 0..............(n terms)


data_random <- read.csv('comments_random.csv', stringsAsFactors = FALSE)

head(data_random)
terms A B D E F H..............(n terms)
Freqs 0 0 1 1 1 0..............(n terms)

The error I get is "can't find C" in data_random. I don't know what I should do to make this work. Is Laplace smoothing the way to go here?
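(For context, the failure is easy to reproduce on a tiny example: rpart's predict() evaluates every variable in the training formula, so a column that is missing from the new data raises an error even if the tree never actually splits on it. All the data below is made up for illustration.)

```r
library(rpart)

# Stand-in training data with three "term" columns A, B, C
train <- data.frame(A = c(0, 1, 0, 1), B = c(1, 0, 1, 0), C = c(2, 0, 2, 0),
                    IsImp = factor(c(1, 0, 1, 0)))
CART <- rpart(IsImp ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 2))

# New data that lacks column C, as in the question
data_random <- data.frame(A = 0, B = 1)

# predict() needs every formula variable, so this fails with
# an error along the lines of "object 'C' not found"
res <- try(predict(CART, newdata = data_random, type = "class"), silent = TRUE)
```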

    This error is not [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) because we do not have "comments.csv". See the inline link for tips on creating complete, minimal reproducible examples so it's easier to help you. – MrFlick Oct 08 '14 at 19:49

2 Answers


The problem is that C is part of your training set, so the model takes it into account. That means that to make a prediction on any dataset, a value for C must be present.

Your out-of-sample data has no C. You need to add a column of zeros for C, saying that there are no occurrences of C in that data.
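A minimal sketch of that fix, using toy frames in place of the question's `train` and `data_random` (column names assumed):

```r
# Toy stand-ins for the question's train / data_random frames
train <- data.frame(A = 0, B = 1, C = 2, D = 1, IsImp = 1)
data_random <- data.frame(A = 0, B = 0, D = 1, H = 3)

# Terms the model was trained on but the new data lacks (outcome excluded)
missing_terms <- setdiff(setdiff(names(train), "IsImp"), names(data_random))

# Add them as all-zero columns: "this term never occurs in the new data"
data_random[missing_terms] <- 0

# Terms the model never saw (here H) can simply be dropped,
# and the columns reordered to match the training data
data_random <- data_random[, setdiff(names(train), "IsImp")]
```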

Felix

It is good that this "error" is addressed, because, as @Felix suggests, it occurs simply because you are lacking a variable in the prediction dataset. The error is therefore rather mundane, and correcting it has nothing to do with Laplace corrections and the like. You simply need to make sure that you have the same variables in your training dataset AND your prediction dataset. This can, e.g., be checked with:

names(trainingdata) %in% names(predictiondata)

... And some additional code

Now, the reason I find the error interesting is that it touches on a fundamental question of how to approach the modelling of text data. If you simply add the missing variables (i.e. C) to the prediction data and fill the cells with zeroes, you get a completely redundant variable that only takes up space and memory. So you might as well drop the variable from the TRAINING DATA instead of adding it to the prediction data.

However, the better way to approach the problem is to generate the bag-of-words from both the training data and the prediction data together, and only afterwards split the result into a training set and a prediction set. This takes care of your problem AND is at the same time more theoretically "correct", because the bag-of-words is then based on a larger proportion of the total population of samples (i.e. texts).
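A sketch of that approach using the tm calls from the question (the texts are made up for illustration):

```r
library(tm)

# Made-up labelled comments and out-of-sample comments
labelled <- c("alpha beta gamma", "beta delta")
unseen <- c("alpha delta epsilon")

# Build ONE document-term matrix over both sets...
corpus <- Corpus(VectorSource(c(labelled, unseen)))
dtm <- as.data.frame(as.matrix(DocumentTermMatrix(corpus)))

# ...then split it back apart: both frames now share exactly the same columns
n <- length(labelled)
train_terms <- dtm[seq_len(n), ]
new_terms <- dtm[-seq_len(n), ]
```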

That is my take on it. I hope it helps!

Kasper Christensen