I'm trying to build a decision tree, but I get this error when I create the confusion matrix on the last line:

Error : `data` and `reference` should be factors with the same levels

Here's my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried this, as suggested in another thread:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still have an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid
desertnaut
  • Just to follow up. In cases where you have a large data file, it's possible to create a sample data set that reproduces your problem. Here is some [guidance](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how people do that. Having the data makes it much easier for the community to help you. – xilliam Feb 26 '21 at 10:33
  • Thank you ! I will apply the guidance – Nicolas Duaut Feb 26 '21 at 10:58

2 Answers

Try keeping the factor levels of train and test the same as in df:

train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
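One likely reason for the original error is that the question converts `Customer type` with as.character(), so the reference passed to confusionMatrix() is a character vector while the prediction is a factor. The sketch below uses made-up labels ("Member" and "Normal" are assumptions, not the asker's data) and base R only, to show what aligning the levels looks like:

```r
# Hypothetical labels, for illustration only (not the asker's data)
reference <- c("Member", "Normal", "Member", "Normal")  # character, as after as.character()
predicted <- factor(c("Member", "Member", "Member", "Normal"),
                    levels = c("Member", "Normal"))

is.factor(reference)  # FALSE: the reference is not a factor yet

# Convert the reference to a factor that uses the prediction's levels
reference <- factor(reference, levels = levels(predicted))
identical(levels(predicted), levels(reference))  # TRUE

# A base R cross-tabulation now has matching rows and columns
table(Predicted = predicted, Actual = reference)
```

Once both vectors are factors with identical levels, the cross-tabulation is square, which is presumably what the table-based call in the question also needs.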
Ronak Shah
  • Thank you for your answer! I placed those lines after 'train = subset(df, sample==TRUE)' and 'test = subset(df, sample == FALSE)'. However, when I make my confusion matrix, I get 4 zeros and an accuracy of 'NaN'. – Nicolas Duaut Feb 26 '21 at 10:33

I made a toy data set and examined your code. There were a couple of issues:

  1. R has an easier time with variable names that contain no spaces. Your 'Customer type' variable has a space in it, which forces you to wrap it in backticks everywhere. In general, coding is easier when you avoid spaces, so I renamed it 'Customer_type'. For your data.frame you could rename the column in the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
  2. I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type).
  3. The documentation for sample.split() says the first argument, 'Y', should be a vector of labels, but your code passes the variable name as a string. Y needs one label per row, i.e. the outcome column itself, so pass df$Customer_type. In my example its levels are High, Med and Low; to see the levels of your variable you could use levels(df$Customer_type).
  4. Adjust the rpart() call as shown below: name the outcome column directly in the formula (Customer_type ~ .) instead of df$`Customer type`, so that it is looked up in the data argument.
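As a rough illustration of the first two points, here is a minimal base R sketch. The small data frame is a made-up stand-in for the imported spreadsheet; the values "Member" and "Normal" and the Total figures are assumptions for the example only:

```r
# Hypothetical stand-in for the imported spreadsheet (values are made up)
df <- data.frame(`Customer type` = c("Member", "Normal", "Member"),
                 Total = c(10.5, 20.0, 7.25),
                 check.names = FALSE)  # keep the space in the column name

# Point 1: replace the space in the column name with an underscore
names(df) <- gsub("Customer type", "Customer_type", names(df), fixed = TRUE)

# Point 2: store the outcome as a factor
df$Customer_type <- factor(df$Customer_type)

names(df)
levels(df$Customer_type)
```

After this, the column can be referred to as df$Customer_type or as Customer_type in a formula, without backticks.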

With these adjustments, your code might be OK.

# toy data
df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = T)),
                 Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = T)),
                 Gender = factor(sample(c("Female", "Male"), 100, replace = T)),
                 Quantity = sample(1:10, 100, replace = T),
                 Total = sample(1:10, 100, replace = T),
                 Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100),
                 Rating = factor(sample(1:5, 100, replace = T)))

library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)

#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # PASS THE LABEL COLUMN ITSELF (ONE LABEL PER ROW), NOT ITS NAME OR ITS LEVELS
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
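A quick sanity check on the split can catch this kind of problem early. The sketch below assumes caTools is installed and uses a made-up label vector (the names labels, in_train and in_train_wrong are only for illustration). It compares the mask produced from a column name given as a string with the mask produced from the label vector itself:

```r
library(caTools)

set.seed(123)

# Hypothetical outcome column with 100 rows (toy labels, not real data)
labels <- factor(rep(c("High", "Med", "Low"), times = c(40, 30, 30)))

# Passing the column *name* gives sample.split() a single string to split
in_train_wrong <- sample.split("Customer_type", SplitRatio = 0.70)
length(in_train_wrong) == length(labels)  # FALSE: the mask does not cover the rows

# Passing the label vector gives one TRUE/FALSE per row
in_train <- sample.split(labels, SplitRatio = 0.70)
length(in_train) == length(labels)        # TRUE

# Class counts in the training part, to confirm every class is represented
table(labels[in_train])
```

If the mask length does not match nrow(df), subset() recycles it, which may explain an empty test set and the zeros and NaN accuracy reported in the comments.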
xilliam
  • Many thanks for your reply. I did what you said. Before splitting/training the data I've extracted labels from data to create a vector of data labels like this : customerlabel <- c(df$Customer_type, recursive = FALSE, use.names = TRUE). I input it in the sample.split function : sample = sample.split(customerlabel, SplitRatio = .70), test$Customer_type). Even though the accuracy is only 0,47, it's fine, thank you! – Nicolas Duaut Feb 26 '21 at 14:20