I have an error that I can't figure out. From searching, I gather that it usually means the model and the cross-validation set do not have data with the same levels, but I want to understand exactly how that applies to my use case. I am building a QDA model to predict a vehicle's country of origin from numeric scores. The code should run for anyone, since the data is in a public Google Sheets document. Those of you who follow Doug DeMuro on YouTube may find this a tad interesting.

#load dataset into r
library(gsheet)
url = 'https://docs.google.com/spreadsheets/d/1KTArYwDWrn52fnc7B12KvjRb6nmcEaU6gXYehWfsZSo/edit'
doug_df = read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE,header=FALSE)

#begin cleanup. remove first blank rows of data
doug_df = doug_df[-c(1,2,3), ]
attach(doug_df)

#name columns appropriately
names(doug_df) = c("year","make","model","styling","acceleration","handling","fun factor","cool factor","total weekend score","features","comfort","quality","practicality","value","total daily score","dougscore","video duration","filming city","filming state","vehicle country")

#remove categorical columns and columns not used for discriminant analysis, including the totals columns
library(dplyr)
doug_df = doug_df %>% dplyr::select (-c(make,model,`total weekend score`,`total daily score`,dougscore,`video duration`,`filming city`,`filming state`))

#convert from character to numeric
num.cols <- c("year","styling","acceleration","handling","fun factor","cool factor","features","comfort","quality","practicality","value")
doug_df[num.cols] <- sapply(doug_df[num.cols], as.numeric)
`vehicle country` = as.factor(`vehicle country`)

#create a new column to reflect groupings for response variable
doug_df$country.group=ifelse(`vehicle country`=='Germany','Germany',
                          ifelse(`vehicle country`=='Italy','Italy',
                                 ifelse(`vehicle country`=='Japan','Japan',
                                        ifelse(`vehicle country`=='UK','UK',
                                               ifelse(`vehicle country`=='USA','USA','Other')))))

#remove the initial country column
doug_df = doug_df %>% dplyr::select (-c(`vehicle country`))

#QDA with multiple predictors
library(MASS)
qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value,data=doug_df)

#predict using model and compute error
n=dim(doug_df)[1]
fittedclass = predict(qdafit1,newdata=doug_df)$class
table(doug_df$country.group,fittedclass)
Error = sum(doug_df$country.group != fittedclass)/n; Error

#conduct 10-fold cross-validation
allpredictedCV1 = rep("NA",n)

cvk = 10
groups = c(rep(1:cvk,floor(n/cvk)))
set.seed(4)
cvgroups = sample(groups,n,replace=TRUE)

for (i in 1:cvk)  {
  qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value,data=doug_df,subset=(cvgroups!=i))
  newdata1i = data.frame(doug_df[cvgroups==i,])
  allpredictedCV1[cvgroups==i] = as.character(predict(qdafit1,newdata1i)$class)
}
table(doug_df$country.group,allpredictedCV1)
CVmodel1 = sum(allpredictedCV1!=doug_df$country.group)/n; CVmodel1

The cross-validation part at the end of the code throws this error:

Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')

Can someone explain in more depth what is happening? My guess is that the variable 'fun factor' doesn't have the same levels in each cross-validation fold as it did when the model was fit. I'd also like to know my options for fixing it. Thanks in advance!
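
One thing that may be worth ruling out before looking at levels: the predictors here are all numeric, so they have no factor levels, but data.frame() by default (check.names = TRUE) rewrites non-syntactic column names, which would turn 'fun factor' into 'fun.factor' inside newdata1i. The sketch below uses a made-up toy data frame rather than the sheet itself (toy, wrapped and direct are hypothetical names), and whether this fully accounts for the error above is an assumption.

```r
# Toy data frame with a non-syntactic column name (made-up values).
toy <- data.frame(styling = c(5, 7, 8), check.names = FALSE)
toy[["fun factor"]] <- c(3, 6, 9)

# Wrapping a subset in data.frame() applies make.names() to the columns.
wrapped <- data.frame(toy[1:2, ])

# Plain subsetting leaves the column names untouched.
direct <- toy[1:2, , drop = FALSE]

names(wrapped)
names(direct)
```

If that is what is happening, predict() would not find 'fun factor' in the new data and would fall back to any object of that name visible elsewhere (for example through attach()), which could explain a length mismatch rather than a levels problem.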

EDIT

In addition to the above, I get a very similar error when I try to predict a dummy car review.

#build a dummy review and predict it using multiple models
dummy_review = data.frame(year=2014,styling=8,acceleration=6,handling=6,`fun factor`=8,`cool factor`=8,features=4,comfort=4,quality=6,practicality=3,value=5)

#predict vehicle country for dummy data using model 1
predict(qdafit1,dummy_review)$class

This returns the following error:

Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
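
A possible angle on the dummy review, sketched on synthetic data and not on the real scores: data.frame(`fun factor` = 8) silently produces a column called fun.factor unless check.names = FALSE is passed, so the new data may not contain the name the formula asks for. The groups "A" and "B", the simulated values, and the names toy, fit, review_bad and review_ok are all assumptions made for illustration.

```r
library(MASS)  # provides qda()

set.seed(1)
n <- 60

# Made-up scores for two hypothetical groups; not the real review data.
toy <- data.frame(
  styling      = c(rnorm(n / 2, 5, 1), rnorm(n / 2, 7, 1)),
  `fun factor` = c(rnorm(n / 2, 4, 1), rnorm(n / 2, 8, 1)),
  group        = factor(rep(c("A", "B"), each = n / 2)),
  check.names  = FALSE
)

fit <- qda(group ~ styling + `fun factor`, data = toy)

# Default check.names = TRUE renames `fun factor` to fun.factor.
review_bad <- data.frame(styling = 8, `fun factor` = 8)

# check.names = FALSE keeps the name the model formula expects.
review_ok <- data.frame(styling = 8, `fun factor` = 8, check.names = FALSE)

names(review_bad)
names(review_ok)

# Prediction on the correctly named one-row data frame.
pred <- predict(fit, newdata = review_ok)$class
pred
```

Renaming the columns to syntactic names up front (for example fun_factor) would be another way to sidestep the issue, at the cost of changing the formula.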

OTStats
Marc