I want to predict sales using linear regression. This is the data table I use for modeling (all rows belong to store 3).
> store
Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
1: 3 8314 14130 12 2006 1 14 2011 1
2: 3 8977 14130 12 2006 1 14 2011 1
3: 3 7610 14130 12 2006 1 14 2011 1
4: 3 8864 14130 12 2006 1 14 2011 1
5: 3 8107 14130 12 2006 1 14 2011 1
---
775: 3 12247 14130 12 2006 1 14 2011 1
776: 3 4523 14130 12 2006 1 14 2011 1
777: 3 6069 14130 12 2006 1 14 2011 1
778: 3 5902 14130 12 2006 1 14 2011 1
779: 3 6823 14130 12 2006 1 14 2011 1
Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
1: 0 0 1 0 0 0 5 1 1 1 2015 7
2: 0 0 1 0 0 0 4 1 1 1 2015 7
3: 0 0 1 0 0 0 3 1 1 1 2015 7
4: 0 0 1 0 0 0 2 1 1 1 2015 7
5: 0 0 1 0 0 0 1 1 1 1 2015 7
---
775: 0 0 1 0 0 0 1 1 1 0 2013 1
776: 0 0 1 0 0 0 6 1 0 0 2013 1
777: 0 0 1 0 0 0 5 1 0 1 2013 1
778: 0 0 1 0 0 0 4 1 0 1 2013 1
779: 0 0 1 0 0 0 3 1 0 1 2013 1
DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
1: 31 30 1 0 0 0 103 52.00 1 0
2: 30 30 1 0 0 0 103 52.00 1 0
3: 29 30 1 0 0 0 103 52.00 1 0
4: 28 30 1 0 0 0 103 52.00 1 0
5: 27 30 1 0 0 0 103 52.00 1 0
---
775: 7 1 1 0 0 0 73 20.75 1 0
776: 5 0 1 0 0 0 73 20.50 1 0
777: 4 0 1 0 0 0 73 20.50 1 0
778: 3 0 1 0 0 0 73 20.50 1 0
779: 2 0 1 0 0 0 73 20.50 1 0
>
When I fit the model, I get this error:
contrasts can only be applied to factors with at least two levels
I applied what @Scott suggested here, since I don't have any NA values.
I need to know which columns should be converted to factor variables in the model.
> lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"
$Sales
[1] "numeric"
$CompetitionDistance
[1] "14130"
$CompetitionOpenSinceMonth
[1] "12"
$CompetitionOpenSinceYear
[1] "2006"
$Promo2
[1] "1"
$Promo2SinceWeek
[1] "14"
$Promo2SinceYear
[1] "2011"
$Assortment_a
[1] "1"
$Assortment_b
[1] "0"
$Assortment_c
[1] "0"
$StoreType_a
[1] "1"
$StoreType_b
[1] "0"
$StoreType_c
[1] "0"
$StoreType_d
[1] "0"
$DayOfWeek
[1] "1"
$Open
[1] "1"
$Promo
[1] "0"
$SchoolHoliday
[1] "0"
$DateYear
[1] "numeric"
$DateMonth
[1] "numeric"
$DateDay
[1] "numeric"
$DateWeek
[1] "numeric"
$StateHoliday_0
[1] "1"
$StateHoliday_a
[1] "0"
$StateHoliday_b
[1] "0"
$StateHoliday_c
[1] "0"
$CompetitionOpen
[1] "numeric"
$PromoOpen
[1] "numeric"
$IspromoinSales
[1] "numeric"
$Prediction
[1] "numeric"
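Note that `ifelse()` with a length-one test returns only the first element of `levels(factor(x))`, so the listing above shows just the first level of each integer column, not how many levels there are. A minimal sketch of a check that counts distinct values per column instead, on a toy data frame (the `toy` table, its values, and the names `n_levels` and `constant_cols` are made up for illustration, not my real data):

```r
# Toy data standing in for the rows of a single store (illustrative values only).
toy <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 12247),
  Store     = c(3L, 3L, 3L, 3L, 3L, 3L),   # constant: only one store is selected
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 1L),
  Promo     = c(1L, 1L, 1L, 1L, 1L, 0L)
)

# Count the distinct values in every column.
n_levels <- sapply(toy, function(x) length(unique(x)))
print(n_levels)

# Columns with fewer than two distinct values would become one-level factors.
constant_cols <- names(n_levels)[n_levels < 2]
print(constant_cols)
```

Running the same `sapply()` call on `store` should show which columns have only one distinct value after filtering to a single store.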
My model is shown below. Please look at how I wrote the lm call.
library(data.table)
M <- matrix(0, nrow = 10, ncol = 1)
store <- data[Store == 3,] # select a single store by its unique number
shuffledIndices <- sample(nrow(store)) # shuffle the row order
setDT(store)[, Prediction := 0]
z <- nrow(store)
for (i in 1:10)
{ # 10-fold cross-validation
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z) # 10% of the rows are selected for this fold
test <- store[shuffledIndices[sampleIndex],] # they are used as the test set
train <- store[shuffledIndices[-sampleIndex],] # the remaining rows are used as the training set
modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) +
as.factor(Promo2)+as.factor(Promo2SinceWeek)+as.factor(Promo2SinceYear)+as.factor(Assortment_a)+as.factor(Assortment_b)+as.factor(Assortment_c)+
as.factor(StoreType_a)+as.factor(StoreType_b)+as.factor(StoreType_c)+as.factor(StoreType_d)+as.factor(DayOfWeek)+as.factor(Open)+SchoolHoliday+
as.factor(Promo)+as.factor(StateHoliday_0)+as.factor(StateHoliday_a)+as.factor(StateHoliday_b)+as.factor(StateHoliday_c)+
as.factor(DateYear)+as.factor(DateMonth)+as.factor(DateDay)+as.factor(DateWeek)+as.factor(CompetitionOpen)+as.factor(PromoOpen)+as.factor(IspromoinSales),train) # a linear model is fitted to the training set
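pred <- predict(modell, test) # predictions are generated for the test set based on the model
store[shuffledIndices[sampleIndex], Prediction := pred] # keep them in the Prediction column
M[i,1] <- round(sqrt(mean((pred - test$Sales)^2)) / mean(test$Sales), 4) # RMSE relative to mean sales, computed on the test fold only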
}
plot(1:10,M[,1],type='b',xlab="i",ylab="rmse%")
But I still get the error every time. How can this be explained? Thank you in advance.
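For reference, here is a minimal sketch on toy data that seems to trigger the same kind of error. It assumes the cause is a column with a single distinct value being wrapped in `as.factor()`; the data frame `toy` and its columns `y`, `const`, and `vary` are hypothetical and only stand in for my real columns:

```r
set.seed(1)

# Toy data: one constant column and one column with two distinct values.
toy <- data.frame(
  y     = rnorm(20),
  const = rep(1L, 20),          # a single distinct value
  vary  = rep(c(0L, 1L), 10)    # two distinct values
)

# A factor with two levels fits without problems.
fit_ok <- lm(y ~ as.factor(vary), data = toy)

# Adding the one-level factor makes lm() stop while building the contrasts;
# tryCatch() captures the error message instead of aborting the script.
res <- tryCatch(
  lm(y ~ as.factor(const) + as.factor(vary), data = toy),
  error = function(e) conditionMessage(e)
)
print(res)
```

If this is the same mechanism, then in my call every `as.factor()` term built from a column that is constant within store 3 would be a candidate for the problem.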