
I want to predict sales using linear regression. This is my data table used for modeling.

> store
     Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
  1:     3  8314               14130                        12                     2006      1              14            2011            1
  2:     3  8977               14130                        12                     2006      1              14            2011            1
  3:     3  7610               14130                        12                     2006      1              14            2011            1
  4:     3  8864               14130                        12                     2006      1              14            2011            1
  5:     3  8107               14130                        12                     2006      1              14            2011            1
 ---                                                                                                                                       
775:     3 12247               14130                        12                     2006      1              14            2011            1
776:     3  4523               14130                        12                     2006      1              14            2011            1
777:     3  6069               14130                        12                     2006      1              14            2011            1
778:     3  5902               14130                        12                     2006      1              14            2011            1
779:     3  6823               14130                        12                     2006      1              14            2011            1
     Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
  1:            0            0           1           0           0           0         5    1     1             1     2015         7
  2:            0            0           1           0           0           0         4    1     1             1     2015         7
  3:            0            0           1           0           0           0         3    1     1             1     2015         7
  4:            0            0           1           0           0           0         2    1     1             1     2015         7
  5:            0            0           1           0           0           0         1    1     1             1     2015         7
 ---                                                                                                                                
775:            0            0           1           0           0           0         1    1     1             0     2013         1
776:            0            0           1           0           0           0         6    1     0             0     2013         1
777:            0            0           1           0           0           0         5    1     0             1     2013         1
778:            0            0           1           0           0           0         4    1     0             1     2013         1
779:            0            0           1           0           0           0         3    1     0             1     2013         1
     DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
  1:      31       30              1              0              0              0             103     52.00              1          0
  2:      30       30              1              0              0              0             103     52.00              1          0
  3:      29       30              1              0              0              0             103     52.00              1          0
  4:      28       30              1              0              0              0             103     52.00              1          0
  5:      27       30              1              0              0              0             103     52.00              1          0
 ---                                                                                                                                 
775:       7        1              1              0              0              0              73     20.75              1          0
776:       5        0              1              0              0              0              73     20.50              1          0
777:       4        0              1              0              0              0              73     20.50              1          0
778:       3        0              1              0              0              0              73     20.50              1          0
779:       2        0              1              0              0              0              73     20.50              1          0
> 

But I get this error:

contrasts can only be applied to factors with at least two levels

I applied what @Scott suggested here, because I don't have any NA values.

I need to know which columns should be converted to factor variables in the model.

  > lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"

$Sales
[1] "numeric"

$CompetitionDistance
[1] "14130"

$CompetitionOpenSinceMonth
[1] "12"

$CompetitionOpenSinceYear
[1] "2006"

$Promo2
[1] "1"

$Promo2SinceWeek
[1] "14"

$Promo2SinceYear
[1] "2011"

$Assortment_a
[1] "1"

$Assortment_b
[1] "0"

$Assortment_c
[1] "0"

$StoreType_a
[1] "1"

$StoreType_b
[1] "0"

$StoreType_c
[1] "0"

$StoreType_d
[1] "0"

$DayOfWeek
[1] "1"

$Open
[1] "1"

$Promo
[1] "0"

$SchoolHoliday
[1] "0"

$DateYear
[1] "numeric"

$DateMonth
[1] "numeric"

$DateDay
[1] "numeric"

$DateWeek
[1] "numeric"

$StateHoliday_0
[1] "1"

$StateHoliday_a
[1] "0"

$StateHoliday_b
[1] "0"

$StateHoliday_c
[1] "0"

$CompetitionOpen
[1] "numeric"

$PromoOpen
[1] "numeric"

$IspromoinSales
[1] "numeric"

$Prediction
[1] "numeric"
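
Note that `ifelse()` returns a result of the same length as its test, so the call above prints only the first level of each column (which is why `DayOfWeek` shows just "1"). A per-column count of distinct values may be easier to read. The following is a minimal sketch, assuming a small hypothetical table `toy` in place of `store`:

```r
# Hypothetical stand-in for `store`: two constant columns, two varying ones
toy <- data.frame(
  Store     = rep(3L, 6),
  Open      = rep(1L, 6),
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 6L),
  Sales     = c(8314, 8977, 7610, 8864, 8107, 4523)
)

# Number of distinct values in every column
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)

# Columns with a single distinct value become one-level factors under as.factor()
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)
```

The same `sapply()` call should work on `store` directly, since a data.table is also a list of columns.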

My model is shown below. Please look at how I wrote the lm call.

M<-matrix(0,nrow=10,ncol = 1)
store <- data[Store == 3,]  # select a single store by its unique number
shuffledIndices <- sample(nrow(store))  # shuffle the row indices to randomize the order
setDT(store)[,Prediction:=0]
z <- nrow(store)
for (i in 1:10) 
{    # 10-fold cross-validation
  sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)  # 10% of the rows are selected
  test <- store[shuffledIndices[sampleIndex],]  # used as the test set
  train <- store[shuffledIndices[-sampleIndex],]  # the rest is used as the training set
  modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) + 
                 as.factor(Promo2)+as.factor(Promo2SinceWeek)+as.factor(Promo2SinceYear)+as.factor(Assortment_a)+as.factor(Assortment_b)+as.factor(Assortment_c)+
                 as.factor(StoreType_a)+as.factor(StoreType_b)+as.factor(StoreType_c)+as.factor(StoreType_d)+as.factor(DayOfWeek)+as.factor(Open)+SchoolHoliday+
                 as.factor(Promo)+as.factor(StateHoliday_0)+as.factor(StateHoliday_a)+as.factor(StateHoliday_b)+as.factor(StateHoliday_c)+
                 as.factor(DateYear)+as.factor(DateMonth)+as.factor(DateDay)+as.factor(DateWeek)+as.factor(CompetitionOpen)+as.factor(PromoOpen)+as.factor(IspromoinSales),train)  # a linear model is fitted to the training set
  store[shuffledIndices[sampleIndex],Prediction:=predict(modell,test)] # predictions are generated for the test set based on the model
  M[i,1]<-round(sqrt(mean((store[shuffledIndices[sampleIndex],Prediction]-test$Sales)^2))/mean(test$Sales),4)  # relative RMSE, computed on the test fold only
}

plot(1:10,M[,1],type='b',xlab="i",ylab="rmse%")

But I always get the error, which I find really strange. How can this be explained? Thank you in advance.

user8810618
  • It's difficult to say without a real reproducible example, but I guess as you are doing cross-validation you have folds where some factors only have one level. You should also check in your whole data set whether the columns you are using as factors in your model have more than one level. – kath Mar 03 '18 at 10:32
  • @kath, thank you for your remark, but have a look at the edited question, where you can see the table used for modeling. – user8810618 Mar 03 '18 at 10:45
  • It gives exactly this error: **Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can only be applied to factors with at least two levels** – user8810618 Mar 03 '18 at 10:47
  • In your sample data, a lot of the columns have only one level (e.g. `CompetitionDistance`). Check with `lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))` on your whole data whether the columns you're using have more than one level. – kath Mar 03 '18 at 10:52
  • @kath, as you proposed, you can look at the edited code: I tried converting the columns that have one level to factors and left the numeric ones unchanged. But I get the same error! What should I do then? – user8810618 Mar 03 '18 at 11:03
  • You are trying to build a model with variables which are constant for all observations. Thus you should exclude all those with only one value (e.g. `CompetitionDistance`, `CompetitionOpenSinceYear`, ...) from the model and then it should work. – kath Mar 03 '18 at 11:08
  • @kath, yes, it works in this case, but don't you think this will exclude many other variables? – user8810618 Mar 03 '18 at 11:17

1 Answer


The problem is that you have constant variables in your model. Wrapped in as.factor(), each of them becomes a factor with a single level, and lm cannot build contrasts for such a factor, which is what the error message says. These variables don't add information and thus should be excluded from the modelling process.
Why? You want to model Sales given all your other variables. A variable that is constant cannot explain any change in Sales, because it never changes itself.
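
To see the effect in isolation, here is a minimal sketch on simulated data. The data frame `toy` and its values are made up for illustration; only the column names echo the question.

```r
set.seed(1)
# Hypothetical data: Promo varies, Open is constant
toy <- data.frame(
  Sales = rnorm(20, mean = 7000, sd = 500),
  Promo = rep(c(0L, 1L), times = 10),
  Open  = rep(1L, 20)
)

# as.factor(Open) has a single level, so lm() stops with the contrasts error;
# tryCatch() captures the message instead of aborting the script
res <- tryCatch(
  lm(Sales ~ as.factor(Promo) + as.factor(Open), data = toy),
  error = function(e) conditionMessage(e)
)
print(res)

# Left numeric, the constant column is collinear with the intercept:
# lm() fits, but its coefficient is reported as NA
fit_num <- lm(Sales ~ as.factor(Promo) + Open, data = toy)
print(coef(fit_num))

# Without the constant column the model fits normally
fit_ok <- lm(Sales ~ as.factor(Promo), data = toy)
print(coef(fit_ok))
```

Either way the constant column contributes nothing to the fit, so dropping it loses no information.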

If you modify your model in the following way, your code should work:

modell <- lm(Sales ~ as.factor(DayOfWeek) + SchoolHoliday + as.factor(Promo) + 
               as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + 
               as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen), 
             data = train)
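
If you would rather not pick the columns by hand, the formula can be built from whichever predictors actually vary in the training set. This is only a sketch under assumptions: `train` here is a small simulated table, and treating every remaining predictor as a factor simply mirrors the question's code.

```r
set.seed(42)
# Hypothetical training set: Open and StoreType_a are constant
train <- data.frame(
  Sales       = rnorm(30, mean = 7000, sd = 500),
  DayOfWeek   = rep(1:6, times = 5),
  Promo       = rep(c(0L, 0L, 0L, 1L, 1L), times = 6),
  Open        = rep(1L, 30),
  StoreType_a = rep(1L, 30)
)

predictors <- setdiff(names(train), "Sales")

# Keep only predictors with more than one distinct value
# (train[[p]] works for data.frame and data.table alike)
keep <- sapply(predictors, function(p) length(unique(train[[p]])) > 1)
varying <- predictors[keep]

# Build Sales ~ as.factor(DayOfWeek) + as.factor(Promo) programmatically
form <- reformulate(sprintf("as.factor(%s)", varying), response = "Sales")
print(form)

fit <- lm(form, data = train)
print(coef(fit))
```

Inside a cross-validation loop this check would have to be repeated for each fold's `train`. Even then, `predict()` can still fail when a test fold contains a factor level that never appeared in the training rows.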

One additional remark:
You are transforming all your variables into factors. PromoOpen, for example, looks like a numeric variable, so it might be better to keep it numeric. As a factor it gets one coefficient per distinct value, while as a numeric variable it gets a single slope. Which is appropriate depends on your data and the desired interpretation of your model.
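
A small sketch of that difference, using simulated data (the values of `PromoOpen` and the linear relationship are assumptions made for illustration):

```r
set.seed(7)
# Hypothetical data: PromoOpen takes 9 distinct values, 4 rows each
toy <- data.frame(PromoOpen = rep(seq(20, 52, by = 4), each = 4))
toy$Sales <- 6000 + 30 * toy$PromoOpen + rnorm(nrow(toy), sd = 200)

# As a factor: intercept plus one coefficient per additional distinct value
fit_factor <- lm(Sales ~ as.factor(PromoOpen), data = toy)

# As numeric: intercept plus a single slope
fit_numeric <- lm(Sales ~ PromoOpen, data = toy)

print(length(coef(fit_factor)))
print(length(coef(fit_numeric)))
```

With many distinct values, the factor version spends a parameter on each of them and cannot predict for a value it has not seen, whereas the numeric version assumes a linear effect.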

kath