
I included dummies in my data for the variables "cut" and "clarity":

diamonds_7$cut_fair      <- ifelse(diamonds_7$cut == "Fair", 1, 0)
diamonds_7$cut_good      <- ifelse(diamonds_7$cut == "Good", 1, 0)
diamonds_7$cut_very_good <- ifelse(diamonds_7$cut == "Very Good", 1, 0)
diamonds_7$cut_premium   <- ifelse(diamonds_7$cut == "Premium", 1, 0)
diamonds_7$cut_ideal     <- ifelse(diamonds_7$cut == "Ideal", 1, 0)

diamonds_7$clarity_SI2  <- ifelse(diamonds_7$clarity == "SI2", 1, 0)
diamonds_7$clarity_SI1  <- ifelse(diamonds_7$clarity == "SI1", 1, 0)
diamonds_7$clarity_VS1  <- ifelse(diamonds_7$clarity == "VS1", 1, 0)
diamonds_7$clarity_VS2  <- ifelse(diamonds_7$clarity == "VS2", 1, 0)
diamonds_7$clarity_VVS2 <- ifelse(diamonds_7$clarity == "VVS2", 1, 0)
diamonds_7$clarity_VVS1 <- ifelse(diamonds_7$clarity == "VVS1", 1, 0)
diamonds_7$clarity_I1   <- ifelse(diamonds_7$clarity == "I1", 1, 0)
diamonds_7$clarity_IF   <- ifelse(diamonds_7$clarity == "IF", 1, 0)
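
A more compact way to build the same dummy columns in one step would be `model.matrix()`; a minimal sketch, assuming `cut` and `clarity` can be coerced to factors:

# One 0/1 column per level of cut and clarity (full one-hot encoding,
# matching the manual dummies above); contrasts = FALSE keeps all levels.
dummies <- model.matrix(
  ~ cut + clarity,
  data = diamonds_7,
  contrasts.arg = list(
    cut     = contrasts(factor(diamonds_7$cut),     contrasts = FALSE),
    clarity = contrasts(factor(diamonds_7$clarity), contrasts = FALSE)
  )
)
head(dummies)  # columns cutFair ... cutIdeal, clarityI1 ... clarityIF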

However, when I use the `lm()` function for my regression, the variables `clarity_I1` and `cut_fair` are not part of the output. This is my `lm()` call:

lmTotal <- lm(train$price~train$carat+train$depth+train$table+train$cut+train$clarity, data=train)

Output:

Call:
lm(formula = train$price ~ train$carat + train$depth + train$table + 
    train$cut + train$clarity, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9186.3  -638.7  -108.8   490.3 11151.1 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -3154.082    481.143  -6.555 5.62e-11 ***
train$carat         8550.941     14.963 571.476  < 2e-16 ***
train$depth          -39.271      5.234  -7.504 6.34e-14 ***
train$table          -28.346      3.869  -7.327 2.40e-13 ***
train$cutGood        580.360     43.904  13.219  < 2e-16 ***
train$cutIdeal       844.243     43.675  19.330  < 2e-16 ***
train$cutPremium     748.720     42.050  17.805  < 2e-16 ***
train$cutVery Good   750.348     42.075  17.834  < 2e-16 ***
train$clarityIF     4896.382     67.915  72.096  < 2e-16 ***
train$claritySI1    3201.970     58.273  54.948  < 2e-16 ***
train$claritySI2    2365.312     58.607  40.359  < 2e-16 ***
train$clarityVS1    4056.855     59.433  68.259  < 2e-16 ***
train$clarityVS2    3837.326     58.585  65.501  < 2e-16 ***
train$clarityVVS1   4611.661     62.801  73.432  < 2e-16 ***
train$clarityVVS2   4575.588     61.204  74.760  < 2e-16 ***
  • Just a side note: when you give `lm()` the `data = train` argument, you don't need to use `train$` for all your variables. The point of `data = train` is to tell `lm` where to look for the columns, so you can do `lm(price ~ carat + depth + ..., data = train)`. – Gregor Thomas Jun 07 '22 at 18:59
  • Also, `lm()` (and most modeling functions in R) will convert categorical variables to dummy variables automatically, so you will get the same results with `lm(price ~ carat + depth + table + cut + clarity, data = train)`. – Gregor Thomas Jun 07 '22 at 19:00
  • As for your main question: if you dummy all levels of a variable, there will be linear dependence with the intercept. Typically the first "reference" level of a variable is not fit with its own coefficient; the `(Intercept)` incorporates that part of the fit (see the sketch after these comments). Any decent tutorial on regression should explain this nicely; look for the term "reference level". – Gregor Thomas Jun 07 '22 at 19:03
  • @GregorThomas thank you very much for your answer and lessons!! I did not know that. Could you please explain how to use the `(Intercept)` term in my case? I looked at a lot of tutorials, but I have not found the answer yet. Thank you in advance for your help, really appreciated!! – Fleur Hoogendoorn Jun 07 '22 at 19:08
  • The model output you show already has an `(Intercept)` term, so you do not need to do anything other than adjust your expectations. The answer at the linked duplicate gives a decent start, but if you want more info, a textbook will be more thorough. – Gregor Thomas Jun 07 '22 at 19:09
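
Putting those comments together, here is a minimal sketch of the default dummy coding and of changing the reference level. It assumes the `diamonds` data from ggplot2 and a hypothetical 80% train split; `cut` and `clarity` are converted from ordered to plain factors so that `lm()` uses treatment (dummy) coding, matching the output above.

library(ggplot2)

set.seed(1)
train <- diamonds[sample(nrow(diamonds), floor(0.8 * nrow(diamonds))), ]

# diamonds stores cut and clarity as *ordered* factors; converting them
# to plain factors makes lm() use treatment (dummy) coding.
train$cut     <- factor(train$cut,     ordered = FALSE)
train$clarity <- factor(train$clarity, ordered = FALSE)

# No manual dummies needed: lm() dummy-codes the factors itself, and
# data = train lets the formula use bare column names.
fit <- lm(price ~ carat + depth + table + cut + clarity, data = train)
coef(fit)  # no cutFair or clarityI1 row: those are the reference
           # levels, absorbed into (Intercept)

# To make a different level the baseline, relevel the factor:
train$cut <- relevel(train$cut, ref = "Ideal")
fit2 <- lm(price ~ carat + depth + table + cut + clarity, data = train)
coef(fit2)  # cutIdeal is now the baseline; cutFair gets a coefficient

Each `cut` or `clarity` coefficient is then read as the expected price difference from the reference level, holding the other predictors fixed.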

0 Answers