
Is there a way to write a shorthand formula that includes all variables except a few?

e.g.,

Instead of

modreg_trein <- lm(Life.expectancy ~ Status + Adult.Mortality + infant.deaths + percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + Polio + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Income.composition.of.resources + Schooling, life_2015_clean)

I would like to write something like

modreg_trein <- lm(Life.expectancy ~ . - Alcohol - Total.expenditure, data = life_2015_clean)

EDIT: MWE

Data available in: https://www.kaggle.com/augustus0498/life-expectancy-who?select=led.csv

Steps to reproduce:

life <- read.csv('./data/csv/Life_Expectancy_Data.csv')
life_2015 <- subset(life, Year == "2015")
life_2015_clean <- subset(life_2015, select = -c(Country, Year))
life_2015_clean$Status <- as.numeric(as.factor(life_2015_clean$Status))

Finally, manually listing every variable except Alcohol and Total.expenditure gives a successful regression:

modreg_trein <- lm(Life.expectancy ~ Status + Adult.Mortality + infant.deaths + percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + Polio + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Income.composition.of.resources + Schooling, life_2015_clean)
summary(modreg_trein)
Call:
lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
    percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + 
    Polio + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + 
    thinness.5.9.years + Income.composition.of.resources + Schooling, 
    data = life_2015_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.3326 -1.4047  0.0247  1.5478  7.9440 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      5.083e+01  3.024e+00  16.807  < 2e-16 ***
Status                          -3.824e-01  8.332e-01  -0.459   0.6472    
Adult.Mortality                 -2.077e-02  3.619e-03  -5.738 8.29e-08 ***
infant.deaths                    6.626e-02  3.291e-02   2.013   0.0465 *  
percentage.expenditure           5.575e-03  7.362e-03   0.757   0.4505    
Hepatitis.B                      4.311e-02  2.264e-02   1.904   0.0595 .  
Measles                         -5.027e-05  5.741e-05  -0.876   0.3832    
BMI                             -9.085e-03  1.554e-02  -0.585   0.5600    
under.five.deaths               -4.811e-02  2.359e-02  -2.040   0.0437 *  
Polio                            1.179e-02  1.271e-02   0.928   0.3553    
Diphtheria                      -1.148e-02  2.636e-02  -0.435   0.6641    
HIV.AIDS                        -4.858e-01  2.243e-01  -2.166   0.0324 *  
GDP                              5.950e-06  3.011e-05   0.198   0.8437    
Population                      -7.918e-10  9.586e-09  -0.083   0.9343    
thinness..1.19.years            -1.192e-01  2.343e-01  -0.509   0.6119    
thinness.5.9.years              -2.030e-02  2.291e-01  -0.089   0.9296    
Income.composition.of.resources  3.331e+01  4.991e+00   6.674 9.93e-10 ***
Schooling                       -5.244e-02  2.407e-01  -0.218   0.8279    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.7 on 112 degrees of freedom
  (53 observations deleted due to missingness)
Multiple R-squared:  0.901, Adjusted R-squared:  0.886 
F-statistic: 59.99 on 17 and 112 DF,  p-value: < 2.2e-16

But the shorthand version does not give the same result:

modreg_trein <- lm(Life.expectancy ~ . - Alcohol - Total.expenditure, life_2015_clean)
summary(modreg_trein)

Output:


Call:
lm(formula = Life.expectancy ~ . - Alcohol - Total.expenditure, 
    data = life_2015_clean)

Residuals:
ALL 2 residuals are 0: no residual degrees of freedom!

Coefficients: (16 not defined because of singularities)
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                     82.81164        NaN     NaN      NaN
Status                                NA         NA      NA       NA
Adult.Mortality                 -0.06772        NaN     NaN      NaN
infant.deaths                         NA         NA      NA       NA
percentage.expenditure                NA         NA      NA       NA
Hepatitis.B                           NA         NA      NA       NA
Measles                               NA         NA      NA       NA
BMI                                   NA         NA      NA       NA
under.five.deaths                     NA         NA      NA       NA
Polio                                 NA         NA      NA       NA
Diphtheria                            NA         NA      NA       NA
HIV.AIDS                              NA         NA      NA       NA
GDP                                   NA         NA      NA       NA
Population                            NA         NA      NA       NA
thinness..1.19.years                  NA         NA      NA       NA
thinness.5.9.years                    NA         NA      NA       NA
Income.composition.of.resources       NA         NA      NA       NA
Schooling                             NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
  (181 observations deleted due to missingness)
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 1 and 0 DF,  p-value: NA
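The two fits above report different numbers of deleted observations (53 versus 181), which suggests that missing values are involved. A minimal sketch of how one might check this, using a small made-up data frame `d` (the columns `y`, `x1`, and `x2` are hypothetical stand-ins, not the real data):

```r
# Hypothetical toy data standing in for life_2015_clean.
d <- data.frame(
  y  = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(NA, 1, NA, NA, 0, NA)   # mostly missing, like a sparse column
)

# Number of missing values in each column.
na_counts <- colSums(is.na(d))
print(na_counts)

# Rows with no NA across all columns, versus rows with no NA
# once the sparse column is left out of the data.
n_all     <- sum(complete.cases(d))
n_without <- sum(complete.cases(d[, setdiff(names(d), "x2")]))
print(c(all_columns = n_all, without_x2 = n_without))
```

Running the same two checks on `life_2015_clean`, with `Alcohol` and `Total.expenditure` in place of `x2`, should show whether those columns account for the extra deleted rows.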

BuddhiLW
  • 2
    That should work. Does it not? What error do you get? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that can be used to test and verify possible solutions. – MrFlick Oct 20 '21 at 21:52
  • 1
    Possible duplicate of https://stackoverflow.com/questions/22580379/how-do-i-exclude-specific-variables-from-a-glm-in-r or https://stackoverflow.com/questions/5251507/how-to-succinctly-write-a-formula-with-many-variables-from-a-data-frame? – Martin Gal Oct 20 '21 at 21:52
  • I think the problem is different. But, shouldn't @MartinGal – BuddhiLW Oct 20 '21 at 22:28
  • 1
    Interesting. My guess: `Alcohol` contains just a few values; almost every data point is `NA`. I think this causes the problem because those observations are deleted. – Martin Gal Oct 20 '21 at 22:50
  • I can't replicate the problem with the data you provided because it requires registration. The question should stand on its own without external links. It works fine in this simple example: `set.seed(17); dd <- as.data.frame(setNames(replicate(10, rnorm(55), simplify = FALSE), paste0("col", 1:10)))` where both `lm(col1~col2 + col3 + col4 + col6 + col7 + col8 + col9, dd)` and `lm(col1~. - col5 - col10, dd)` return the same result – MrFlick Oct 20 '21 at 22:55
  • @MrFlick Create a column with multiple `NA`s and "subtract" that column. My guess is: The `NA` values cause the deletion of those rows, when `. - column_with_na` is used. – Martin Gal Oct 21 '21 at 10:31
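The guess in the last comment can be tried on made-up data. The sketch below assumes a hypothetical data frame `d` with a mostly-`NA` column `x3`; none of these names come from the original dataset. As far as I understand it, `- x3` removes the term from the model, but `x3` is still a variable of the formula, so it is still pulled into the model frame and the default `na.omit` drops every row where it is missing. Two possible workarounds are shown: dropping the columns from the data before using `.`, or building the formula from the remaining column names with `reformulate()`.

```r
# Toy data (hypothetical): y depends on x1 and x2; x3 is mostly NA.
set.seed(1)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
d$x3[1:40] <- NA                       # only 10 rows have a value for x3

# "- x3" removes the term, but x3 still enters the model frame,
# so rows with NA in x3 are dropped before fitting.
fit_minus <- lm(y ~ . - x3, data = d)
print(nobs(fit_minus))                 # 10 rows used
print(names(model.frame(fit_minus)))   # x3 is still listed

# Workaround 1: drop the column from the data, then use ".".
fit_subset <- lm(y ~ ., data = d[, setdiff(names(d), "x3")])

# Workaround 2: build the formula from the remaining column names.
f <- reformulate(setdiff(names(d), c("y", "x3")), response = "y")
fit_reform <- lm(f, data = d)

print(nobs(fit_subset))                # 50 rows used
print(all.equal(coef(fit_subset), coef(fit_reform)))
```

Applied to the question's data, the first workaround would look something like `lm(Life.expectancy ~ ., data = subset(life_2015_clean, select = -c(Alcohol, Total.expenditure)))`, though I have not run it against the actual file.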
