6

I have a dataset (data frame) with 5 columns all containing numeric values.

I'm looking to run a simple linear regression for each pair in the dataset.

For example, If the columns were named A, B, C, D, E, I want to run lm(A~B), lm(A~C), lm(A~D), ...., lm(D~E),... and, then I want to plot the data for each pair along with the regression line.

I'm pretty new to R so I'm sort of spinning my wheels on how to actually accomplish this. Should I use ddply? or lapply? I'm not really sure how to tackle this.

Metrics
  • 15,172
  • 7
  • 54
  • 83
mrp
  • 689
  • 2
  • 11
  • 28
  • Welcome to the SO. You can use `combn` of cols and `apply` functions for your problem – Metrics Sep 22 '13 at 18:59
  • 1
    Please include a minimal, [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Thanks. – Henrik Sep 22 '13 at 19:03
  • 1
    Sorry, I was trying to include data along with code but it wasn't showing up that it would post properly (i'm also new to SO). – mrp Sep 22 '13 at 19:08
  • 1
    you can probably use the `ggpairs` function from the `GGally` package: http://www.r-bloggers.com/example-9-17-much-better-pairs-plots/ – Ben Bolker Sep 22 '13 at 19:09
  • I'm also curious whether there's a generic `pairapply` function out there somewhere -- not that I can find ... although the `combn` solution below basically does it. – Ben Bolker Sep 22 '13 at 19:48
  • @ Ben Bolker, I'm not familiar w/ GGally, I will check it out though. thank you. – mrp Sep 22 '13 at 20:43

3 Answers3

7

Here's one solution using combn

 combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)

Example:

set.seed(1)
DF <- data.frame(A=rnorm(50, 100, 3),
                 B=rnorm(50, 100, 3),
                 C=rnorm(50, 100, 3),
                 D=rnorm(50, 100, 3),
                 E=rnorm(50, 100, 3))

Updated: adding @Henrik suggestion (see comments)

# only the coefficients
> results <- combn(names(DF), 2, function(x){coefficients(lm(DF[, x]))}, simplify = FALSE)
> vars <- combn(names(DF), 2)
> names(results) <- vars[1 , ] # adding names to identify variables in the reggression
> results
$A
 (Intercept)            B 
103.66739418  -0.03354243 

$A
(Intercept)           C 
97.88341555  0.02429041 

$A
(Intercept)           D 
122.7606103  -0.2240759 

$A
(Intercept)           E 
99.26387487  0.01038445 

$B
 (Intercept)            C 
99.971253525  0.003824755 

$B
 (Intercept)            D 
102.65399702  -0.02296721 

$B
(Intercept)           E 
96.83042199  0.03524868 

$C
(Intercept)           D 
 80.1872211   0.1931079 

$C
(Intercept)           E 
 89.0503893   0.1050202 

$D
 (Intercept)            E 
107.84384655  -0.07620397 
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
  • 2
    Perhaps add the response variable as name to each list element? Something like `vars <- combn(names(DF), 2)`; `names(the-coef-list) <- vars[1 , ]`. Maybe. – Henrik Sep 22 '13 at 19:37
  • This is awesome, thank you very much. I was able to use the same combn() code to plot each regression: combn(names(DF), 2, function(x){coefficients(plot(DF[, x]))}, simplify = FALSE). How would I add in the regression line to each plot though? Can I nest a lines() function in there somewhere? I was tinkering around with the formula but i kept getting errors. Or perhaps run a separate abline()? Thanks again! – mrp Sep 22 '13 at 20:37
2

I would recommend to also look at the correlation matrix (cor(DF)), which is usually the best way to discover linear relationships between variables. The correlation is tightly linked to the covariance and the slopes of a simple linear regression. The computation below exemplifies this link.

Sample data:

set.seed(1)
DF <- data.frame(
  A=rnorm(50, 100, 3),
  B=rnorm(50, 100, 3),
  C=rnorm(50, 100, 3),
  D=rnorm(50, 100, 3),
  E=rnorm(50, 100, 3)
)

The regression slope is cov(x, y) / var(x)

beta = cov(DF) * (1/diag(var(DF)))

            A            B           C           D           E
A  1.00000000 -0.045548503 0.028448192 -0.32982367  0.01800795
B -0.03354243  1.000000000 0.003298708 -0.02489518  0.04501362
C  0.02429041  0.003824755 1.000000000  0.24269838  0.15550116
D -0.22407592 -0.022967212 0.193107904  1.00000000 -0.08977834
E  0.01038445  0.035248685 0.105020194 -0.07620397  1.00000000

The intercept is mean(y) - beta * mean(x)

colMeans(DF) - beta * colMeans(DF)

             A         B         C         D         E
A 1.421085e-14 104.86992  97.44795 133.38310  98.49512
B 1.037180e+02   0.00000 100.02095 102.85026  95.83477
C 9.712461e+01  99.16182   0.00000  75.38373  84.06356
D 1.226899e+02 102.53263  80.87529   0.00000 109.22915
E 9.886859e+01  96.38451  89.41391 107.51930   0.00000
unique2
  • 2,162
  • 2
  • 18
  • 23
1

Using combn for all combination of names of column (in the following example I assumed you want combination of two columns only) and Map for running over loops.

Example using mtcars data from R:

colc<-names(mtcars)
colcc<-combn(colc,2)
colcc<-data.frame(colcc)
kk<-Map(function(x)lm(as.formula(paste(colcc[1,x],"~",paste(colcc[2,x],collapse="+"))),data=mtcars), as.list(1:nrow(colcc)))

 head(kk)
[[1]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876  


[[2]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  


[[3]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)           hp  
   30.09886     -0.06823  


[[4]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)         drat  
     -7.525        7.678  


[[5]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  


[[6]]

Call:
lm(formula = as.formula(paste(colcc[1, x], "~", paste(colcc[2, 
    x], collapse = "+"))), data = mtcars)

Coefficients:
(Intercept)         qsec  
     -5.114        1.412  
Metrics
  • 15,172
  • 7
  • 54
  • 83