Linear regression on subsets with dependent variable per column using dlply() in R

Question

I would like to automatically produce linear regressions for a data frame for each category separately.

My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, several columns, which should be used as the independent variable.

head(slope)
   timepoint   Abs      In1      In2      In3     Out1     Out2     Out3 ...
1:        t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2:        t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3:        t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...

All in all for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination. Such as In1~Abs[t0], In1~Abs[t1] and so on for each column. Of course I can do this manually, but I guess there must be a more elegant way to do the work.

I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.

So I somehow tried to combine the answers from previous questions I have found: On individual variables per column and on subsets per category

I came up with a function like this:

lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )

But I get the following error:

Error in eval.quoted(.variables, data) : 
   envir must be either NULL, a list, or an environment.

Hope someone can help me out.

Thanks a lot in advance!

Could you be more specific? One of the problems I am having reading this is understanding if you want simple linear regression for each column or multiple linear regression for the entire subset t0, t1, .... — akash87, Jun 09 '16 at 11:48
I'm sorry for the confusion. I'm trying to do a simple linear regression for each column. — neko, Jun 10 '16 at 11:00
Do you just want the slope, the intercept, etc. What exactly do you want? Running the `lm()` function summarizes a lot of information. — akash87, Jun 10 '16 at 13:15
Primarily I want the coefficient from each linear regeression for further calculation. However, it would be very helpful not only to extract the coefficients, but also to have the models anyway. In the case I want to plot the regressions to see how good the values fit. — neko, Jun 11 '16 at 21:17

score 1 · Answer 1 · answered Jun 13 '16 at 22:32

The dplyr package in R does not do well in accepting formulas in the form of y~x into its functions based on my research. So the other alternative is to calculate it someone manually. Now let me first inform you that slope = cor(x,y)*sd(y)/sd(x) (reference found here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that the intercept = mean(y) - slope*mean(x). Simple linear regression requires that we use the centroid as our point of reference when finding our intercept because it is an unbiased estimator. Using a single point will only get you the intercept of that individual point and not the overall intercept.

Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.

Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.

To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients for , you should use:

df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(k1 %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))

Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl'. For you, you should use

do(head(data.frame(cor(.[,c(2:40)])), n = 1)))

since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a correlation matrix, but since you are interested in finding the slope independent of each other x-variable, you only need the row that has the correlation coefficient of your y-variable equal to 1 (r_yy = 1).

To get standard deviation and means for each group, each variable, use

sds     <- data.frame(k1 %>% group_by(cyl) %>% summarise_each(funs(sd)))
means   <- data.frame(k1 %>% group_by(cyl) %>% summarise_each(funs(mean)))

Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.

rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]

Now we need to calculate the sd(y)/sd(x). The best way I have done this, and seen it done is using an apply affiliated function.

sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))

I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs which is your y-variable. So use that.

Now the rest is pretty straight forward. Since everything is saved as a data frame, to find slope, all it you need to do is

slopes    <- sdst*corrs
inter     <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))

Again here, since our y-variable is in the first column, we use x[1]. To check if all is well, your slopes for your y-variable should be 1 and the intercept should be 0.

Sorry for my long silence, I have tried your answer but unfortunately I had some errors so I decided to switch to a different approach. But thank you anyway for trying so hard to help me! — neko, May 16 '17 at 08:45
I actually did already when you initially posted it, and now I did it again. But it says votes of people than reputation 15 are recorded but not shown...? So I guess it's my problem, that I am new to this community. — neko, May 18 '17 at 06:49

score 0 · Accepted Answer · answered May 16 '17 at 08:42

I have solved the issue with a simpler approach, so I wanted to update the answer.

To make life easier I converted the data frame structure so that all columns are converted into rows with the melt() function of the reshape package.

melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")

The output's column name is by default "value".

Then create one column that adds both predictors with paste().

slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")

Run a function through the dataset to create separate models for each treatment combination.

models <- dlply(slope, ~ FullTreat, function(df) { 
          lm(value ~ Abs, data = df)
          })

To extract the coefficents simply run

coefs <- ldply(models, coef)

Then split the FullTreat column into separate columns again with colsplit() also from reshape. Plus, add the Intercept and slope to the new data frame:

coefs <- cbind(colsplit(coefs$FullTreat, split="_",
         c("Sites","Timepoint")), coefs[,2:3])

I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.

Linear regression on subsets with dependent variable per column using dlply() in R

2 Answers2