I would like to automate a simple multiple regression over the subsets defined by the unique combinations of several grouping variables. I have a data frame with six grouping variables, df1[,1:6], three independent variables, df1[,8:10], and a response, df1[,7].
Here is an excerpt of the data:
structure(list(Surface = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("NiAu", "Sn"), class = "factor"), Supplier = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), ParticleSize = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("3", "5"), class = "factor"), T1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("130", "144"), class = "factor"), T2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "200", class = "factor"), O2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1300", class = "factor"), Shear = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15), Gap = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417), Clearance = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39), Void = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)), .Names = c("Surface", "Supplier", "ParticleSize","T1", "T2", "O2", "Shear", "Gap", "Clearance", "Void"), class = "data.frame", row.names = c(NA, -11L))
unique(df1[,1:6]) returns 5 combinations of the grouping factors, so there should be 5 subsets to which I apply lm(). My call looks like this:
df1.fit.by <- with(df1, by(df1, df1[, 1:6], function(x) lm(Shear ~ Gap + Clearance + Void, data = x)))
sapply(df1.fit.by, coef)
Problem 1: it returns a list with 16 entries. Apparently by() evaluates every possible combination of the six grouping variables: columns 5 and 6 (T2, O2) have only one level in the excerpt, but columns 1 to 4 have two levels each, which gives 2^4 = 16. It should only use the combinations that actually exist in the data, though, so I suppose by() is not the right function for this. Any suggestions?
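To make the 5 versus 16 count concrete, here is a minimal sketch that uses only the grouping columns of the excerpt. The name `grp` is just an illustration, and the values are retyped by hand from the dput() output. It also shows that base split() with drop = TRUE appears to keep only the non-empty cells, which may be one direction to look in; I have not checked it against the full data.

```r
# Grouping columns of the excerpt, retyped by hand (`grp` is a hypothetical name)
grp <- data.frame(
  Surface      = factor(rep(c("NiAu", "Sn"), times = c(5, 6))),
  Supplier     = factor(c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B")),
  ParticleSize = factor(rep(c(3, 5), times = c(3, 8))),
  T1           = factor(rep(c(144, 130), times = c(9, 2))),
  T2           = factor(rep(200, 11)),
  O2           = factor(rep(1300, 11))
)

# Combinations that actually occur in the data
n_observed <- nrow(unique(grp))

# Cells in the full crossing of all factor levels, which is what by() iterates over
n_crossed <- prod(sapply(grp, nlevels))

# split() on the row indices; drop = TRUE discards the empty cells of the crossing
n_split <- length(split(seq_len(nrow(grp)), grp, drop = TRUE))

c(observed = n_observed, crossed = n_crossed, split_drop = n_split)
```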
Problem 2: I find it easier to refer to columns by index rather than by name, so I initially tried to write the model as lm(df1[,7]~df1[,8]+df1[,9]). That did not work, because it always accesses the entire df1 data frame instead of the subset. So I should probably pass the row indices of each factor combination to lm() rather than a complete data frame.
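A possible way to keep working with column positions, sketched under assumptions: build the formula from names(df1) at the chosen indices with reformulate(), so that lm() looks the variables up in whatever subset is passed as data. The names `fml`, `subsets` and `fits` are made up for this sketch, and the data frame is the excerpt retyped by hand. Each group in the excerpt has only 2 or 3 rows, so several coefficients come back as NA here.

```r
# Excerpt retyped by hand from the dput() output
df1 <- data.frame(
  Surface      = factor(rep(c("NiAu", "Sn"), times = c(5, 6))),
  Supplier     = factor(c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B")),
  ParticleSize = factor(rep(c(3, 5), times = c(3, 8))),
  T1           = factor(rep(c(144, 130), times = c(9, 2))),
  T2           = factor(rep(200, 11)),
  O2           = factor(rep(1300, 11)),
  Shear     = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15),
  Gap       = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39),
  Void      = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)
)

# Columns are chosen by position but enter the formula by name,
# so they are resolved inside the `data` argument rather than in the full df1
fml <- reformulate(names(df1)[8:10], response = names(df1)[7])

# One data frame per grouping combination that actually occurs
subsets <- split(df1, df1[, 1:6], drop = TRUE)

# Fit the same model on each subset
fits <- lapply(subsets, function(d) lm(fml, data = d))
sapply(fits, coef)
```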
I think the solutions to problems 1 and 2 are related and can be solved with a different subsetting function. It would be nice if someone could explain where my mistake is. If possible, I would like to stick to the standard packages, simply because I want to improve my understanding of R. Thanks
EDIT: fixed a minor mistake in the variable assignment.