I've imported one large table from a SQL database with similar structure to this example table
testData <- data.frame(
BatchNo = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3),
Y = c(1,1.247011378,1.340630851,1.319026357,1.41264583,1.093619473,1.38023909,1.473858563,1,1.093619473,1.038888089,1.081833061,1,1.215913383,1.278861891,1.297746443,1.360694952,1.332368123,1.414201183,1,1.081833061,1,1.063661202),
Categorical1 = c("A9","B5513","B5513","B5514","B5514","A9","B5514","B5514","A9","A9","B1723","A9","A9","B5513","B5514","B5513","B5514","B5514","B5514","A9","A9","A486","B1701"),
Categorical2 = c("A2793","B5512","B5512","B5512","B5512","B5508","B6623","B6623","B5508","B5508","B5508","A127","A127","B5515","B5515","B5515","B5515","B6623","B6623","A127","A127","A2727","A2727"),
Categorical3 = c("A5510","B5511","B5511","B5511","B5511","A5510","B5511","B5511","B5511","B5511","B5511","A5518","A5518","B5517","B5517","B5517","B5517","B5517","B5517","B5517","B5517","A2","A2"),
Categorical4 = c("A5","A5","B649","A5","B649","B649","A5","B649","A5","B649","A5","B649","A5","A5","A5","B649","B649","A5","B649","A5","B649","A649","A649"),
Binary1 = c(rep(0,times=23)),
Binary2 = c(rep(0,times=23)),
Binary3 = c(rep(0,times=23)),
Binary4 = c(rep(0,times=23))
)
What I'd like to do in a for loop is to:
1.Create subset data frames based on the BatchNo column (1 to 2500)
2.Fit linear models using each subset data frame
3.Export the list of coefficient estimates back to a SQL table
I've got the following so far for steps 1 & 2:
n<-max(testData[,1])
for (i in 1:n) {
assign(paste("dat"),droplevels(subset(testData,BatchNo == i, select = 1:10)))
assign(paste("lm.", i, sep =
""),lm(Y~Categorical1+Categorical2+Categorical3+Categorical4+Binary1+Binary2+Binary3+Binary4,data=dat))}
The problem is that there will be subsets created where at least one of the 4 Categorical variables (or maybe all of them) will have a single level (like BatchNo = 3 in this example) and R cannot use those in regression.
It is not a problem for the binary predictors as it only results in a N/A
coefficient estimate, and I'll do a step(backward)
to remove any of those after the models have been fitted.
At first I tried to use step(forward)
to select only meaningful predictors in each loop, but that didn't work as I had to list all potential predictors for selection.
I can think of 2 possible solutions:
- Either drop single-level factor columns from "dat" in each loop
- Or create a vector/list of the multi-level factor names each loop and use that somehow in the
lm
formula
I've only got to the point of creating these 2 vectors:
factors<-dat[,3:6]
f<-names(factors)
levels<-c(length(levels(factors[,1])),length(levels(factors[,2])),length(levels(factors[,3])),length(levels(factors[,4])))
So now I just had to drop the nth element from "f" where the nth element of "levels" equals 1.