1

I've imported one large table from a SQL database with similar structure to this example table

testData <- data.frame(
  BatchNo = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3),
  Y = c(1,1.247011378,1.340630851,1.319026357,1.41264583,1.093619473,1.38023909,1.473858563,1,1.093619473,1.038888089,1.081833061,1,1.215913383,1.278861891,1.297746443,1.360694952,1.332368123,1.414201183,1,1.081833061,1,1.063661202),
  Categorical1 = c("A9","B5513","B5513","B5514","B5514","A9","B5514","B5514","A9","A9","B1723","A9","A9","B5513","B5514","B5513","B5514","B5514","B5514","A9","A9","A486","B1701"),
  Categorical2 = c("A2793","B5512","B5512","B5512","B5512","B5508","B6623","B6623","B5508","B5508","B5508","A127","A127","B5515","B5515","B5515","B5515","B6623","B6623","A127","A127","A2727","A2727"),
  Categorical3 = c("A5510","B5511","B5511","B5511","B5511","A5510","B5511","B5511","B5511","B5511","B5511","A5518","A5518","B5517","B5517","B5517","B5517","B5517","B5517","B5517","B5517","A2","A2"),
  Categorical4 = c("A5","A5","B649","A5","B649","B649","A5","B649","A5","B649","A5","B649","A5","A5","A5","B649","B649","A5","B649","A5","B649","A649","A649"),
  Binary1 = c(rep(0,times=23)),
  Binary2 = c(rep(0,times=23)),
  Binary3 = c(rep(0,times=23)),
  Binary4 = c(rep(0,times=23))
  )

What I'd like to do in a for loop is to:

1.Create subset data frames based on the BatchNo column (1 to 2500)
2.Fit linear models using each subset data frame
3.Export the list of coefficient estimates back to a SQL table

I've got the following so far for steps 1 & 2:

n<-max(testData[,1])
for (i in 1:n) {
assign(paste("dat"),droplevels(subset(testData,BatchNo == i, select = 1:10)))
assign(paste("lm.", i, sep =
""),lm(Y~Categorical1+Categorical2+Categorical3+Categorical4+Binary1+Binary2+Binary3+Binary4,data=dat))}

The problem is that there will be subsets created where at least one of the 4 Categorical variables (or maybe all of them) will have a single level (like BatchNo = 3 in this example) and R cannot use those in regression. It is not a problem for the binary predictors as it only results in a N/A coefficient estimate, and I'll do a step(backward) to remove any of those after the models have been fitted.

At first I tried to use step(forward) to select only meaningful predictors in each loop, but that didn't work as I had to list all potential predictors for selection.

I can think of 2 possible solutions:

  • Either drop single-level factor columns from "dat" in each loop
  • Or create a vector/list of the multi-level factor names each loop and use that somehow in the lm formula

I've only got to the point of creating these 2 vectors:

factors<-dat[,3:6]
f<-names(factors)
levels<-c(length(levels(factors[,1])),length(levels(factors[,2])),length(levels(factors[,3])),length(levels(factors[,4])))

So now I just had to drop the nth element from "f" where the nth element of "levels" equals 1.

tattila23
  • 41
  • 4
  • Please do not put data as a picture... [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754 Nov 27 '15 at 11:15
  • For steps 1 - use `split()` based on BatchNo, the result is a list, push this list into `lapply()`, step 2 - use `droplevels()` and `lm()` within `lapply()`. – zx8754 Nov 27 '15 at 12:26
  • Hi @zx8745, thanks for you comment. However, isn't it for replacing the **for loop** rather than my question here? – tattila23 Nov 30 '15 at 16:00
  • Share your solution as an answer below, it will make your post clearer, and maybe we can offer improvement. – zx8754 Nov 30 '15 at 16:04

2 Answers2

1

Eventually I've been able to find a way to do what I intended to. There might be a simpler/more elegant way, but I've used:

  l<-nrow(dat)
  a<-length(levels(dat[,3]))
  b<-length(levels(dat[,4]))
  c<-length(levels(dat[,5]))
  d<-length(levels(dat[,6]))
  zeros<-c(rep(0,times=l))
  if (a<2) dat[,2]<-zeros
  if (b<2) dat[,3]<-zeros
  if (c<2) dat[,4]<-zeros
  if (d<2) dat[,5]<-zeros

The single-level factors are replaced with an appropriate length of vectors containing zeros each loop, hence the regressions can be run without getting an error.

tattila23
  • 41
  • 4
1

Try this:

do.call(rbind,
        lapply(split(testData, testData$BatchNo), function(i){
          #check if factor columns have more than 1 level
          cats <- colnames(i)[c(3:6)[sapply(i[, c(3:6)], function(j){length(unique(j))}) > 1]]
          cats <- paste(cats, collapse = "+")
          fit <- lm(as.formula(paste0("Y~", cats, "+Binary2+Binary3+Binary4")), data = i)
          #return coef as df
          as.data.frame(coef(fit))
          })
        )

Output

#                         coef(fit)
# 1.(Intercept)        1.000000e+00
# 1.Categorical1B1723  3.888809e-02
# 1.Categorical1B5513  3.082241e-01
# 1.Categorical1B5514  3.802391e-01
# 1.Categorical2B5508  5.611389e-16
# 1.Categorical2B5512 -6.121273e-02
# 1.Categorical2B6623            NA
# 1.Categorical3B5511  1.699675e-17
# 1.Categorical4B649   9.361947e-02
# 1.Binary2                      NA
# 1.Binary3                      NA
# 1.Binary4                      NA
# 2.(Intercept)        1.000000e+00
# 2.Categorical1B5513  2.694196e-01
# 2.Categorical1B5514  3.323681e-01
# 2.Categorical2B5515 -5.350623e-02
# 2.Categorical2B6623            NA
# 2.Categorical3B5517  3.289161e-18
# 2.Categorical4B649   8.183306e-02
# 2.Binary2                      NA
# 2.Binary3                      NA
# 2.Binary4                      NA
# 3.(Intercept)        1.000000e+00
# 3.Categorical1B1701  6.366120e-02
# 3.Binary2                      NA
# 3.Binary3                      NA
# 3.Binary4                      NA
zx8754
  • 52,746
  • 12
  • 114
  • 209