Error in R Subsetting data frame and then using sapply

Question

I'm attempting to run a regression (lm) over groups of data (counties) in a data frame. However, I first want to filter that data frame (dat) to exclude some groups with too few data points. Everything works fine as long as I don't subset the data frame first:

tmp1 <- with(dat, 
    by(dat, County,
        function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)

I get back, as expected:

 Barrow Carroll Cherokee Clayton    Cobb  Dekalb Douglas
0.00000     NaN  0.61952 0.69591 0.48092 0.61292 0.39335

However, when I first subset the data frame:

dat.counties <- aggregate(dat[,"County"], by=list(County), FUN=length)
good.counties <- as.matrix(subset(dat.counties, x > 20, select=Group.1))
dat.temp <- dat["County" %in% good.counties,]

and then run the same code:

tmp2 <- with(dat, 
by(dat, County,
    function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)

I get the following error: "$ operator is invalid for atomic vectors". If I then run summary(tmp2), I see the following:

         Length Class  Mode
Barrow    0     -none- NULL
Carroll   0     -none- NULL
Cherokee 12     lm     list
Clayton  12     lm     list

The sapply is obviously bombing out on the Class -none- objects. But those are specifically the ones I had excluded above! How are they still showing up in my new data frame?!

Thank you for any enlightenment.

Please make your question **[reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)**. It is much easier to help when you do that. Note that you're using the wrong `dat` in your second code piece (should be `dat.temp`), but i don't think that is the problem. — BrodieG, Dec 16 '14 at 00:26

akrun · Accepted Answer · 2014-12-16T04:20:26.367

Some parts of the code are not clear. Maybe you attached the dataset. There is also the problem of using the wrong dat instead of dat.temp, as commented by @BrodieG. Regarding the error, it could be because the column County is a factor and the unused levels were not dropped after subsetting. You could try:

dat.temp1 <- droplevels(dat.temp)
tmp2 <- with(dat.temp1, 
      by(dat.temp1, County,
      function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
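To illustrate why excluded groups can still show up, here is a minimal sketch with a made-up factor (the names f and f.sub are hypothetical, not from the question). Subsetting a factor removes values but keeps all of the original levels, and by() returns one entry per level, including the empty ones.

```r
# Hypothetical toy factor, not the asker's data
f <- factor(c("a", "a", "b", "c", "c"))

# Subsetting removes the "b" values but not the "b" level
f.sub <- f[f != "b"]
levels(f.sub)              # "a" "b" "c"
table(f.sub)               # "b" is still listed, with 0 observations

# droplevels() discards levels that no longer occur
levels(droplevels(f.sub))  # "a" "c"
```

The same thing happens to a factor column inside a data frame, which is why droplevels() on the subsetted data frame is worth trying here.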

Here is an example that reproduces the error:

set.seed(24)
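d <- data.frame(
 state = rep(c('NY', 'CA','MD', 'ND'), c(10,10,6,7)),
 year = sample(1:10,33,replace=TRUE),
 response= rnorm(33),
 stringsAsFactors = TRUE  # needed in R >= 4.0.0, where strings are no longer converted to factors by default
)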

 tmp1 <- with(d, by(d, state, function(x) lm(formula=response~year, data=x)))
 sapply(tmp1, function(x) summary(x)$adj.r.squared)
 #       CA          MD          ND          NY 
 # 0.03701114 -0.04988296 -0.07817515 -0.11850038 

d.states <- aggregate(d[,"state"], by=list(d[,'state']), FUN=length)
good.states <- as.matrix(subset(d.states, x > 6, select=Group.1))
d.sub <-  d[d$state %in% good.states[,1],]

tmp2 <- with(d.sub, 
    by(d.sub, state,
      function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#Error in summary(x)$adj.r.squared : 
# $ operator is invalid for atomic vectors

If you look at the second element of tmp2, the result for the unused level MD is NULL:

 tmp2[2]
 #$MD
 #NULL

Dropping the unused levels fixes it:

d.sub1 <- droplevels(d.sub)
tmp2 <- with(d.sub1, 
      by(d.sub1, state,
          function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#       CA          ND          NY 
# 0.03701114 -0.07817515 -0.11850038
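As a possible alternative (a sketch only, not part of the original answer), table() can count the rows per group, and calling factor() on the subsetted column rebuilds it with only the levels that remain. The data frame d2, the object names, and the threshold of 6 are illustrative assumptions that mirror the example above.

```r
# Illustrative data shaped like the example above (values are random)
set.seed(24)
d2 <- data.frame(
  state    = rep(c("NY", "CA", "MD", "ND"), c(10, 10, 6, 7)),
  year     = sample(1:10, 33, replace = TRUE),
  response = rnorm(33),
  stringsAsFactors = TRUE
)

# Count rows per state and keep the states with more than 6 rows
counts      <- table(d2$state)
good.states <- names(counts)[counts > 6]

# Subset on the column values, then rebuild the factor so that
# unused levels disappear
d2.sub       <- d2[d2$state %in% good.states, ]
d2.sub$state <- factor(d2.sub$state)

# One model per remaining state; no NULL entries are produced
fits <- by(d2.sub, d2.sub$state,
           function(x) lm(response ~ year, data = x))
sapply(fits, function(x) summary(x)$adj.r.squared)
```

Note that the row filter compares the column values (d2$state %in% good.states). An expression such as "County" %in% good.counties tests the single literal string "County" and returns one TRUE or FALSE, so it does not select rows by county.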

thank you akrun. droplevels did the trick. I had indeed attached the dat dataset previously, but then i had detached it, and thought that would have reverted everything back. And yes, sorry about the wrong "dat.temp" as @BrodieG pointed out. — Michael Ludwig, Dec 16 '14 at 14:57
