I'm attempting to run a regression (lm) over groups of data (counties) in a data frame. However, I first want to filter that data frame (dat) to exclude groups with too few data points. Everything works fine as long as I don't subset the data frame first:

tmp1 <- with(dat, 
    by(dat, County,
        function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp1, function(x) summary(x)$adj.r.squared)

I get back, as expected:

 Barrow Carroll Cherokee Clayton    Cobb  Dekalb Douglas
0.00000     NaN  0.61952 0.69591 0.48092 0.61292 0.39335

However, when I first subset the data frame:

dat.counties <- aggregate(dat[,"County"], by=list(County), FUN=length)
good.counties <- as.matrix(subset(dat.counties, x > 20, select=Group.1))
dat.temp <- dat["County" %in% good.counties,]

and then run the same code:

tmp2 <- with(dat, 
by(dat, County,
    function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)

I get the following error: "$ operator is invalid for atomic vectors". If I then run summary(tmp2), I see the following:

         Length Class  Mode
Barrow    0     -none- NULL
Carroll   0     -none- NULL
Cherokee 12     lm     list
Clayton  12     lm     list

The sapply is obviously bombing out on the Class -none- objects. But those are exactly the ones I had excluded above! How are they still showing up in my new data frame?

Thank you for any enlightenment.

    Please make your question **[reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)**. It is much easier to help when you do that. Note that you're using the wrong `dat` in your second code piece (it should be `dat.temp`), but I don't think that is the problem. – BrodieG Dec 16 '14 at 00:26

1 Answer

Some parts of the code are not clear. Maybe you had attached the dataset. Also, as @BrodieG commented, the second code block uses dat instead of dat.temp. As for the error, it could be because the column County is a factor and its unused levels were not dropped after subsetting. You could try:

dat.temp1 <- droplevels(dat.temp)
tmp2 <- with(dat.temp1, 
      by(dat.temp1, County,
      function(x) lm(formula = Y ~ A + B + C, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
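
To see why the levels matter, here is a minimal sketch with a made-up factor (the names f, counts and f.sub are hypothetical, not from the question). Subsetting a factor keeps all of its original levels, and by() creates one group per level, including the empty ones:

```r
# Hypothetical toy factor; built with factor() so it behaves the same in any R version
f <- factor(c("a", "a", "b", "c", "c", "c"))

# Keep only the groups with more than one observation
counts <- table(f)
f.sub <- f[f %in% names(counts)[counts > 1]]

levels(f.sub)              # "b" is still listed as a level
table(f.sub)               # and it shows up with a count of zero
levels(droplevels(f.sub))  # after droplevels() only "a" and "c" remain
```

The empty "b" group is what would become a NULL entry in the by() result.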

Here is an example that reproduces the error:

set.seed(24)
d <- data.frame(
 state = rep(c('NY', 'CA','MD', 'ND'), c(10,10,6,7)),
 year = sample(1:10,33,replace=TRUE),
 response= rnorm(33),
 stringsAsFactors = TRUE  # R >= 4.0 no longer converts strings to factors by default
)

 tmp1 <- with(d, by(d, state, function(x) lm(formula=response~year, data=x)))
 sapply(tmp1, function(x) summary(x)$adj.r.squared)
 #       CA          MD          ND          NY 
 # 0.03701114 -0.04988296 -0.07817515 -0.11850038 

d.states <- aggregate(d[,"state"], by=list(d[,'state']), FUN=length)
good.states <- as.matrix(subset(d.states, x > 6, select=Group.1))
d.sub <-  d[d$state %in% good.states[,1],]

tmp2 <- with(d.sub, 
    by(d.sub, state,
      function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#Error in summary(x)$adj.r.squared : 
# $ operator is invalid for atomic vectors

If you look at the second element, the result for MD is NULL rather than a fitted model:

 tmp2[2]
 #$MD
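 #NULL

Although the MD rows were removed, MD is still a level of state, so by() creates an empty group for it. Dropping the unused levels fixes this:
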
d.sub1 <- droplevels(d.sub)
tmp2 <- with(d.sub1, 
      by(d.sub1, state,
          function(x) lm(formula = response~year, data=x)))
sapply(tmp2, function(x) summary(x)$adj.r.squared)
#       CA          ND          NY 
# 0.03701114 -0.07817515 -0.11850038 
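
If calling droplevels() is not convenient, two alternatives might also work. The sketch below uses a small made-up data frame (toy, toy.sub and the other names are hypothetical) and is meant as an illustration under those assumptions, not a drop-in replacement for the code above:

```r
# Hypothetical data; state is an explicit factor so the unused level survives subsetting
set.seed(1)
toy <- data.frame(
  state    = factor(rep(c("CA", "MD", "NY"), c(8, 3, 8))),
  year     = rep_len(1:10, 19),
  response = rnorm(19)
)
toy.sub <- toy[toy$state != "MD", ]  # MD rows removed, MD level still present

# Option 1: split() with drop = TRUE never creates the empty MD group
fits <- lapply(split(toy.sub, toy.sub$state, drop = TRUE),
               function(x) lm(response ~ year, data = x))
r2.split <- sapply(fits, function(x) summary(x)$adj.r.squared)

# Option 2: keep by(), but discard the NULL entries before summarising
tmp <- with(toy.sub, by(toy.sub, state, function(x) lm(response ~ year, data = x)))
r2.by <- sapply(tmp[!sapply(tmp, is.null)], function(x) summary(x)$adj.r.squared)
```

Both options should give the same adjusted R-squared values for the remaining groups; droplevels() is still the simplest fix when the unused levels are not needed later.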
akrun
    Thank you, akrun. droplevels did the trick. I had indeed attached the dat dataset previously, but then I had detached it and thought that would revert everything. And yes, sorry about the wrong "dat.temp", as @BrodieG pointed out. – Michael Ludwig Dec 16 '14 at 14:57