If I have a data set containing 4137 observations and I want to run a regression of colgpa on hsperc and sat using only the first 2070 observations, how do I do it?

I have tried something like:

# loading data
GPA2 <- read.table("GPA2.raw", header=TRUE, na.strings=".")

# fitting model 
mfit1 <- lm( formula = colgpa ~ hsperc + sat, data=GPA2, subset=(rownum<2071) )

But the subset using rownum fails. Any suggestions?

I don't have a variable that counts the number of rows. Should I have one? If so, how do I create it?

user1626092
  • You could use `data = GPA2[1:2070, ]` and leave out the `subset` argument. – mark999 Oct 18 '12 at 09:29
  • I would agree with the above. It's best to avoid subset if possible. In Hadley's online materials, he has examples and explanations as to why. – chandler Oct 18 '12 at 09:43

1 Answer

A simple, reproducible example:

dat = data.frame(A = runif(100), B = runif(100))
lm(A~B, dat)

This fails, as you found out:

> lm(A~B, dat, subset = (rownum < 50))
Error in eval(expr, envir, enclos) : object 'rownum' not found

That is because there is no rownum column in your data. There are two solutions:

  1. Add a rownum column, after which the `subset` call above works:

    dat[["rownum"]] = 1:nrow(dat)
    
  2. Or perform the subset operation before the analysis (here the first 49 rows, matching `rownum < 50`; for your data it would be `GPA2[1:2070, ]`):

    dat_subset = dat[1:49, ]
    lm(A~B, dat_subset)
    

As the commenters mentioned, going for option 2 is probably best.
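
To check that the two options give the same fit, here is a minimal sketch on a 100-row toy data frame like the one above, using a cutoff of 49 rows rather than 2070. The `set.seed` call and the names `fit_subset` and `fit_index` are my additions for illustration only.

```r
set.seed(1)  # assumption: seed added only so the example is reproducible
dat <- data.frame(A = runif(100), B = runif(100))

# Option 1: add an explicit row-number column, then use `subset`
dat[["rownum"]] <- seq_len(nrow(dat))
fit_subset <- lm(A ~ B, data = dat, subset = rownum < 50)

# Option 2: index the rows before fitting
fit_index <- lm(A ~ B, data = dat[1:49, ])

# Both fits use the same 49 rows, so the coefficients should agree
all.equal(coef(fit_subset), coef(fit_index))
```

For the data in the question, the equivalent of option 2 would be `lm(colgpa ~ hsperc + sat, data = GPA2[1:2070, ])`.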

Paul Hiemstra
  • Wow, thanks for all the answers! I will try your suggestions! I don't understand what is so "dangerous" about using subset? – user1626092 Oct 18 '12 at 10:46
  • Using `head(dat, 2070)` would be another good solution. Later, if the number of rows to use comes as a variable, `head(dat, x)` will do the "right" thing if `x > nrow(dat)`: it will return the full `dat`, while `dat[1:x,]` will error out. – flodel Oct 18 '12 at 11:50
  • ... or rather, it would fill the bottom of `dat` with `NA`s (another bad outcome) – flodel Oct 18 '12 at 11:58
  • Why subset is dangerous is explained in great detail in this [post by Hadley](https://github.com/hadley/devtools/wiki/Evaluation) and summarized in this [SO question](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset) – flodel Oct 18 '12 at 11:59
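
The difference between `head(dat, x)` and `dat[1:x, ]` discussed in the comments can be seen in a small sketch. The value `x <- 150` is a hypothetical row count chosen to exceed `nrow(dat)`.

```r
dat <- data.frame(A = runif(100), B = runif(100))
x <- 150  # hypothetical: asks for more rows than dat has

# head() stops at the last available row
nrow(head(dat, x))          # 100

# Indexing past the end pads the result with rows of NA
nrow(dat[1:x, ])            # 150
sum(is.na(dat[1:x, "A"]))   # 50
```

With R's default `na.action` setting (`na.omit`), `lm()` drops the padded `NA` rows, so the fit would silently use fewer rows than requested.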