How to use subset to get only the first xx observations from the data set in R?

Question

If I have a data set containing 4137 observations and I want to regress colgpa on hsperc and sat using only the first 2070 observations, how do I do that?

I have tried something like:

# loading data
GPA2 <- read.table("GPA2.raw", header=TRUE, na.strings=".")

# fitting model 
mfit1 <- lm( formula = colgpa ~ hsperc + sat, data=GPA2, subset=(rownum<2071) )

But the subset using rownum fails. Any suggestions??

I don't have a variable that counts the rows. Should I have one? If so, how do I create it?

You could use `data = GPA2[1:2070, ]` and leave out the `subset` argument. — mark999, Oct 18 '12 at 09:29
I would agree with the above. It's best to avoid subset if possible. In Hadley's online materials, he has examples and explanations as to why. — chandler, Oct 18 '12 at 09:43

score 5 · Accepted Answer · answered Oct 18 '12 at 10:09

A simple, reproducible example:

dat = data.frame(A = runif(100), B = runif(100))
lm(A~B, dat)

This fails, as you found out:

> lm(A~B, dat, subset = (rownum < 50))
Error in eval(expr, envir, enclos) : object 'rownum' not found

That is because there is no `rownum` column in your data. There are two solutions:

Add a rownum column:
```
dat[["rownum"]] = 1:nrow(dat)
```
Or perform the subset operation before the analysis:
```
dat_subset = dat[1:49, ]  # dat has only 100 rows; for the question's data use GPA2[1:2070, ]
lm(A~B, dat_subset)
```

As the commenters mentioned, going for option 2 is probably best.
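
To check that the two options agree, here is a minimal sketch on simulated data like `dat` above (not the GPA2 data). The seed and the cutoff `n` are arbitrary choices made for illustration, and `fit1` and `fit2` are names introduced here.

```r
set.seed(1)  # arbitrary seed so the simulated data are repeatable
dat <- data.frame(A = runif(100), B = runif(100))
n <- 49      # illustrative cutoff: keep only the first n rows

# Option 1: add a row counter so that `subset` has a column to refer to
dat[["rownum"]] <- seq_len(nrow(dat))
fit1 <- lm(A ~ B, data = dat, subset = rownum <= n)

# Option 2: subset the data frame before fitting
fit2 <- lm(A ~ B, data = dat[seq_len(n), ])

# Both fits use the same rows, so the coefficients should match
all.equal(coef(fit1), coef(fit2))
```

For the data in the question, the second option would presumably read `lm(colgpa ~ hsperc + sat, data = GPA2[1:2070, ])`.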

Paul Hiemstra

Wow, thanks for all the answers! I will try to use your suggestions!! I don't understand what is so "dangerous" with using subset? – user1626092 Oct 18 '12 at 10:46
Using `head(dat, 2070)` would be another good solution. Later, if the number of rows to use comes as a variable, `head(dat, x)` will do the "right" thing if `x > nrow(dat)`: it will return the full `dat`, while `dat[1:x,]` will error out. – flodel Oct 18 '12 at 11:50
... or rather, it would fill the bottom of `dat` with `NA`s (another bad outcome) – flodel Oct 18 '12 at 11:58
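
The difference between `head()` and `[1:x, ]` described in the two comments above can be seen in a small sketch. The value of `x` is a made-up request for more rows than the data frame has, and `first_rows` and `padded` are names introduced here.

```r
dat <- data.frame(A = runif(100), B = runif(100))
x <- 150  # hypothetical: asks for more rows than dat contains

first_rows <- head(dat, x)  # stops at the last available row
padded <- dat[1:x, ]        # pads the result with rows of NA

nrow(first_rows)      # 100
nrow(padded)          # 150
sum(is.na(padded$A))  # 50 padded rows
```

Because `lm()` drops rows with missing values by default, the padded version can go unnoticed in a regression.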
Why subset is dangerous is explained in great detail in this [post by Hadley](https://github.com/hadley/devtools/wiki/Evaluation) and summarized in this [SO question](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset) – flodel Oct 18 '12 at 11:59
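
One hazard of this kind can be sketched as follows. It is an illustration only, not a summary of the linked posts. The `subset` expression is looked up in the data first and then in the environment of the formula, so a stray `rownum` vector in the workspace (a hypothetical leftover here) is used silently instead of raising the "object not found" error.

```r
dat <- data.frame(A = runif(100), B = runif(100))
rownum <- 1:100  # hypothetical leftover vector in the workspace, NOT a column of dat

# No error: `rownum` is not found in dat, so the workspace vector is used
fit <- lm(A ~ B, data = dat, subset = rownum < 50)
nobs(fit)  # 49
```

Here the result happens to be what was intended, but if the workspace `rownum` were unrelated to `dat`, the model would be fitted on the wrong rows without any warning.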
