Subset panel data by group

Question

I would like to subset an unbalanced panel data set by group. For each group, I would like to keep the two observations in the first and the last years.

How do I best do this in R? For example:

dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), 
                 year=c(2001:2003,2000,2002,2000:2001,2003))

> dt
  name year
1    A 2001
2    A 2002
3    A 2003
4    B 2000
5    B 2002
6    C 2000
7    C 2001
8    C 2003

What I would like to have:

  name year
1    A 2001
3    A 2003
4    B 2000
5    B 2002
6    C 2000
8    C 2003

bill-felix · Answer 1 · 2015-03-31T19:55:27.760

2

dplyr should help. check out first() & last() to get the values you are looking for and then filter based on those values.

dt <- data.frame(name= rep(c("A", "B", "C"), c(3,2,3)), year=c(2001:2003,2000,2002,2000:2001,2003))

library(dplyr)

dt %>%
  group_by(name) %>%
  mutate(first = first(year)
        ,last = last(year)) %>%
  filter(year == first | year == last) %>%
  select(name, year)

  name year
1    A 2001
2    A 2003
3    B 2000
4    B 2002
5    C 2000
6    C 2003

*your example to didn't mention any specific order but it that case, arrange() will help

edited Mar 31 '15 at 19:55

answered Mar 31 '15 at 19:46

bill-felix

173
1
7

Thanks. But it seems dplyr is not available anymore. – Gary Mar 31 '15 at 19:50
1

@Gary: ??? http://cran.r-project.org/web/packages/dplyr/index.html Much more likely that you have an old version of R that's not seeing this package ... – Ben Bolker Mar 31 '15 at 20:54
1

@BenBolker my guess is that OP doesn't know he can install external packages, e.g. using `install.packages("dplyr")` – David Arenburg Apr 01 '15 at 07:53

score 2 · Answer 2 · answered Mar 31 '15 at 22:17

Here's a quick possible data.table solution

library(data.table)
setDT(dt)[, .SD[c(1L, .N)], by = name]
#    name year
# 1:    A 2001
# 2:    A 2003
# 3:    B 2000
# 4:    B 2002
# 5:    C 2000
# 6:    C 2003

Or if you only have two columns

dt[, year[c(1L, .N)], by = name]

Thomas · Answer 3 · 2015-04-01T05:56:13.730

1

This is pretty simple using by to split the data.frame by group and then return the head and tail of each group.

> do.call(rbind, by(dt, dt$name, function(x) rbind(head(x,1),tail(x,1))))
    name year
A.1    A 2001
A.3    A 2003
B.4    B 2000
B.5    B 2002
C.6    C 2000
C.8    C 2003

head and tail are convenient, but slow, so a slightly different alternative would probably be faster on a large data.frame:

do.call(rbind, by(dt, dt$name, function(x) x[c(1,nrow(x)),]))

edited Apr 01 '15 at 05:56

answered Mar 31 '15 at 19:46

Thomas

43,637
12
109
140

1

`dt[as.logical(ave(dt$year,dt$name,FUN=function(x) seq_along(x) %in% c(1,length(x)))),]` as an alternative using `ave` – thelatemail Apr 01 '15 at 00:09

Subset panel data by group

3 Answers3

Linked

Related