I have a data frame of vegetation metrics collected at x units and y sampling stations (multiple stations within each unit) over multiple years. For each unit, I want to select all the vegetation data from the most recent year in which data were collected. Here is an example of my data frame:

veg <- c("tree","grass","tree","grass","tree","grass","tree","grass")
cover <- c(0.97,0.21,0.35,0.67,0.45,0.72,0.27,0.67)
unit <- c("U1","U1","U1","U1","U2","U2","U2","U2")
station <- c("A1","A1","A2","A2","A3","A3","A4","A4")
year <- c(2015,2015,2014,2014,2013,2013,2014,2014)
df <- data.frame(veg,cover,unit,station,year)

The data frame looks like this:

    veg cover unit station year
1  tree  0.97   U1      A1 2015
2 grass  0.21   U1      A1 2015
3  tree  0.35   U1      A2 2014
4 grass  0.67   U1      A2 2014
5  tree  0.45   U2      A3 2013
6 grass  0.72   U2      A3 2013
7  tree  0.27   U2      A4 2014
8 grass  0.67   U2      A4 2014

I want it to look like this:

    veg cover unit station year
1  tree  0.97   U1      A1 2015
2 grass  0.21   U1      A1 2015
3  tree  0.27   U2      A4 2014
4 grass  0.67   U2      A4 2014

Any help would be much appreciated.

Jaap
omwrichmond

2 Answers

This gets your answer. You want the most recent year for each veg/unit combination, right?

library(dplyr)

# Sort newest year first, then keep the first row of each veg/unit group
df %>%
    group_by(veg, unit) %>%
    arrange(desc(year)) %>%
    slice(1)
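
Note that grouping by veg and unit keeps exactly one row per veg/unit pair, so if a unit had several stations sampled in its most recent year, only one of them would survive. A minimal sketch of an alternative, assuming the goal is to keep every row from each unit's most recent year: group by unit only and filter on the maximum year. The name recent is just illustrative, and the data frame is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  veg     = c("tree","grass","tree","grass","tree","grass","tree","grass"),
  cover   = c(0.97,0.21,0.35,0.67,0.45,0.72,0.27,0.67),
  unit    = c("U1","U1","U1","U1","U2","U2","U2","U2"),
  station = c("A1","A1","A2","A2","A3","A3","A4","A4"),
  year    = c(2015,2015,2014,2014,2013,2013,2014,2014)
)

# Within each unit, keep every row whose year equals that unit's latest year
recent <- df %>%
  group_by(unit) %>%
  filter(year == max(year)) %>%
  ungroup()

recent
```

On the example data this keeps the A1 rows for U1 (2015) and the A4 rows for U2 (2014), in their original order.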

Here is how to do it without any packages.

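# Split df by unit and keep only the rows from each unit's most recent year
df.by     = by(df, df$unit, FUN = function(t) t[t$year == max(t$year),])
# Merge the per-unit data frames back into a single data frame
df.recent = Reduce(function(...) merge(..., all = TRUE), df.by)
df.recent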

The output is:

> df.recent
    veg cover unit station year
1 grass  0.21   U1      A1 2015
2 grass  0.67   U2      A4 2014
3  tree  0.27   U2      A4 2014
4  tree  0.97   U1      A1 2015

In the first line, we use the function by to split the data frame by the values of df$unit. For each subset (that is, for each unit), the anonymous function function(t) t[t$year == max(t$year),] extracts the rows from the most recent year.

df.by is a list of data frames, one per unit, each containing only the rows from that unit's most recent year.

In the second line, we use the merge function to combine all the data frames in df.by. Because merge sorts on the shared columns, the rows come back in a different order than in df. This idiom is explained in "Simultaneously merge multiple data.frames in a list".
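
As a possible shortcut, the same selection can be sketched in a single subsetting step with base R's ave, which returns the per-unit maximum year aligned with the original rows. This is an alternative to the by/merge approach above rather than a restatement of it, and the name recent is only illustrative.

```r
# Example data from the question
df <- data.frame(
  veg     = c("tree","grass","tree","grass","tree","grass","tree","grass"),
  cover   = c(0.97,0.21,0.35,0.67,0.45,0.72,0.27,0.67),
  unit    = c("U1","U1","U1","U1","U2","U2","U2","U2"),
  station = c("A1","A1","A2","A2","A3","A3","A4","A4"),
  year    = c(2015,2015,2014,2014,2013,2013,2014,2014)
)

# ave() repeats each unit's maximum year on every row of that unit,
# so the comparison marks the rows that belong to the latest year
latest <- ave(df$year, df$unit, FUN = max)
recent <- df[df$year == latest, ]

recent
```

This version keeps the original row order and row names, which the merge-based result does not.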

Po C.