Aggregating by fixed date range R

Question

Given a simplification of my dataset like:

df <- data.frame("ID"= c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
                  "ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
                                   "pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
                  "Date"= c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                     "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                     "1989.03.01","1989.05.01","1989.07.01","1989.08.01",
                     "1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                     "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                     "1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
                    "NDVI"= c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
                     0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
                      check.names = FALSE, stringsAsFactors = FALSE)

I would like to obtain the means of NDVI values by a certain period of time, in this case by year. Take into account that in my real dataset I would need it for seasons, so it should be adaptable.

These means should consider:

Trimming outliers: for example 25% of the highest values and 25% of the lowest values.
They should be by class, in this case by the ID field.

So the output should look something like:

> desired_df
  ID ForestType Date meanNDVI
1  1        oak 1987    0.250
2  1        oak 1988    0.325
3  1        oak 1989    0.430
4  2       pine 1987    0.350
5  2       pine 1988    0.635
6  2       pine 1989    0.740

In this case, for example, 0.250 corresponds to mean NDVI on 1987 of ID=1 and it is the mean of the 4 values of that year taking the lowest and the highest out.

Thanks a lot!

This would normally be referred to as aggregating, not taking a moving average. — G. Grothendieck, Mar 20 '20 at 12:27

score 1 · Accepted Answer · answered Mar 20 '20 at 11:12

1

library(tidyverse)
library(lubridate)

df %>%
  mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
  group_by(ID, ForestType, Year = year(Date)) %>%
  filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
  summarise(meanNDVI = mean(NDVI))

Output

# A tibble: 6 x 4
# Groups:   ID, ForestType [2]
     ID ForestType  Year meanNDVI
  <dbl> <chr>      <dbl>    <dbl>
1     1 oak         1987    0.25 
2     1 oak         1988    0.325
3     1 oak         1989    0.475
4     2 pine        1987    0.35 
5     2 pine        1988    0.635
6     2 pine        1989    0.74

answered Mar 20 '20 at 11:12

Ben

28,684
5
23
45

Thanks Ben! That works also fantastically good in my database. How would you adapt the code to group by season instead of year? So something like: 1987_1, 1987_2, 1987_3, 1987_4, 1988_1... I know it's hard to implement in the code I posted. Instead of `Year` something like `year_season`. Thanks again! – Oriol Baena Crespo Mar 20 '20 at 11:40
If you have a column for `year_season` you can then `group_by(ID, ForestType, year_season)` Do you already have a column for `year_season` in place? – Ben Mar 20 '20 at 12:08
Not yet but I guess it is easy to create from the date column! – Oriol Baena Crespo Mar 20 '20 at 12:18
Yes, there are a number of approaches in other posts such as: https://stackoverflow.com/questions/36502140/determine-season-from-date-using-lubridate-in-r ... https://stackoverflow.com/questions/9500114/find-which-season-a-particular-date-belongs-to ... some depend on locale, definition, etc... – Ben Mar 20 '20 at 12:33

score 1 · Answer 2 · answered Mar 20 '20 at 11:22

The classical base R approach using aggregate. The year can be obtained using substr.

res <- with(df, aggregate(list(meanNDVI=NDVI), 
                          by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
                          FUN=mean))
res[order(res$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987   0.2875
# 3  1        oak 1988   0.3575
# 5  1        oak 1989   0.5150
# 2  2       pine 1987   0.3575
# 4  2       pine 1988   0.6425
# 6  2       pine 1989   0.7725

Trimmed version

Trimmed for 25% outlyers.

res2 <- with(df, aggregate(list(meanNDVI=NDVI), 
                           by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
                           FUN=mean, trim=.25))
res2[order(res2$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987    0.250
# 3  1        oak 1988    0.325
# 5  1        oak 1989    0.475
# 2  2       pine 1987    0.350
# 4  2       pine 1988    0.635
# 6  2       pine 1989    0.740

score 1 · Answer 3 · answered Mar 20 '20 at 11:43

Using data.table package, you could proceed as follows:

library(data.table)

setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]

#       ID ForestType   year meanNDVI
# 1:     1        oak   1987    0.250
# 2:     1        oak   1988    0.325
# 3:     1        oak   1989    0.475
# 4:     2       pine   1987    0.350
# 5:     2       pine   1988    0.635
# 6:     2       pine   1989    0.740

score 1 · Answer 4 · answered Mar 20 '20 at 11:54

1

Another option. You can set trim in mean

library(tidyverse)
library(lubridate)
df %>% 
  mutate(Date = ymd(Date) %>% year()) %>% 
  group_by(ID, ForestType, Date) %>% 
  summarise(mean = mean(NDVI, trim = 0.25, na.rm = T))

answered Mar 20 '20 at 11:54

Yuriy Saraykin

8,390
1
7
14

Aggregating by fixed date range R

4 Answers4