Filter first date per year with several columns

Question

I've been looking for a while without finding an answer, so I'll try here:

I have a column of data listing the dates on which an animal was first observed: 2022-05-03, 2022-05-01, 2022-04-23, 2021-05-04, 2021-02-31, 2020-01-30, 2020-05-20 and so on.

I want to find the first observation per year using the filter() function. What should that look like? Is the lubridate package something to apply?

Thanks in advance.

Please, provide a minimal reproducible example: [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — PaulS, Jul 14 '22 at 20:02
Consider sth like `x <- c('2022-05-03', '2022-05-01', '2022-04-23', '2021-05-04', '2021-02-31', '2020-01-30', '2020-05-20'); x[which.min(as.Date(x))]` — jay.sf, Jul 14 '22 at 20:10
Thanks @jay.sf. If my data is named "Animal_data" and the column "date", what would that code look like? I'm a beginner, so sorry for asking. — Joseph Carthof, Jul 14 '22 at 20:11
@JosephCarthof I elaborate on that in [my answer below](https://stackoverflow.com/a/72986330/6574038). — jay.sf, Jul 14 '22 at 20:38

score 0 · Answer 1 · answered Jul 14 '22 at 20:12

You can try:

library(dplyr)
library(lubridate)
df = tibble(date = as.Date(c("2022-05-03", "2022-05-01", "2022-04-23", "2021-05-04", "2021-02-28", "2020-01-30", "2020-05-20")))

Then, to get the first date by year:

df %>% mutate(year = year(date)) %>% arrange(date) %>% group_by(year) %>% slice(1)
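Since the question asks about filter() specifically, here is a minimal sketch of a filter()-based variant of the same idea. It assumes the data frame is called Animal_data with a Date column called date (the names mentioned in the comments); the values are just the sample dates from above, and format(date, "%Y") is swapped in for lubridate's year() so that only dplyr is needed.

```r
library(dplyr)

# Made-up stand-in for the asker's data (names taken from the comments)
Animal_data <- tibble(
  date = as.Date(c("2022-05-03", "2022-05-01", "2022-04-23",
                   "2021-05-04", "2021-02-28", "2020-01-30", "2020-05-20"))
)

first_per_year <- Animal_data %>%
  group_by(year = format(date, "%Y")) %>%  # one group per calendar year
  filter(date == min(date)) %>%            # keep the earliest date in each group
  ungroup() %>%
  arrange(date)

first_per_year
```

Unlike slice(1), filter(date == min(date)) keeps every row that ties for the earliest date in a year, which may or may not be what you want.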

Best wishes!

Diego Rojas

Thanks! If my data is named "Animal_data" and the column with dates "date", what would that look like? – Joseph Carthof Jul 14 '22 at 20:25
If you use the `animal_data` data frame defined by @jay.sf, you can use the `dplyr` code as follows: `animal_data %>% mutate(year = year(date)) %>% arrange(date) %>% group_by(year, animals) %>% slice(1) %>% ungroup() %>% select(-year)` – Diego Rojas Jul 14 '22 at 21:24

jay.sf · Answer 2 · 2022-07-14T20:49:17.070

I'll show you some ways. First of all, use the "Date" format for dates!

animal_data <- transform(animal_data, date=as.Date(date))

Here is an option using aggregate with the formula interface, aggregating by animal name and by characters 1 to 4 of the date, i.e. the year,

aggregate(date ~ animals + substr(date, 1, 4), animal_data, min)
#   animals substr(date, 1, 4)       date
# 1 Gorilla               2020 2020-07-05
# 2  Rhebok               2020 2020-02-22
# 3  Vicuna               2020 2020-06-23
# 4 Gorilla               2021 2021-01-11
# 5  Rhebok               2021 2021-03-10
# 6  Vicuna               2021 2021-05-24
# 7 Gorilla               2022 2022-05-03
# 8  Rhebok               2022 2022-04-29

or with list notation, which gives the most flexibility over the column names of the result.

with(animal_data, aggregate(list(date=date), list(animals=animals, year=substr(date, 1, 4)), min))
#   animals year       date
# 1 Gorilla 2020 2020-07-05
# 2  Rhebok 2020 2020-02-22
# 3  Vicuna 2020 2020-06-23
# 4 Gorilla 2021 2021-01-11
# 5  Rhebok 2021 2021-03-10
# 6  Vicuna 2021 2021-05-24
# 7 Gorilla 2022 2022-05-03
# 8  Rhebok 2022 2022-04-29
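If the substr(date, 1, 4) column name in the formula version bothers you, one possible variation is to store the year in its own column first. This is only a sketch on a small made-up data frame (obs and its values are invented for illustration, not the data from this answer), using format(date, "%Y") to extract the year.

```r
# Small made-up data set, for illustration only
obs <- data.frame(
  animals = c("Rhebok", "Rhebok", "Gorilla", "Gorilla", "Rhebok"),
  date    = as.Date(c("2020-02-22", "2020-05-10", "2020-07-05",
                      "2021-01-11", "2021-03-10"))
)

# Store the year as a separate column so the result gets a clean name
obs$year <- format(obs$date, "%Y")

# Earliest date per animal and year
first_obs <- aggregate(date ~ animals + year, data = obs, FUN = min)
first_obs
```

The grouping columns in the result are then simply animals and year, with the formula interface kept.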

Another way is to use ave inside subset. subset expects a logical condition. ave internally splits the date by animal and then by year, and applies which.min to each subset. We compare the output of ave (the first observation of the animal in that year) with the date, and in this way create the logical condition.

subset(animal_data, ave(date, animals, substr(date, 1, 4), FUN=\(x) x[which.min(x)]) == date)
#    animals       date
# 1   Rhebok 2020-02-22
# 2   Vicuna 2020-06-23
# 3  Gorilla 2020-07-05
# 12 Gorilla 2021-01-11
# 13  Rhebok 2021-03-10
# 14  Vicuna 2021-05-24
# 19  Rhebok 2022-04-29
# 20 Gorilla 2022-05-03
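To make the ave step less of a black box, here is a sketch that pulls the one-liner apart on a small made-up data frame (obs, yr, first_in_group and keep are names invented for this illustration). It shows that ave returns one value per row, namely the earliest date of that row's group, which is then compared with the row's own date.

```r
# Small made-up data set, for illustration only
obs <- data.frame(
  animals = c("Rhebok", "Rhebok", "Gorilla", "Gorilla"),
  date    = as.Date(c("2020-02-22", "2020-05-10", "2020-07-05", "2021-01-11"))
)

yr <- format(obs$date, "%Y")  # year of each observation

# For every row: the earliest date within its animal/year group.
# x[which.min(x)] returns an empty result for empty groups instead of warning.
first_in_group <- ave(obs$date, obs$animals, yr,
                      FUN = function(x) x[which.min(x)])

# TRUE where the row itself is the first observation of its group
keep <- first_in_group == obs$date

first_rows <- obs[keep, ]
first_rows
```

Here keep is a logical vector with one entry per row of obs, which is exactly the kind of condition subset expects.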

Now you probably have a few options to choose from.

Data:

animal_data <- structure(list(animals = c("Rhebok", "Vicuna", "Gorilla", "Rhebok", 
"Rhebok", "Gorilla", "Rhebok", "Vicuna", "Vicuna", "Gorilla", 
"Vicuna", "Gorilla", "Rhebok", "Vicuna", "Rhebok", "Rhebok", 
"Rhebok", "Vicuna", "Rhebok", "Gorilla"), date = structure(c(18314, 
18436, 18448, 18487, 18502, 18516, 18549, 18582, 18588, 18589, 
18604, 18638, 18696, 18771, 18806, 18807, 18911, 18938, 19111, 
19115), class = "Date")), row.names = c(8L, 15L, 9L, 18L, 3L, 
20L, 4L, 14L, 7L, 10L, 5L, 17L, 11L, 6L, 19L, 2L, 1L, 13L, 12L, 
16L), class = "data.frame")
