I'm trying to identify the lowest rate over a range of years for a number of items (ID). In addition, I would like to know the Year the lowest rate was pulled from. I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Year = rep(2010:2012,4),
Rate = c(0.3,0.6,0.9,
0.8,0.5,0.2,
0.8,0.4,0.9,
0.7,0.7,0.7))
sample data as table
| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2012 | 0.6 |
| 1 | 2010 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from. This is what I would like my final result to look like:
| ID | Year | Rate |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate),
Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
group_by(ID) %>%
mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached. How would I go from this to what I want?
I tried to left_join / merge based on ID and Lowest Rate, but there's multiple matches for ID #4. Is there any way to only pick one match (row)?
df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))
df.joined as table
| ID | Year | Rate |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't really find anything that strikes this exactly. Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty values?
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!