How to summarize data in R (dplyr) and avoid duplicate identifiers?

Question

I'm trying to identify the lowest rate over a range of years for a number of items (ID). In addition, I would like to know the Year the lowest rate was pulled from. I'm grouping by ID, but I run into an issue when rates are duplicated across years.

sample data

df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4), 
             Year = rep(2010:2012,4),
             Rate = c(0.3,0.6,0.9,
                      0.8,0.5,0.2,
                      0.8,0.4,0.9,
                      0.7,0.7,0.7))

sample data as table

| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2012 | 0.6 |
| 1 | 2010 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |

Using dplyr I grouped by ID, then found the lowest rate.

df.Summarise <- df %>%
   group_by(ID) %>%
   summarise(LowestRate = min(Rate))

This gives me the following

| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |

However, I also need to know the year that data was pulled from. This is what I would like my final result to look like:

| ID | Year | Rate |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |

Here's where I ran into some issues.

Attempt #1: Include "Year" in the original dplyr code

df.Summarise2 <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate),
            Year = Year)

Error: Column `Year` must be length 1 (a summary value), not 3

Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!

Attempt #2: Use mutate instead of summarise

df.Mutate <- df %>%
  group_by(ID) %>%
  mutate(LowestRate = min(Rate))

So that essentially returns my original dataframe, but with an extra column for LowestRate attached. How would I go from this to what I want?

I tried to left_join / merge based on ID and Lowest Rate, but there's multiple matches for ID #4. Is there any way to only pick one match (row)?

df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))

df.joined as table

| ID | Year | Rate |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 | 
| 4 | 0.7 | 2011 | 
| 4 | 0.7 | 2012 |

I've tried looking online, but I can't really find anything that strikes this exactly. Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty values?

The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)

Thanks for any help!

You can select the row with minimum `Rate` value. `df %>% group_by(ID) %>% slice(which.min(Rate))` — Ronak Shah, Feb 23 '21 at 02:12

score 0 · Answer 1 · answered Feb 23 '21 at 01:07

0

You can group by ID and then filter without summarizing, and that way you'll preserve all columns but still only keep the min value:

df %>%
   group_by(ID) %>%
   filter(Rate == min(Rate))

answered Feb 23 '21 at 01:07

redarah

152
7

Thanks for the response! Unfortunately, it returned the same thing I was having issues with, where the ID #4 was duplicated (once for each year). – Ryan B. Feb 23 '21 at 01:13
What is it you're trying to do? I understand you need the minimum value of "Rate" per ID, but when there are multiple "Years" for the lowest "Rate" (like in this example, when the 0.7 rate is the lowest for ID 4 and happens in 2010, 2011, and 2012) - which year value would you like to return? – redarah Feb 23 '21 at 01:22
To get the dataframe you say is what you want at the top, you can run: `df %>% group_by(ID) %>% filter(Rate == min(Rate)) %>% filter(Year == max(Year))` but I'm just choosing "max" because you had 2012 in your example - not sure if that's actually what you want – redarah Feb 23 '21 at 01:23
Given rates (for multiple IDs) over a series of years, I'm trying to identify the lowest rate, *and* return the corresponding year. I shouldn't have included any info about the duplicate ID #4 ... since I can just hard-code an exception into the code. Not perfect, but having trouble with an easy solution. My real problem is using summarise or mutate *and* having the corresponding year (for the lowest rate) show up in the output table. – Ryan B. Feb 23 '21 at 02:44

How to summarize data in R (dplyr) and avoid duplicate identifiers?

1 Answers1