0

I have a dataset with multiple rows of the same individual

set.seed(420)
df <- data.frame(ind = c(rep("A",3), rep("B",5), rep("C",4)), value = seq(1:12), location = sample(c("first", "second", "third"), 12, replace = TRUE))

df
  ind value location
1    A     1    first
2    A     2    first
3    A     3   second
4    B     4   second
5    B     5    first
6    B     6    first
7    B     7    first
8    B     8    first
9    C     9    first
10   C    10    first
11   C    11    first
12   C    12    third

I would like to find the location for each individual for which the value column is highest.

So the final dataset would like:

ind    value    location
A      3        second
B      8        first
C      12       third

Is this possible to do with group_by and summarize or mutate in dplyr?

Alex Krohn
  • 137
  • 6

1 Answers1

2

There are a few ways to do it using tidyverse.

library(tidyverse)

df %>% group_by(ind) %>% slice_max(value)

Or

df %>% group_by(ind) %>% filter(value == max(value))

Addressing your question

mutate()

For mutate(), it would require extra steps to filter the data (i.e. made the data unique), since it would not shrink the data by group.

  1. First group_by as usual,
  2. Make sure we arrange by value, since we'll use the position to extract the location associated with maximum value
  3. Set value column to the max(value)
  4. Set location column to the last(location), since we've sorted the value, last(location) should be the location of the max(value)
  5. Only keep the distinct rows
df %>% group_by(ind) %>% 
  arrange(value) %>% 
  mutate(value = max(value), 
         location = last(location)) %>% 
  distinct(value, .keep_all = T)

summarise()

Similar logic from mutate() can be applied to summarise(), but we do not need the distinct() step, since the summarise() would naturally shrink the data by group, but remember we need arrange(value) to make sure the values are sorted properly.

df %>% group_by(ind) %>% 
  arrange(value) %>% 
  summarize(value = max(value), location = last(location))

Output

# A tibble: 3 x 3
# Groups:   ind [3]
  ind   value location
  <chr> <int> <chr>   
1 A         3 second  
2 B         8 first   
3 C        12 third   
benson23
  • 16,369
  • 9
  • 19
  • 38