Summarising status from multiple rows in R

Question

I'd like to know how to summarise the status of a particular ID across multiple rows. I know how to do this in SAS with a BY statement, first. and last. but unsure how to achieve the same thing in R.

Example of data that needs to be summarise, i.e. for each ID, is the colour column: "red only", "blue only", or "both"

Generate data frame

example <- data.frame(id = c("A1", "A1", "A1", "A2", "A3", "A3", "A4", "A4", "A4", "A5", "A5", "A6"),
                      colour = c("red", "red", "blue", "red", "blue", "blue", "red", "red", "red", "red", "blue", "red"))

Output table

   id colour
1  A1    red
2  A1    red
3  A1   blue
4  A2    red
5  A3   blue
6  A3   blue
7  A4    red
8  A4    red
9  A4    red
10 A5    red
11 A5   blue
12 A6    red

Desired result

  id    status
1 A1      both
2 A2  red only
3 A3 blue only
4 A4  red only
5 A5      both
6 A6  red only

Equivalent Code in SAS would be:

data table1 (keep=id status);
set example;
by id;
retain red_count blue_count;

if first.id then do;
  red_count = 0;
  blue_count = 0;
end;

if colour = "red" then red_count+1;
if colour = "blue" then blue_count+1;

if last.id then do;
  if red_count > 0 and blue_count > 0 then status = "both";
  else if red_count > 0  then status = "red only";
  else if blue_count > 0 then status = "blue only";
  output;
end;

run;

`aggregate(colour ~ id, example, FUN = function(z) paste(sort(unique(z)), collapse = " "))` is a good start, where you can then either merge/join onto the new `colour` field to replace `"blue red"` with `"both"`, etc, or you can do some basic transforms on it to get the exact strings you need. This is partially a dupe of https://stackoverflow.com/q/11562656/3358272, https://stackoverflow.com/q/1660124/3358272 in that you're trying to summarize by group; see those links for other ways to do this aggregation. — r2evans, Apr 25 '23 at 14:14

score 3 · Accepted Answer · answered Apr 25 '23 at 14:15

You can use dplyr to make this pretty straight forward

library(dplyr)
example %>% 
  group_by(id) %>% 
  summarize(status = case_when(
    all(colour=="red") ~ "red only",
    all(colour=="blue") ~ "blue only",
    TRUE ~ "both"
  ))
#   id    status   
#   <chr> <chr>    
# 1 A1    both     
# 2 A2    red only 
# 3 A3    blue only
# 4 A4    red only 
# 5 A5    both     
# 6 A6    red only

score 1 · Answer 2 · answered Apr 25 '23 at 14:13

Get the distinct rows, then do a group by paste based on condition of checking the number of elements

library(dplyr)# version >= 1.1.0
library(stringr)
example %>% 
  distinct() %>% 
  reframe(status = if(n() > 1) 'both' else sprintf('%s only', colour), .by = id)

-output

  id    status
1 A1      both
2 A2  red only
3 A3 blue only
4 A4  red only
5 A5      both
6 A6  red only

score 1 · Answer 3 · answered Apr 25 '23 at 14:17

Create a custom function:

summarise_colour <- function(x){
  if(n_distinct(x)>=2){
    return("both")
  }
  if("red" %in% x){
    return("red only")
  }
  if("blue" %in% x){
    return("blue only")
  }
  
}

example %>%
  group_by(id) %>%
  summarise(status=summarise_colour(colour)) %>%
  ungroup

  id    status   
  <fct> <chr>    
1 A1    both     
2 A2    red only 
3 A3    blue only
4 A4    red only 
5 A5    both     
6 A6    red only

score 1 · Answer 4 · answered Apr 25 '23 at 14:28

The base R equivalent:

aggregate(colour ~ id, data = example, FUN = \(i) {if (length(unique(i)) == 2) 
   { "both"  } else 
     {sprintf("%s only", unique(i))}})

  id    colour
1 A1      both
2 A2  red only
3 A3 blue only
4 A4  red only
5 A5      both
6 A6  red only

Summarising status from multiple rows in R

4 Answers4