Replace specific chr values within groups for multiple variables in R

Question

1. Summarize the problem

Hi, I'm relatively new to R and this is my first question on Stack Overflow, but I’ve been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.

I have a large data frame (200 000+ rows) where one variable is an alphanumeric ID that represents unique candidates and other variables represent different characteristics. Some candidates are included multiple times in the file, but have different values for the same characteristic. I want to resolve these discrepancies to be able to remove duplicates later. The data structure is similar to this:

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
                 var1 = c("No", "Yes", "No", "No", "No", "No"),
                 var2 = c("No", "No", "No", "Yes", "No", "No"),
                 var3 = c("No", "No", "No", "No", "No", "Yes"))

My goal is to first create subgroups based on ID, then check within each ID whether there is at least one value of “Yes”, and if so change all of that ID's values to “Yes”. I want to repeat this for a few variables (var1, var2, var3). This is the result I would like to have:

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
              var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
              var2 = c("No", "No", "No", "Yes", "Yes", "No"),
              var3 = c("No", "No", "No", "No", "No", "Yes"))

After this, I will remove duplicate rows to only keep the data that I need.

df <- distinct(df)

2. Describe what you’ve tried

I found partial solutions, but I’m having difficulty putting them together. I can group my data by ID using group_by from dplyr, but I'm having trouble applying my other functions to the groups:

df <- df %>% group_by(ID)

I can replace “No” with “Yes” using if combined with any, but without the groups it changes all the values in var1:

if(any(df$var1 == "Yes"))
  {  df$var1 = "Yes"  }

The solution I'm trying to create would be similar to Creating loop for slicing the data, loop through the duplicated positions, using for to loop over the IDs and then the variables, but without replacing with random values.

Thanks for reading the hints before posting your question. I think this should work for you: `df %>% group_by(ID) %>% summarise(across(var1:var3, ~ if_else(any(. == "Yes"),"Yes","No")))` Or perhaps `df %>% group_by(ID) %>% mutate(across(var1:var3, ~ if(any(. == "Yes"))rep("Yes",length(.)) else .))` — Ian Campbell, Jun 17 '21 at 15:24

score 4 · Accepted Answer · answered Jun 17 '21 at 15:38

I've promoted my comment to an answer to explain more.

First, we need to decide if we want to use dplyr::summarise or dplyr::mutate. summarise makes a single row for every group, whereas mutate leaves the data the same dimensions.

In your example data, all of the rows within each group will be the same after the transformation, so do you really need the duplicates? Perhaps your real data has other variables, so mutate might make sense.
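For comparison, here is a minimal sketch of the summarise route on the example data from the question. It assumes the columns hold only "Yes" and "No" with no missing values (with NAs you would probably want any(.x == "Yes", na.rm = TRUE)); the name collapsed is only for illustration.

```r
library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

# One row per ID: "Yes" if any row in the group is "Yes", otherwise "No"
collapsed <- df %>%
  group_by(ID) %>%
  summarise(across(var1:var3, ~ if (any(.x == "Yes")) "Yes" else "No"))

nrow(collapsed)
#> [1] 3
```

Any column not named in across is dropped by summarise, which is the main reason to prefer mutate when the real data has other variables.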

From here, we just need to use dplyr::across to do the same action on each column. The first argument is to select the columns, and the second is the function you want to apply.

For mutate, we can use base R's if and else to test whether any value in the group is "Yes". If so, we repeat "Yes" as many times as there are rows in that group. Otherwise, we leave the data alone. Inside across, the current column is represented by `.`.

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~ if (any(. == "Yes")) rep("Yes", length(.)) else .))
# A tibble: 6 x 4
# Groups:   ID [3]
  ID     var1  var2  var3 
  <chr>  <chr> <chr> <chr>
1 123abc Yes   No    No   
2 123abc Yes   No    No   
3 123abc Yes   No    No   
4 456def No    Yes   No   
5 456def No    Yes   No   
6 789ghi No    No    Yes
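One caveat when choosing between ifelse and a plain if / else for this pattern: ifelse returns a result the same length as its test, and any() always gives a single value. The sketch below uses a made-up group containing a third value, "Maybe", to show the difference; the example data in the question only contains "Yes" and "No", so both forms give the same output there.

```r
x <- c("No", "Maybe", "No")   # hypothetical group with no "Yes"

# ifelse() is shaped like its test, which has length one here,
# so only the first element of x comes back
res_ifelse <- ifelse(any(x == "Yes"), rep("Yes", length(x)), x)
res_ifelse
#> [1] "No"

# if / else returns the whole vector untouched
res_if <- if (any(x == "Yes")) rep("Yes", length(x)) else x
res_if
#> [1] "No"    "Maybe" "No"
```

Inside mutate the length-one result would be recycled across the group, so "Maybe" would silently become "No". If the columns can hold anything other than "Yes" and "No", if / else is the safer choice.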

Thank you! As you described, both answers work, but since I do have other variables, the one with mutate works best. Thanks for taking the time to explain! — Max, Jun 17 '21 at 15:42

score 1 · Answer 2 · answered Jun 17 '21 at 15:58

If you're willing to use data.table, you can do all of this with lapply. This is based on @ricardo-saporta's answer to Summarizing multiple columns with data.table.

library(tibble)
library(data.table)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes"))

setDT(df)

any_yes <- function(x) {
  if (any(x == 'Yes')) {
    return('Yes')
  }
  
  'No'
}

df[, lapply(.SD, any_yes), by = ID]
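Note that the call above returns one row per ID. If you need to keep every row and any other columns, a possible variant is to update the columns by reference with := instead. This is only a sketch: the score column and the cols vector are made up for illustration, and it assumes there are no missing values in var1 to var3.

```r
library(data.table)

dt <- data.table(
  ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes"),
  score = 1:6   # hypothetical extra column that should be left alone
)

cols <- c("var1", "var2", "var3")

# Within each ID, overwrite a column with "Yes" if it contains any "Yes";
# rows and the other columns are kept as they are
dt[, (cols) := lapply(.SD, function(x) {
  if (any(x == "Yes")) rep("Yes", length(x)) else x
}), by = ID, .SDcols = cols]
```

Because := modifies dt in place, there is no need to assign the result back.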

score 0 · Answer 3 · answered Jun 19 '21 at 07:18

One more way, which I learnt from @akrun, that avoids the need for ifelse:

library(dplyr)

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~  c('No', 'Yes')[1 + as.logical(sum(. == 'Yes'))]))

#> # A tibble: 6 x 4
#> # Groups:   ID [3]
#>   ID     var1  var2  var3 
#>   <chr>  <chr> <chr> <chr>
#> 1 123abc Yes   No    No   
#> 2 123abc Yes   No    No   
#> 3 123abc Yes   No    No   
#> 4 456def No    Yes   No   
#> 5 456def No    Yes   No   
#> 6 789ghi No    No    Yes

Created on 2021-06-19 by the reprex package (v2.0.0)
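In case the indexing trick is not obvious, here it is broken into steps on a single hypothetical group. The intermediate names (n_yes, has_yes, idx) are only for illustration.

```r
x <- c("No", "Yes", "No")        # values of one column within one ID

n_yes   <- sum(x == "Yes")       # number of "Yes" values in the group
has_yes <- as.logical(n_yes)     # FALSE if zero, TRUE otherwise
idx     <- 1 + has_yes           # 1 when there is no "Yes", 2 when there is
result  <- c("No", "Yes")[idx]   # pick the label by position

result
#> [1] "Yes"
```

The result has length one and mutate recycles it across the group, so a group without any "Yes" ends up as all "No". That matches the question's data, but it would overwrite any other values if the columns contained them.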
