1. Summarize the problem
Hi, I'm relatively new to R and this is my first question on Stack Overflow, but I’ve been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.
I have a large data frame (200 000+ rows) where one variable is an alphanumeric ID that represents unique candidates and other variables represent different characteristics. Some candidates are included multiple times in the file, but have different values for the same characteristic. I want to resolve these discrepancies to be able to remove duplicates later. The data structure is similar to this:
library(dplyr)
df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))
My goal is to first create subgroups based on ID, then check within each ID whether a variable has at least one value of “Yes”, and if so, change all of that ID's values for that variable to “Yes”. I want to repeat this for a few variables (var1, var2, var3). This is the result that I would like to have:
df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "Yes", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))
After this, I will remove duplicate rows to only keep the data that I need.
df <- distinct(df)
2. Describe what you’ve tried
I found partial solutions, but I’m having difficulty putting them together. I can group my data by ID using group_by() from dplyr, but I'm having issues applying my other functions to the groups:
df <- df %>% group_by(ID)
I can replace the “No” with “Yes” using if combined with any(), but without the groups, it changes all the values in var1:
if (any(df$var1 == "Yes")) {
  df$var1 <- "Yes"
}
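For illustration, here is a minimal sketch of how the same if/any check might be applied per ID rather than to the whole column. It assumes dplyr 1.0 or later (for across()) and that the columns are character vectors; the name result is only a placeholder, and na.rm = TRUE is an added assumption to guard against missing values. This is one possible approach, not necessarily the best one for 200 000+ rows.

```r
library(dplyr)

# Example data from the question
df <- tibble(
  ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes")
)

result <- df %>%
  group_by(ID) %>%
  # Within each ID, set the whole column to "Yes" if any value is "Yes";
  # otherwise leave the values unchanged.
  mutate(across(c(var1, var2, var3),
                ~ if (any(.x == "Yes", na.rm = TRUE)) "Yes" else .x)) %>%
  ungroup() %>%
  # Drop rows that are now exact duplicates
  distinct()
```

Because the check runs inside group_by(ID), each ID is evaluated on its own, and distinct() with no column arguments compares all columns.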
The solution I'm trying to create would be similar to "Creating loop for slicing the data" and "loop through the duplicated positions": using for to loop over the IDs and then over the variables, but without replacing with random values.
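The loop idea described above could be sketched roughly as follows in base R. The names df_loop, vars, and rows are placeholders chosen for this example, and na.rm = TRUE is an added assumption. With many distinct IDs, a nested loop like this will probably be noticeably slower than a grouped approach, so treat it as an illustration of the logic rather than a recommendation.

```r
# Example data from the question, as a base R data frame
df_loop <- data.frame(
  ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

vars <- c("var1", "var2", "var3")

for (id in unique(df_loop$ID)) {
  rows <- df_loop$ID == id            # logical index of the rows for this ID
  for (v in vars) {
    # If any row for this ID has "Yes" in this variable, set them all to "Yes"
    if (any(df_loop[[v]][rows] == "Yes", na.rm = TRUE)) {
      df_loop[[v]][rows] <- "Yes"
    }
  }
}

# Remove rows that are now exact duplicates
df_loop <- unique(df_loop)
```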