I'm looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here's an example:
# artificial data
df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)
The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:
- If the value 'Yes' is present in any of the columns, the value of the new_column will be 'Yes';
- If the value 'Yes' is not present in any of the columns, but the value 'No' is present, then the value of the new_column will be 'No';
- If the values 'Yes' and 'No' are both absent, then the value of the new_column will be 'Unknown'.
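Read together, the three rules amount to a priority order: 'Yes' beats 'No', and 'No' beats 'Unknown'. To make that concrete, here is a minimal sketch in plain Python of the rule applied row by row to any number of columns. It is only an illustration of the intended behaviour, not a tuned solution; the names `PRIORITY` and `combine_status` are hypothetical, and treating a row with none of the three values as 'Unknown' is an assumption.

```python
# Priority order implied by the rules: earlier entries win over later ones.
PRIORITY = ["Yes", "No", "Unknown"]


def combine_status(values, priority=PRIORITY):
    """Hypothetical helper: return the highest-priority value present in `values`."""
    present = set(values)
    for level in priority:
        if level in present:
            return level
    # Assumption: a row holding none of the known values counts as 'Unknown'.
    return priority[-1]


# Same artificial data as in the R snippet, stored column-wise.
data = {
    "col_1": ["No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "Unknown"],
    "col_2": ["Yes", "Yes", "Unknown", "Yes", "Unknown", "No", "Unknown", "No", "Unknown", "Unknown"],
    "col_3": ["Unknown", "Yes", "Yes", "Unknown", "Unknown", "No", "No", "Unknown", "Unknown", "Unknown"],
}

# zip(*columns) walks the data row by row, however many columns there are.
new_column = [combine_status(row) for row in zip(*data.values())]
print(new_column)
```

Because the function only looks at the set of values in a row, adding a fourth or an N-th column needs no change to the logic. With pandas, the same function could presumably be applied row-wise via `DataFrame.apply(..., axis=1)`, although a vectorized approach would likely be faster on large data.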
I managed to implement this with case_when(), spelling out every possible combination, and with nested ifelse() calls. But neither solution scales to N variables.
Current solution:
library(dplyr)

df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown', 'No'
      )
    )
  )
I'm looking for an algorithm that does this faster and can be extended to N variables.
After searching StackOverflow, I couldn't find a solution to my problem (I know there are several posts about creating a new column based on values from other columns, but none that covers this case). Perhaps my search strategy was not the best. If anyone finds a relevant post, please provide the link.
The code above is in R, but the current solution can also be written in Python using np.where. Solutions in R or Python are welcome.