I am trying to turn specific string patterns into binary columns for three different columns using the R programming language.
Here is what I have:
have <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad",
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp",
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china",
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china",
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing",
"giad", "china", "china", "china, camp", "camp", "na", "na",
"cis", "trans", "stron, mon")), row.names = c(NA, -17L), class = c("data.table",
"data.frame"))
And here is what I want:
want <- structure(list(rep1 = c("china", "na", "bay", "eng", "giad",
"china", "sing", "giad", "na", "china", "china, camp", "guat,camp",
"na", "na", "cis", "trans", "stron, mon"), rep2 = c("china",
"na", "bay", "eng", "giad", "china", "sing", "giad", "na", "china",
"china, camp", "camp", "na", "na", "cis", "trans", "stron, mon"
), rep3 = c("na", "na", "bay", "eng", "giad", "china", "sing",
"giad", "china", "china", "china, camp", "camp", "na", "na",
"cis", "trans", "stron, mon"), rep1_chi = c(1, 0, 0, 0, 0, 1,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep2_chi = c(1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0), rep3_chi = c(0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0), rep1_bay = c(0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_bay = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_bay = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_gia = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep1_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep2_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), rep3_sin = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-17L))
I was able to create a working solution using ifelse and stringr::str_detect as follows:
library(dplyr)
library(stringr)
# str_detect() already returns TRUE/FALSE, so no "== T" comparison is needed
want <- have %>%
  dplyr::select(rep1, rep2, rep3) %>%
  mutate(
    rep1_chi = ifelse(str_detect(rep1, "chi"), 1, 0),
    rep2_chi = ifelse(str_detect(rep2, "chi"), 1, 0),
    rep3_chi = ifelse(str_detect(rep3, "chi"), 1, 0),
    rep1_bay = ifelse(str_detect(rep1, "bay"), 1, 0),
    rep2_bay = ifelse(str_detect(rep2, "bay"), 1, 0),
    rep3_bay = ifelse(str_detect(rep3, "bay"), 1, 0),
    rep1_gia = ifelse(str_detect(rep1, "gia"), 1, 0),
    rep2_gia = ifelse(str_detect(rep2, "gia"), 1, 0),
    rep3_gia = ifelse(str_detect(rep3, "gia"), 1, 0),
    rep1_sin = ifelse(str_detect(rep1, "sin"), 1, 0),
    rep2_sin = ifelse(str_detect(rep2, "sin"), 1, 0),
    rep3_sin = ifelse(str_detect(rep3, "sin"), 1, 0)
  )
My main issue is that this is rather repetitive: every new pattern needs three more nearly identical lines. Is there a more elegant solution? Since the "rep" columns are numbered 1 to 3, I thought there might be a better way to program this.
Looking through SO I found a solution using model.matrix that seems to work nicely when you want every pattern and are only interested in a single column. I tried turning this into a function so that I can select multiple columns, but I would still have to delete the strings with the patterns that are not of interest.
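To make the repetition concrete, here is a minimal base R sketch that loops over the patterns and the "rep" columns instead of writing out each combination. The names have_small, cols, patterns and out are only illustrative, and have_small is a shortened stand-in for the data above rather than the real table. It is meant as one possible approach under those assumptions, not a definitive answer.

```r
# Shortened stand-in for `have` (hypothetical toy data, not the full table)
have_small <- data.frame(
  rep1 = c("china", "na", "bay", "giad", "sing", "china, camp"),
  rep2 = c("china", "na", "bay", "giad", "sing", "camp"),
  rep3 = c("na", "china", "bay", "giad", "sing", "camp"),
  stringsAsFactors = FALSE
)

cols     <- paste0("rep", 1:3)             # source columns rep1..rep3
patterns <- c("chi", "bay", "gia", "sin")  # substrings of interest

out <- have_small
for (p in patterns) {
  for (col in cols) {
    # grepl() returns TRUE/FALSE; as.numeric() turns that into 1/0
    out[[paste0(col, "_", p)]] <- as.numeric(grepl(p, out[[col]], fixed = TRUE))
  }
}

print(names(out))
```

Putting the pattern loop on the outside gives the same column order as in want (rep1_chi, rep2_chi, rep3_chi, rep1_bay, ...). Only the patterns listed in patterns produce columns, so nothing has to be deleted afterwards.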