I am interested in doing string detection and value comparison across a range of columns. If the string (which in this case is ZSD
) is found in columns, then their corresponding values from another column need to be compared.
Input
My input is as below:
a.zsd a.test b.zsd b.test c.zsd c.test d.zsd d.test
'ZSD' 0.0 'ZAD' 1.0 NA 0.5 'ZAD' 1.0
'ZAD' 1.0 NA 0.0 NA 0.5 'ZSD' 0.0
NA 0.5 NA 0.5 'ZAD' 0.5 NA 0.5
'Not Achieved ZSD' 0.0 NA 0.5 'ZAD' 0.5 NA 0.5
'ZSD' 1.0 'ZSD' 0.5 NA 0.5 'ZSD' 0.0
NA 0.0 NA 0.0 NA 0.5 NA 0.0
NA 1.0 'ZSD' 0.0 'ZSD' 0.5 'ZSD' 1.0
Output
In my output, I want two additional columns smallest.test
and zsd.level
:
a.zsd a.test b.zsd b.test c.zsd c.test d.zsd d.test smallest.test zsd.level
'ZSD' 0.0 'ZAD' 1.0 NA 0.5 'ZAD' 1.0 0.0 a
'ZAD' 1.0 NA 0.0 NA 0.5 'ZSD' 0.0 0.0 d
NA 0.5 NA 0.5 'ZAD' 0.5 NA 0.5 0.0 NA
'Not Achieved ZSD' 0.0 NA 0.5 'ZAD' 0.5 NA 0.5 0.0 a
'ZSD' 1.0 'ZSD' 0.5 NA 0.5 'ZSD' 0.0 0.0 d
NA 0.0 NA 0.0 NA 0.5 NA 0.0 0.0 NA
NA 1.0 'ZSD' 0.0 'ZSD' 0.5 'ZSD' 1.0 0.0 b
Info:
My data frame has over a hundred columns. I am interested in ONLY some of the columns having a name that ends at a string .zsd
. These columns can either have NA
or one of the following string values ZAD
, ZSD
, Not Achieved ZSD
. Each column with the .zsd
string name, has an associated .test
column.
Requirements
I want two new columns in the output smallest.test
and zsd.level
. The requirements are as below:
Iterate through the column names ending with the string
.zsd
Across those columns detect the string
ZSD
If the
ZSD
string is found in only one of the columns, return the names of that column in the output columnzsd.level
and also return the corresponding value from the column name ending at.test
to be returned to the output columnsmallest.test
.If none of the columns contains the string
ZSD
, return NA in the output columnzsd.level
and return 0.0 in the corresponding output columnsmallest.test
.If more than one columns contain the string
ZSD
, pick the column with the least value of the corresponding.test
column and return in the output.If more than one columns contain the string
ZSD
, and they all have the same value of the corresponding.test
column, then pick the last column name for the output and the corresponding value of the.test
for the output.
dput()
dput(df)
structure(list(a.zsd = c("ZSD", "ZAD", NA, "Not Achieved ZSD", "ZSD", NA, NA),
a.test = c(0, 1, 0.5, 0, 1, 0, 1),
b.zsd = c("ZAD", NA, NA, NA, "ZSD", NA, "ZSD"),
b.test = c(1, 0, 0.5, 0.5, 0.5, 0, 0),
c.zsd = c(NA, NA, "ZAD", "ZAD", NA, NA, "ZSD"),
c.test = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
d.zsd = c("ZAD", "ZSD", NA, NA, "ZSD", NA, "ZSD"),
d.test = c(1, 0, 0.5, 0.5, 0, 0, 1)),
class = "data.frame", row.names = c(NA, -7L))
Partial solution
Based on the following post: String matching over multiple columns with specific string names, this code can iterate and select the .zsd
columns and return the highest column name in the output. But it does not take into account the corresponding values of the .test
field. Any help on this would be greatly appreciated.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(contains("zsd"), ~case_when(str_detect(., "ZSD") ~ cur_column()), .names = 'new_{col}')) %>%
unite(zsd_level, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
mutate(zsd_level = str_remove_all(zsd_level, ".zsd"),
zsd_level = str_sub(zsd_level, -1))