3

I have a dataframe like the following:

ID     COL01_A  COL01_B   COL02_A COL02_B  ... COL12_A  COL12_B
1      01       19990101  03      20000101 ... FF       ""
2      03       20170810  FA      20120303 ... ""       ""
3      GG       19940508  DD      20000101 ... 03       20060808
4      03       20180101  09      20000101 ... ""       ""
5      GF       20171212  03      19990101 ... 02       20190101

The values in the columns type A dictate wether the value in column type B is the one i'm looking for. In this case the interest is for values "03". There are twelve pairs of this columns. As seen in the example from COL01_A/COL01_B to COL12_A/COL12_B

I was looking for a way to generate a new column (lets call it COL_X) where the value of column type B is reflected only if it's twin column of type A has the "03" value. For the example presented above, the desired result would be something like this.

ID  COL01_A  COL01_B   COL02_A COL02_B  ... COL12_A  COL12_B   COL_X
1   01       19990101  03      20000101 ... FF       ""        20000101
2   03       20170810  FA      20120303 ... ""       ""        20170810     
3   GG       19940508  DD      20000101 ... 03       20060808  20060808
4   03       20180101  09      20000101 ... ""       ""        20180101  
5   GF       20171212  03      19990101 ... 02       20190101  19990101

Right now i've solved my problem using a brutally long nested ifelse statement, which is not exactly readable nor is it a good practice (in my opinion). In terms of efficiency, it's fast, but i suppose it's only because the data is not too massive. I also found another solution using do.call(pmax(...)), But this solution requires me to clean the data frame (using ifelse statements) and creating an auxiliary dataframe with all the other information per row.

Is there a way to accomplish this in the least lines of code possible and/or not using auxiliary structures? If the solution uses data.table or dplyr it would be great.

Basic reproducible example:

ID <- c(1,2,3,4,5)
DATA <- c('xxx', 'yyy', 'zzz','xyz','zxy')
COL01_A<- c('01','03','GG','03','GF')
COL01_B<- c('19990101','20170810','19940508','20180101','20171212')
COL02_A<- c('03','FA','DD','09','03')
COL02_B<- c('20000101','20120303','20000101','20000101','19990101')
COL03_A<- c('FF','','03','','02')
COL03_B<- c('','','20060808','','20190101')

df <- data.frame(ID, DATA, COL01_A,COL01_B,COL02_A,COL02_B,COL03_A,COL03_B)

if there are several "03" values the COL_X should have ""

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213

2 Answers2

3

We can find out A and B cols using grep, then use max.col to find out row indices of values in A_cols having "03" as value then subset the corresponding value from B_cols.

A_cols <- grep("_A$", names(df))
B_cols <- grep("_B$", names(df))
df$COL_X <- df[B_cols][cbind(1:nrow(df), max.col(df[A_cols] == "03"))]

df

#  ID DATA COL01_A  COL01_B COL02_A  COL02_B COL03_A  COL03_B    COL_X
#1  1  xxx      01 19990101      03 20000101      FF          20000101
#2  2  yyy      03 20170810      FA 20120303                  20170810
#3  3  zzz      GG 19940508      DD 20000101      03 20060808 20060808
#4  4  xyz      03 20180101      09 20000101                  20180101
#5  5  zxy      GF 20171212      03 19990101      02 20190101 19990101

As updated in the comment if there are more than 1 value of "03" in a particular row then we want an empty string as output. We can add an additional line for that condition after the above and it should work.

df$COL_X <- ifelse(rowSums(df[A_cols] == "03") > 1, "", df$COL_X)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

An idea is to use split.default and split the data frame based on its column names and then coalesce, i.e.

l1 <- lapply(split.default(df[-c(1, 2)], sub('_.*', '', names(df[-c(1, 2)]))), function(i)
                                               ifelse(i[[1]] == '03', i[[2]][i[[1]] == '03'], NA)) 

Reduce(dplyr::coalesce, l1)

#[1] "20000101" "20180101" "20060808" "20180101" "20000101"

If you don't want to call another library just for one function you can follow this answer to get,

Reduce(function(x, y) {
     i <- which(is.na(x))
     x[i] <- y[i]
     x
 }, l1)
#[1] "20000101" "20180101" "20060808" "20180101" "20000101"

Stealing @RonakShah's handle for multiple 03 values,

A_cols <- grep("_A$", names(df))
df$COL_X <- ifelse(rowSums(df[A_cols] == "03") > 1, "", df$COL_X)
Sotos
  • 51,121
  • 6
  • 32
  • 66