This is a follow-up question to Efficient way to subset data.table based on value in any of selected columns.
sample data
I have got a data.table with 5 p-columns, indicating a type (type1, type2 or NA).
I also have got 5 r-columns, indicating a score (1-10, or NA).
library(data.table)
set.seed(123)

# possible values for the p-columns: a type, or missing
v <- c( "type1", "type2", NA_character_ )
# possible values for the r-columns: a score 1-10, or missing
v2 <- c( 1:10, rep( NA_integer_, 10 ) )
DT <- data.table( id = 1:100,
                  p1 = sample( v,  100, replace = TRUE ),
                  p2 = sample( v,  100, replace = TRUE ),
                  p3 = sample( v,  100, replace = TRUE ),
                  p4 = sample( v,  100, replace = TRUE ),
                  p5 = sample( v,  100, replace = TRUE ),
                  r1 = sample( v2, 100, replace = TRUE ),
                  r2 = sample( v2, 100, replace = TRUE ),
                  r3 = sample( v2, 100, replace = TRUE ),
                  r4 = sample( v2, 100, replace = TRUE ),
                  r5 = sample( v2, 100, replace = TRUE ) )
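For reference, the first few rows can be inspected with the call below (I'm not pasting the output here, since it is just randomly sampled data):

head( DT, 5 )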
desired output
I want to create two new columns (one for type1 and one for type2) in which I check rowwise whether type1/type2 occurs in one or more of the p-columns, and whether at least one of the corresponding r-columns (p1 -> check r1, p2 -> check r2, etc.) contains a value.
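For example (a made-up row, purely to illustrate the rule):

# hypothetical row:  p1 = "type1", r1 = NA,   p2 = "type1", r2 = 7,   p3-p5 = NA, r3-r5 = NA
# -> type1_present = "yes"  (the pair p2/r2 combines type1 with a non-missing score; p1/r1 alone would not count)
# -> type2_present = "no"   (type2 does not occur in any p-column of this row)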
'manual' solution
This can be done as shown below, using a lot of AND and OR statements:
# work on a copy, so that := does not also modify DT by reference
manual_solution <- copy( DT )

manual_solution[ ( p1 == "type1" & !is.na( r1 ) ) |
                 ( p2 == "type1" & !is.na( r2 ) ) |
                 ( p3 == "type1" & !is.na( r3 ) ) |
                 ( p4 == "type1" & !is.na( r4 ) ) |
                 ( p5 == "type1" & !is.na( r5 ) ),
                 type1_present := "yes" ]

manual_solution[ ( p1 == "type2" & !is.na( r1 ) ) |
                 ( p2 == "type2" & !is.na( r2 ) ) |
                 ( p3 == "type2" & !is.na( r3 ) ) |
                 ( p4 == "type2" & !is.na( r4 ) ) |
                 ( p5 == "type2" & !is.na( r5 ) ),
                 type2_present := "yes" ]

manual_solution[ is.na( type1_present ), type1_present := "no" ]
manual_solution[ is.na( type2_present ), type2_present := "no" ]
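As a quick sanity check I look at how the two flags are distributed (I'm not posting the counts, since they depend on the seed above):

manual_solution[ , .N, by = .( type1_present, type2_present ) ]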
Question: automation for dozens of p- and r-columns
But looking at the answers to Efficient way to subset data.table based on value in any of selected columns, I'm convinced there are better ways, especially since my production data contains a lot more p- and r-columns.
So I started playing around, but got stuck pretty fast...
# vectors with the names of the p-columns and the r-columns
p_cols <- grep( "^p", names( DT ), value = TRUE )
r_cols <- grep( "^r", names( DT ), value = TRUE )

# logical tables: TRUE where a value is present (not NA)
logi_p <- as.data.table( sapply( DT[ , ..p_cols ], function(x) !is.na(x) ) )
logi_r <- as.data.table( sapply( DT[ , ..r_cols ], function(x) !is.na(x) ) )

# which non-NA p-values also have a non-NA r-value?
logi <- as.data.table( sapply( logi_p * logi_r, as.logical ) )
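If I did this right, logi is now a 100 x 5 logical table marking which p/r pairs are both non-NA, but it no longer tells me which type a TRUE came from. Inspecting it (the actual TRUE/FALSE pattern depends on the seed):

dim( logi )
head( logi, 3 )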
And now I haven't got any inspiration left on how to proceed.
Any ideas/suggestions?
bonus
My main focus is on the question above. But my production data also contains a lot more types (in the p-columns)... So a solution that adds a column per type (or that can dcast to this result) would 'kill two birds with one stone'.
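To make that concrete: the kind of result I have in mind would look roughly like the sketch below (purely illustrative, the values are made up; in production there are more types than just type1 and type2):

#     id type1_present type2_present type3_present ...
#  1:  1           yes            no            no
#  2:  2            no           yes            no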