What's the best way to build a table of conditions and evaluate them against a dataset?
For example, let's say I want to identify invalid rows in a dataset that looks like:
library("data.table")
# notional example -- some observations are wrong, some missing
set.seed(1)
n = 100 # Number of customers.
# Also included are "non-customers" where values except cust_id should be NA.
cust <- data.table( cust_id = sample.int(n+1),
                    first_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-01-01"), n), NA),
                    last_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-04-01"), n), NA),
                    largest_purch_amt =
                      c(sample(c(50:100, NA), n, replace=TRUE), NA),
                    last_purch_amt =
                      c(sample(c(1:65,NA), n, replace=TRUE), NA)
)
setkey(cust, cust_id)
The errors I want to check for in each observation are any occurrences of last_purch_dt < first_purch_dt or largest_purch_amt < last_purch_amt, as well as any missing values other than all or none. (All missing would be OK for a non-purchaser.)
Rather than a series of hard-coded expressions (which is getting really long and difficult to document/maintain), I just want to store the expressions as strings in a table of conditions:
checks <- data.table( cond_id = c(1L:3L),
                      cond_txt = c("last_purch_dt < first_purch_dt",
                                   "largest_purch_amt < last_purch_amt",
                                   paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                                         "is.na(largest_purch_amt) +",
                                         "is.na(last_purch_amt) ) %% 4 != 0") # hacky XOR
                      ),
                      cond_msg = c("Error: last purchase prior to first purchase.",
                                   "Error: largest purchase less than last purchase.",
                                   "Error: partial transaction record.")
)
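One risk with conditions stored as strings is that a misspelled column name is only discovered when the condition is evaluated, or not at all if the name happens to match a variable in the calling environment. A minimal sketch of a pre-check follows, assuming a hypothetical checks_demo table with a deliberate typo and a hypothetical known_cols vector standing in for names(cust). It uses base R's all.vars() on the parsed text to list the variables each condition references.

```r
library(data.table)

# Hypothetical stand-ins: the column names of the customer table, and two
# conditions, the second with a deliberate typo (last_purchase_amt).
known_cols <- c("cust_id", "first_purch_dt", "last_purch_dt",
                "largest_purch_amt", "last_purch_amt")
checks_demo <- data.table(
  cond_id  = 1:2,
  cond_txt = c("last_purch_dt < first_purch_dt",
               "largest_purch_amt < last_purchase_amt")
)

# For each condition, collect referenced variables that are not known columns.
checks_demo[, unknown := vapply(cond_txt, function(txt) {
  paste(setdiff(all.vars(parse(text = txt)), known_cols), collapse = ", ")
}, character(1), USE.NAMES = FALSE)]

# Conditions that reference something other than a known column.
bad_checks <- checks_demo[unknown != ""]
print(bad_checks)
```

Running a check like this before evaluating the conditions would also surface strings that fail to parse, since parse() raises an error on invalid syntax.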
I know that I can loop through the rows of checks and rbindlist the resulting subsets, for example:
err_obs <-
  rbindlist(
    lapply(seq_len(nrow(checks)), function(i) {
      # rows of cust for which condition i evaluates to TRUE
      err_set <- cust[eval( parse(text= checks[i,cond_txt]) ) , ]
      # tag each flagged row with the condition's id and message
      cbind(err_set,
            checks[i, .(err_id = rep.int(cond_id, times = nrow(err_set)),
                        err_msg = rep.int(cond_msg, times = nrow(err_set))
            )]
      )
    } )
  )
print(err_obs) # returns desired result
which seems to work and to handle NAs correctly in the evaluations.
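On the NA handling: data.table treats NA values in a logical i as FALSE, so rows where a comparison evaluates to NA are simply not selected. A minimal sketch with a hypothetical three-row table, contrasted with base data.frame subsetting, where an NA index yields an all-NA row:

```r
library(data.table)

# Hypothetical toy data: the middle value is missing.
dt <- data.table(x = c(1, NA, 3))

# data.table: NA in the logical i is treated as FALSE, so only x == 3 is kept.
dt_rows <- dt[x > 2]

# Base data.frame: the NA index produces an extra row filled with NA.
df <- as.data.frame(dt)
df_rows <- df[df$x > 2, , drop = FALSE]

print(nrow(dt_rows))
print(nrow(df_rows))
```

This is why the comparison conditions above skip rows with missing dates or amounts, leaving them to the partial-record condition.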
When I say "what's the best way", I'm asking:
- Is this the best approach, or is there a more efficient or idiomatic alternative to rbindlist(lapply(...))?
- Are there pitfalls in my current approach?
- Could this be written as a merge or join, something like cust inner join checks on eval(checks.condition(cust.values)) == TRUE?
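To make the join idea concrete, here is one possible sketch, not necessarily the most idiomatic answer. It first builds a long "hits" table of (cust_id, cond_id) pairs, then merges it with both tables. The cust_demo and checks_demo tables and their values are hypothetical stand-ins for cust and checks, and the conditions are evaluated with base eval() using the table as the evaluation environment rather than inside data.table's i.

```r
library(data.table)

# Hypothetical four-row stand-in for cust: row 2 violates both comparisons,
# row 3 is a partial record, row 4 is a non-purchaser (all NA).
cust_demo <- data.table(
  cust_id           = 1:4,
  first_purch_dt    = as.Date(c("2000-01-05", "2000-02-01", NA, NA)),
  last_purch_dt     = as.Date(c("2000-03-01", "2000-01-15", "2000-04-02", NA)),
  largest_purch_amt = c(80L, 60L, 70L, NA),
  last_purch_amt    = c(40L, 65L, 30L, NA)
)
checks_demo <- data.table(
  cond_id  = 1:3,
  cond_txt = c("last_purch_dt < first_purch_dt",
               "largest_purch_amt < last_purch_amt",
               paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                     "is.na(largest_purch_amt) +",
                     "is.na(last_purch_amt) ) %% 4 != 0")),
  cond_msg = c("Error: last purchase prior to first purchase.",
               "Error: largest purchase less than last purchase.",
               "Error: partial transaction record.")
)

# One (cust_id, cond_id) pair per violation; which() drops NA results.
hits <- rbindlist(lapply(seq_len(nrow(checks_demo)), function(i) {
  flag <- eval(parse(text = checks_demo$cond_txt[i]), envir = cust_demo)
  rows <- which(flag)
  data.table(cust_id = cust_demo$cust_id[rows],
             cond_id = rep(checks_demo$cond_id[i], length(rows)))
}))

# Attach customer columns and messages with ordinary equi-joins.
err_demo <- merge(merge(hits, cust_demo, by = "cust_id"),
                  checks_demo[, .(cond_id, cond_msg)], by = "cond_id")
print(err_demo)
```

The conditions still have to be evaluated one at a time; the joins only assemble the output, so this is closer to "evaluate, then join" than to a true join on an arbitrary predicate.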