2

What's the best way to build and evaluate a table of various conditions for evaluation against a dataset?

For example, let's say I want to identify invalid rows in a dataset that looks like:

library("data.table")

# notional example -- some observations are wrong, some missing
set.seed(1)
n = 100 # Number of customers.
        # Also included are "non-customers" where values except cust_id should be NA.
cust <- data.table( cust_id = sample.int(n+1),
                    first_purch_dt =
                      c(sample(as.Date(c(1:n, NA), origin="2000-01-01"), n), NA),
                    last_purch_dt = 
                      c(sample(as.Date(c(1:n, NA), origin="2000-04-01"), n), NA),
                    largest_purch_amt = 
                      c(sample(c(50:100, NA), n, replace=TRUE), NA),
                    last_purch_amt = 
                      c(sample(c(1:65,NA), n, replace=TRUE), NA)
                    )
setkey(cust, cust_id)

The errors I want to check for each observation are any occurrences of last_purch_dt < first_purch_dt or largest_purch_amt < last_purch_amt, as well as any missing values other than all or none. (All missing would be OK for a non-purchaser.)

Rather than a series of hard-coded expressions (which is getting really long and difficult to document/maintain), I just want to store the expressions as strings in a table of conditions:

checks <- data.table( cond_id = c(1L:3L),
                      cond_txt = c("last_purch_dt < first_purch_dt",
                                  "largest_purch_amt < last_purch_amt",
                                  paste("( is.na(first_purch_dt) + is.na(last_purch_dt) +",
                                          "is.na(largest_purch_amt) +",
                                          "is.na(last_purch_amt) ) %% 4 != 0") # hacky XOR  
                                  ),
                      cond_msg = c("Error: last purchase prior to first purchase.",
                                   "Error: largest purchase less than last purchase.",
                                   "Error: partial transaction record.")
                     )

I know that I can loop through rows of conditions and rbindlist the resulting subsets, for example:

err_obs <- 
  rbindlist(
    lapply(1:nrow(checks), function(i) {
      err_set <- cust[eval( parse(text= checks[i,cond_txt]) ) ,  ]
      cbind(err_set, 
            checks[i, .(err_id = rep.int(cond_id, times = nrow(err_set)),
                        err_msg = rep.int(cond_msg, times = nrow(err_set))
                        )]
            )                
    } )
  )
print(err_obs) # returns desired result

which seems to work and to handle NAs correctly in the evaluations.

When I say "what's the best way", I'm asking:

  • Is this the best approach, or is there a more efficient or idiomatic alternative to rbindlist(lapply(...)?
  • Are there pitfalls in my current approach?
  • Could this be written as a merge or join, something like cust inner join checks on eval(checks.condition(cust.values)) == TRUE?
Community
  • 1
  • 1
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134

1 Answers1

3

This is how I'd do it:

checks[, cust[eval(parse(text = cond_txt), .SD)][, err_msg := cond_msg], by = cond_id]

The only non-trivial part in the above is the presence of .SD - see this question for an explanation.

Community
  • 1
  • 1
eddi
  • 49,088
  • 6
  • 104
  • 155
  • This works, thanks very much. It's just going to take me a few minutes to understand why. I might add a short explanation if that's OK with you. – C8H10N4O2 Mar 17 '16 at 16:59
  • sure, feel free to – eddi Mar 17 '16 at 17:02
  • If you don't mind a quick clarification, why is `[, err_msg := cond_msg]` necessary? In other words, why does taking that step out drop `cond_msg` entirely instead of just keeping it without renaming it? Is it because the environment at that point is still `cust` and we have to explicitly return it from `checks`? – C8H10N4O2 Mar 17 '16 at 17:35
  • 1
    Recall that `checks[, j, by = cond_id]` simply evaluates `j` for each `cond_id`. And `j=cust[...]` evaluates to a `data.table` that doesn't have that error/cond message column, so you need to add it. The above is one way to do it. Another would've been to do `by = .(cond_id, cond_msg)`. – eddi Mar 17 '16 at 17:40
  • Thanks, this answer really demonstrates the power of `data.table`. I'm not at all sure I would have been able to do this with `dplyr::inner_join` or other approach. I think I would have had to do a cartesian product and then filter on `eval()`. – C8H10N4O2 Mar 17 '16 at 17:46