I see that if you have a data frame or a list, you can use lapply with a predicate:

lapply(myDataframe, subset, x3 < 10)

But how can I store predicates in variables, so that I can programmatically do:

myPredicate = x3 < 10  # where x3 references a column name of whatever dataframe is applied later
lapply(myDataframe, subset, myPredicate)
ppp
  • `myPredicate <- myDataframe$x3 < 10` will work. It doesn't store the predicate, but the logical vector produced by the predicate. – Allan Cameron Sep 11 '20 at 20:34
  • What is your goal here? For a start, have a look at `help(lapply)` which arguments it takes and have a look at an [R tutorial](https://www.codecademy.com/learn/learn-r) – starja Sep 11 '20 at 20:37
  • Possibly `myPredicate = quote(x3 < 10)` will get you started but you can't pass that directly to `subset`, you would need to inject that into the call. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Sep 11 '20 at 20:39
  • My goal is to store lists of different predicates (conditionals) that can be ANDed together into a bigger predicate, then applied. For example, block my dataframe by `gender == "M"`, then block by age in 20:30, but then easily do the same analysis by switching lists to `gender == "F"`, then block by age in 20:30, or whatever other combos of conditionals I want to try blocking by – ppp Sep 11 '20 at 20:48

1 Answer

Suppose you have a data frame like this:

df <- data.frame(x = 1:10,
                 y = 96:105, 
                 z = rep(c("A", "B", "C"), length.out = 10))

And you want to store a list of named predicates you can apply. You can do this simply by storing the logical vectors produced by the predicates in a list:

p  <- list(x_is_even = df$x %% 2 == 0,
           y_gt_100  = df$y > 100,
           z_is_A    = df$z == "A")

These can be combined like any other logical conditions:

subset(df, p$x_is_even & p$z_is_A & p$y_gt_100)
#>     x   y z
#> 10 10 105 A
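
As a small sketch of how such a list could be switched between combinations, the stored logical vectors can be selected by name and ANDed with `Reduce`. The names `chosen`, `keep` and `result` are only illustrative, and the vectors remain tied to `df`: they would have to be recomputed for any other data frame.

```r
df <- data.frame(x = 1:10,
                 y = 96:105,
                 z = rep(c("A", "B", "C"), length.out = 10))

p <- list(x_is_even = df$x %% 2 == 0,
          y_gt_100  = df$y > 100,
          z_is_A    = df$z == "A")

# Pick any subset of the stored conditions by name (illustrative choice)
chosen <- c("x_is_even", "y_gt_100")

# AND the selected logical vectors together, element by element
keep <- Reduce(`&`, p[chosen])

# Keep only the rows where every selected condition is TRUE
result <- df[keep, ]
result
```

Changing `chosen` to a different set of names is enough to try another combination of conditions on the same data frame.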

If you want to do it in such a way that you can pass "bare" predicates (i.e. those that name columns without naming a data frame) then that is far harder.

The reason is that you would have to store the predicates as language objects. When you come to use these, it doesn't make sense to combine them with logical operators like & or |, because these operations are not defined for language objects.
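
To make that point concrete, here is a minimal sketch of the mechanism involved; it is not one of the helper functions of this answer, and the names `a`, `b`, `direct`, `both` and `rows` are made up for illustration. Applying `&` to two quoted expressions fails, whereas building a new call whose function is `&` gives a single expression that can be evaluated later.

```r
# Two quoted predicates: unevaluated calls, not logical vectors
a <- quote(x %% 2 == 0)
b <- quote(y > 100)

# `&` is not defined for language objects, so this raises an error,
# which is caught here and replaced by a marker string
direct <- tryCatch(a & b, error = function(e) "error")

# Instead, construct a new call that applies `&` to the two expressions
both <- call("&", a, b)

# The combined call is only evaluated when a data frame is supplied
df <- data.frame(x = 1:10, y = 96:105)
rows <- eval(both, df)
df[rows, ]
```

The combination therefore happens at the level of the expression tree, not at the level of values.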

It is possible, but it requires a bit of computing on the language. I realise you were hoping for something simple, but base R has no built-in shortcut for this. I will show how it could be achieved and you can decide whether it is worth the trouble.


First you need a way of creating a list of quoted predicates:

make_subsets <- function(...)
{
  as.list(match.call()[-1])
}

So you can do

p  <- make_subsets(x_is_even = x %% 2 == 0,
                   y_gt_100  = y > 100,
                   z_is_A    = z == "A")
p
#> $x_is_even
#> x%%2 == 0
#> 
#> $y_gt_100
#> y > 100
#> 
#> $z_is_A
#> z == "A"
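
As a quick check of what this list contains, the sketch below evaluates stored predicates against two different data frames. The data frames `df_a` and `df_b` are invented for this illustration; the point is only that each element is an unevaluated call that refers to column names, not to a particular data frame.

```r
make_subsets <- function(...)
{
  as.list(match.call()[-1])
}

p <- make_subsets(x_is_even = x %% 2 == 0,
                  y_gt_100  = y > 100)

# Two unrelated data frames that happen to share column names (hypothetical)
df_a <- data.frame(x = 1:4, y = c(99, 100, 101, 102))
df_b <- data.frame(x = c(10, 11), y = c(50, 150))

# Each stored element is a call object
class(p$x_is_even)

# The same stored predicate can be evaluated against either data frame
eval(p$x_is_even, df_a)
eval(p$y_gt_100, df_b)
```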

Now you need to be able to build these together arbitrarily into a call:

parse_subsets <- function(expr)
{
  expr <- as.list(match.call()$expr)
  this_call <- as.character(expr[[1]])
  if(this_call == "&" | this_call == "|")
  {
    l <- unlist(lapply(expr[-1], function(x) {
                 eval(as.call(list(parse_subsets, x)))}))
    as.call(append(l, as.symbol(this_call), 0))
  }
  else return(eval(as.call(expr)))
}

And now you need a function that can take your predicates and filter data:

subset2 <- function(data, subsets)
{
  ss <- match.call()$subsets
  ss <- eval(as.call(list(parse_subsets, ss)))
  eval(as.call(list(subset, data, ss)))
}

So now you can do

subset2(df, p$x_is_even & p$y_gt_100 & p$z_is_A)
#>     x   y z
#> 10 10 105 A

Note, however, that if you want to use lapply with this, you will need to do it the long way, with an anonymous function:

lapply(list(df, df), function(x) subset2(x, p$x_is_even & p$y_gt_100 & p$z_is_A))
#> [[1]]
#>     x   y z
#> 10 10 105 A
#>
#> [[2]]
#>     x   y z
#> 10 10 105 A
Allan Cameron
  • Thank you. This is interesting, but it refers to “df” — any way to have the predicate not tied to any df in particular, but only by column name? – ppp Sep 12 '20 at 04:04
  • @ppp that's far harder and more complex. I have shown an example of one approach in my edited answer. – Allan Cameron Sep 12 '20 at 08:32
  • Thanks for the kind solution. I have selected it as the solution, because I think this is the best that R seems to be able to do. It really is a shame that R has not made modifications into its language to allow for a much more straightforward predicate object. – ppp Sep 13 '20 at 04:51