Here is my dataset:

mydata<-data.frame(
  id=1:20,
  sex=sample(c(rep("M",6),rep("F",14))),
  Age=round(rnorm(20, 30,2)),
  Weight=round(rnorm(20, 65,5),2)
)

I want my function to let me specify both the variable to filter on and the criterion, i.e. the operator (==, >, <=, ...) and the value (M, 65, ...).

This is the function I am trying to create. I know in advance that it won't work; it is only meant to give an idea of what I want.

If I don't supply the selection variable, operator and value, my function must return the original data; otherwise it must return the filtered data.

    my_func<-function(select_var, select_crit){
      
      mydata<-mydata<-if(is.null(select_var)&is.null(select_crit)){mydata}else{
        mydata[ which(mydata[select_var]select_crit), ]
      }
      return(mydata)
    }

For example, I want to be able to select all the males with my function like this:

my_func(select_var="sex",select_crit="M")

And all the individuals older than 30 like this:

my_func(select_var="Age",select_crit=">30")

Or to select with the %in% operator:

my_func(select_var="Age",select_crit=%in%c(30:40))

Seydou GORO
  • outside a function it should be `mydata[ which(mydata["age"]%in%c(30:40)), ]` but I've just realised that even outside the function it doesn't work. The goal is to select a range of age – Seydou GORO Oct 27 '22 at 15:29
  • I think it would be far simpler if you used three arguments: `function(sel_var, sel_fun, sel_val)`, where one could eventually do something like `do.call(sel_fun, list(mydata[[sel_var]], sel_val))`. FYI, `mydata[select_var]` should really be `mydata[[select_var]]` in your code. – r2evans Oct 27 '22 at 15:41
  • It seems to me that you are hardcoding the data inside the function, which is not quite right. You should consider passing it as an argument. If you do so, the `subset` function will be your friend: you do not need to write another function to do exactly what `subset` does. – Onyambu Oct 27 '22 at 16:01
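
To illustrate the remark about `[[` in the comments, here is a minimal sketch of why single-bracket indexing does not work with `%in%`. The toy data frame `df` is hypothetical (fixed values rather than the random data above), and `in_range` is just an illustrative name.

```r
# Hypothetical toy data; the names df and in_range are for illustration only.
df <- data.frame(id = 1:5, Age = c(28, 30, 35, 41, 29))

# Single brackets return a one-column data frame (a list of length 1),
# so %in% yields a single logical value instead of one per row.
length(df["Age"] %in% 30:40)

# Double brackets extract the column as a plain vector,
# so %in% yields one logical value per row.
in_range <- df[["Age"]] %in% 30:40
df[in_range, ]
```

Note also that column names are case-sensitive, so the column has to be referred to as "Age", not "age".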

2 Answers

You have to add a data argument to your function and combine eval, parse and paste0 to build the filter (row selection) criterion. This approach will help:

my_func <- function(data, select_var=NULL, select_crit=NULL){
  
  if(is.null(select_var) & is.null(select_crit)){
    output <- data
  } else {
    output <- data[eval(parse(text=paste0("data", "$",select_var, select_crit))), select_var, drop=FALSE]
  }
  
  return(output)
}

Examples:

> my_func(mydata, select_var="Age", select_crit=">30")
   Age
1   32
5   32
7   33
8   31
9   33
13  31
16  33
18  32
19  32
> my_func(mydata, select_var="Age",select_crit="%in%c(30:40)")
   Age
1   32
2   30
5   32
7   33
8   31
9   33
11  30
13  31
14  30
16  33
17  30
18  32
19  32

Calling my_func(data) with select_var and select_crit left at their default NULL returns your original dataset.
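
To make the mechanism concrete, here is a small sketch of the text that paste0 builds before parse and eval run it. The toy data frame `data` and the names `filter_text` and `keep` are assumptions for illustration, not part of the function above.

```r
# Hypothetical toy data for illustration.
data <- data.frame(id = 1:4, Age = c(29, 31, 33, 30))

# paste0 glues the pieces into the text of an R expression.
filter_text <- paste0("data", "$", "Age", ">30")
filter_text

# parse turns the text into an expression; eval runs it,
# giving one logical value per row.
keep <- eval(parse(text = filter_text))

# Indexing rows only (empty column slot) keeps every column.
data[keep, ]
```

Indexing with `data[keep, ]` keeps all columns, whereas `data[..., select_var, drop=FALSE]` in the function above returns only the filtered column. Because the criterion is evaluated as code, this pattern assumes select_crit comes from a trusted source.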

Jilber Urbina
  • Why write a function like this when `subset` already exists? For example, `subset(mydata, Age > 30)` works great. Don't you think? – Onyambu Oct 27 '22 at 16:00
  • @onyambu, I thought the same ... I think that question should be for the OP. – r2evans Oct 27 '22 at 16:01
  • Your point is well taken. But filtering the database is just one part of a longer process that needs to be done in the function. – Seydou GORO Oct 27 '22 at 17:44
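
For reference, a minimal sketch of the `subset` calls mentioned in these comments, on a hypothetical toy data frame `df` (fixed values, chosen only for illustration).

```r
# Hypothetical toy data for illustration.
df <- data.frame(id = 1:4, sex = c("M", "F", "F", "M"), Age = c(29, 31, 33, 30))

# subset evaluates the condition inside the data frame,
# so columns can be named directly.
older <- subset(df, Age > 30)
males <- subset(df, sex == "M")
```

This is convenient interactively, but the condition is written as code rather than passed as values, which is why the answers here build the filter from arguments instead.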

Three suggestions:

  1. Make the data an argument of the function and not accessed via scope-breach. This helps with reproducibility, troubleshooting, maintenance, etc., and as a side effect allows your function to operate in %>%- and |>-pipes (if so desired).

  2. Use &&; "never" use a single & in if-conditionals unless it is wrapped in an aggregating function such as any or all. The differences between & and && are more than just vectorized-vs-nonvectorized, see Boolean operators && and ||. Further, I think you mean to use "OR" here instead of "AND", since if either one of them is NULL then you should not be attempting to use the operator.

  3. Change from 2-args to 3-args, separating the operator from the second operand.
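
A brief sketch of the second suggestion, assuming a hypothetical variable `x` that stands in for a missing argument:

```r
x <- NULL  # hypothetical stand-in for an argument that was not supplied

# || short-circuits: the right-hand side is not evaluated
# once the left-hand side is already TRUE.
if (is.null(x) || x > 0) message("x is missing, nothing to filter")

# The single | evaluates both sides element-wise; NULL > 0 is logical(0),
# so the result has length zero and could not be used in if().
length(is.null(x) | x > 0)

# & is element-wise and returns one value per element,
# which suits row filters rather than if-conditions.
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
```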

Try this:

fun <- function(mydata, sel_var, sel_op, sel_val = NULL) {
  if (is.null(sel_var) || is.null(sel_op)) return(mydata)
  if (is.character(sel_op)) sel_op <- match.fun(sel_op)
  mydata[do.call(sel_op, c(list(mydata[[sel_var]]), if (!is.null(sel_val)) list(sel_val))),]
}

fun(mtcars, "cyl", "<", 5)
fun(mtcars, "cyl", "%in%", c(4, 8))
fun(mtcars, "vs", "!")

Notes:

  • sel_op can be a function or a string representing one. This gives a lot more flexibility, such as the ability to do

    fun(mtcars, "vs", Negate("!"))
    fun(mtcars, "vs", function(z) !!z)
    
  • the c(list(..), if (!is.null(sel_val)) list(..)) is meant to allow sel_val to be empty/NULL for unary functions.
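
Applied to the dataset from the question, the calls might look like the sketch below. The function is repeated so the block is self-contained; `set.seed(42)` is an assumption added only to make the random data reproducible, and the result names (`males`, `older`, `ranged`, `same`) are illustrative.

```r
# The function from above, repeated so the example is self-contained.
fun <- function(mydata, sel_var, sel_op, sel_val = NULL) {
  if (is.null(sel_var) || is.null(sel_op)) return(mydata)
  if (is.character(sel_op)) sel_op <- match.fun(sel_op)
  mydata[do.call(sel_op, c(list(mydata[[sel_var]]), if (!is.null(sel_val)) list(sel_val))), ]
}

set.seed(42)  # assumed seed, only so the random data is reproducible
mydata <- data.frame(
  id = 1:20,
  sex = sample(c(rep("M", 6), rep("F", 14))),
  Age = round(rnorm(20, 30, 2)),
  Weight = round(rnorm(20, 65, 5), 2)
)

males  <- fun(mydata, "sex", "==", "M")      # all males
older  <- fun(mydata, "Age", ">", 30)        # Age greater than 30
ranged <- fun(mydata, "Age", "%in%", 30:40)  # Age from 30 to 40 inclusive
same   <- fun(mydata, NULL, NULL)            # no filter: original data
```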

r2evans