I am new to R. I'm trying to write a function that will perform calculations on a column vector by groups and bind the results to the data table as a new column.

I've written a simple version of what I will eventually implement, testing just one column of values. (Eventually there will be several columns of values to evaluate, likely in a loop inside the function.) Once all the calculations are complete across those columns, I will reduce the data to the distinct by-groups.

Because I'd like to retain all rows until the calculations across all columns are complete, I do not want to use the function aggregate. And, because I will be working with millions of records, I would like to work with data.table functions.

The code I've written to evaluate one column works outside the function but not inside it. When I run the function, I get this error:

Error in eval(bysub, xss, parent.frame()) : object 'id' not found

The traceback points to the same place as the error message:

5. eval(bysub, xss, parent.frame())
4. eval(bysub, xss, parent.frame())

I have read several other entries for this error but I was not able to see how they applied to my problem. Could you please help this novice?

-- Here is a simple data sample. "phase" refers to a time period, "type" is a string variable naming the type of test, "ind_abc" and "ind_def" flag the presence of the abc or def string in "type", and "value" holds the result of the abc or def test.

tests <- c("abc", "abc", "abc", "def", "def", "def", "abc", "abc", "abc", "abc", "def","abc", "abc","def", "abc", "abc", "abc", "abc")
vec1 <- c(101, 101, 101, 101, 102, 102, 102, 102, 102, 102, 103, 103, 103, 103, 104, 104, 104, 104)
vec2 <- c(0,0,0,1,0,0,1,1,0,0,1,1,0,0,0,1,1)
vec3 <- c(1,1,1,0,0,0,1,1,1,1,0,1,1,0,1,1,1,1)
vec4 <- c(0,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0)
vec5 <- runif(18, min=3, max = 20)
mydata <-data.frame(cbind(vec1, vec2, tests, vec3, vec4, vec5))
colnames(x=mydata) <- c("id", "phase", "type", "ind_abc", "ind_def", "value")
mydataDT <- setDT(mydata)

-- My function using data.table syntax

critical_values <- function(dt, categ, result, id_var, time_var) {
  indicator <- paste("ind_", "categ", sep="")
  new <- paste("critical_", "categ", sep="") 
  paste("dt", "2", sep="") <- dt[indicator ==1, new:= max(result), by=.(id_var, time_var)]
  return(paste("dt", 2, sep=""))
}
critical_values(mydataDT, type, value, id, phase)

-- Testing the inner function - this works

temp2dt <- mydataDT
new <- paste("critical_", "categ", sep="")
temp2dt[ind_abc ==1, new:= max(value), by=.(id, phase)]
sunflower
  • I see numerous issues with your code. So many, that I'm not sure what you intend to achieve. Fundamentally, you don't seem to understand the difference between a character string and a symbol. You can't assign to a character string with `<-`. It's also not necessary, you can just return `dt[indicator ==1, new:= max(result), by=.(id_var, time_var)]`. – Roland Jan 25 '23 at 06:02
  • You don't define `type`, `value`, `id`, and `phase` anywhere. – Roland Jan 25 '23 at 06:03
  • Never use `cbind` to construct a data.frame. It creates a matrix and a matrix can only hold one data type. It's also not necessary. You can just do `data.frame(vec1, vec2, ...)` or even better do `data.table(vec1, vec2, ...)` and then use `setnames`. – Roland Jan 25 '23 at 06:05
  • I am unsure I see the point of using data.table. If you want to use the expressions passed to a function in the context of a data.table then you may find this useful: https://stackoverflow.com/questions/62040136/pass-expressions-to-function-to-evaluate-within-data-table-to-allow-for-internal. It's not complex but also not straightforward IMO. But if all you want is the max of a column called result/value by groups for a filtered dataset then consider subsetting the data.frame by the variables and then taking the max of result/value. IMO the code will be simpler to write using just base R. – Anirban Mukherjee Jan 25 '23 at 06:40
  • Also keep in mind that it makes no sense to assign the output inside the function. The function returns the data, and if you want to assign that to an object you do `dt2 <- critical_values()`. **However**, keep in mind that here your `data.table` is updated by reference, so there is no need to assign the result to `dt2`: it would just be another name for `mydataDT`, not a separate copy. You can easily see that when you run your "inner function": `identical(temp2dt, mydataDT)` will return `TRUE`. – Merijn van Tilborg Jan 25 '23 at 09:48
  • The expectation to be able to call `critical_values(mydataDT, type, value, id, phase)` relies wholly on [non-standard evaluation](http://adv-r.had.co.nz/Computing-on-the-language.html). Read that chapter and think long and hard about whether you _need_ to incorporate that complexity (and risk, since you're not proficient with it yet) into your code. At your level of R proficiency (not a dig), I would strongly recommend simplicity and readability of code over fancy-looking complex easy-to-go-oops code. No offense: I've been using R for 10 years, and NSE can still easily cause problems for me. – r2evans Jan 25 '23 at 13:26
  • Thank you all for your advice. Roland, you've pointed out some basic errors, thanks. Anirban, I'm using data.table because my actual data files have 3-12 million records each. Thank you for the reference to another page. r2evans, I appreciate you pointing out that I created something that would have relied on non-standard evaluation. As you surmised, this was not my intent! I certainly do not yet feel comfortable with R, nor am I ready to employ these tricks - not yet. Merijn, I appreciate your tip on output. I had read in a guide to R that assigning the output was good practice. – sunflower Jan 26 '23 at 17:48
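To make the `cbind` point from the comments concrete, here is a minimal sketch. The vectors `ids`, `tests`, and `vals` are made up for illustration and are not the question's data. It suggests why building the table through `cbind` loses the numeric types, while `data.table()` followed by `setnames()` keeps them.

```r
library(data.table)

# Hypothetical toy vectors, not the question's data
ids   <- c(101, 101, 102)
tests <- c("abc", "def", "abc")
vals  <- c(3.5, 7.2, 12.1)

# cbind() builds a single-type matrix, so every column ends up as text
# (character or factor, depending on the R version)
via_cbind <- data.frame(cbind(ids, tests, vals))
sapply(via_cbind, class)

# data.table() keeps each column's own type; setnames() renames by reference
direct <- data.table(ids, tests, vals)
setnames(direct, c("id", "type", "value"))
sapply(direct, class)
```

With the numeric types preserved, a call such as `max(value)` compares numbers rather than strings.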

1 Answer

If you really want to do it with a function, it could work like this. There is no need to assign the result to another object, because `mydata` itself is updated by reference. If you want `mydata` to stay unaltered, make a copy first (with `copy()`) and pass that copy to the function; then only the copy is updated by reference.

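# Build the column names from strings; `:=` adds the group-wise max by reference
critical_values <- function(dt, categ, result, id_var, time_var) {
  indicator <- paste0("ind_", categ)       # e.g. "ind_abc"
  new_col   <- paste0("critical_", categ)  # e.g. "critical_abc"
  dt[get(indicator) == 1, (new_col) := max(get(result)), by = c(id_var, time_var)]
}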

critical_values(mydata, "abc", "value", "id", "phase") # adds "critical_abc" column by reference
critical_values(mydata, "def", "value", "id", "phase") # adds "critical_def" column by reference
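A minimal sketch of the copy-first approach, under stated assumptions: the table `toy` and its values are made up for illustration, and the function is a variant of the one above (it passes the grouping columns to `by` as a character vector), repeated so the block runs on its own. `copy()` is data.table's deep-copy function.

```r
library(data.table)

# Variant of the function above, repeated so this sketch is self-contained
critical_values <- function(dt, categ, result, id_var, time_var) {
  dt[get(paste0("ind_", categ)) == 1,
     (paste0("critical_", categ)) := max(get(result)),
     by = c(id_var, time_var)]
}

# Hypothetical toy data
toy <- data.table(
  id      = c(101, 101, 101, 102),
  phase   = c(0, 0, 1, 0),
  ind_abc = c(1, 1, 0, 1),
  value   = c(5, 9, 4, 7)
)

work <- copy(toy)  # deep copy; `work <- toy` would only be a second name for the same table
critical_values(work, "abc", "value", "id", "phase")

"critical_abc" %in% names(work)  # TRUE: the copy gained the new column
"critical_abc" %in% names(toy)   # FALSE: the original is unchanged
```

Rows where `ind_abc` is not 1 get `NA` in the new column, so all rows are retained until the final reduction to one row per group.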
Merijn van Tilborg
  • Thank you Merijn! Your suggestions, and the comments above (see my responses there), motivated me to look harder for a good beginner's guide to data.tables. Now that I've found and worked through that guide, I will take a simpler approach that harnesses the tools provided by data.tables. Here is a link to that beginners' guide by Selva Prabhakaran: https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/ – sunflower Jan 26 '23 at 17:50