0

i am writing a package for folks who want to predict values base on AADTMAJ, L, and Base_Past. The function provides two options 1) allow the user to enter there own regression coefficients, or 2) provide the user with pre defined coefficients. However, i have not been able to use return() correctly .

input data

data=data.frame(Base_Past=c("HSM-RUR2U-KABCO",
                            "HSM-RUR2U-KABCO",
                            "HSM-RUR4-KABC",
                            "HSM-RUR4-KABCO"),
                AADTMAJ=c(100,100,100,100),
                L=c(1,1,1,1)
)

input custom regression coefficients

custom.spf=data.frame(Base_Past=c("HSM-RUR2U-KABCO","HSM-RUR2U-KABC"), a=c(-0.312,-0.19))

define helper function

helper_function = function (data, Base_Past=FALSE, override=custom.spf){
  if (is.data.frame(override)){
    for (j in 1:nrow(override)){
      for (i in 1:nrow(data)){
        if(data[i, ]$Base_Past==override[j, ]$Base_Past){
          output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(override[j, ]$a))
          return(output)} else{
            if(data[i, ]$Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
            if(data[i, ]$Base_Past=="HSM-RUR4-KABC") {a=-0.143}
            output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
            return(output)
          } 
      }
    }
  }
  
  else if (!is.data.frame(override)){
    if(Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
    if(Base_Past=="HSM-RUR4-KABC") {a=-0.143}
    output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
    return(output)
  }
}

run

(data %>% dplyr::rowwise() %>% dplyr::mutate(predicted_value = helper_function(data = data, override=custom.spf)))[,4]



Output

# A tibble: 4 x 1
# Rowwise: 
  predicted_value
            <dbl>
1          0.0267
2          0.0267
3          0.0267
4          0.0267

alternative

data %>% dplyr::mutate(predicted_value=dplyr::case_when(Base_Past =="HSM-RUR4-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.101),
                                                        Base_Past=="HSM-RUR4-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.143),
                                                        Base_Past=="HSM-RUR2U-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.312),
                                                        Base_Past=="HSM-RUR2U-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.190),
                                                        TRUE ~ NA_real_))

desired output

        Base_Past AADTMAJ L predicted_value
1 HSM-RUR2U-KABCO     100 1      0.02671733
2 HSM-RUR2U-KABCO     100 1      0.02671733
3   HSM-RUR4-KABC     100 1      0.03163652
4  HSM-RUR4-KABCO     100 1      0.03299356
cn838
  • 69
  • 8
  • (1) Typo: `override` vs `overide`. (2) You define your function to accept *and require* a `data` argument but your calls omit this. It seems as if your function is intended to operate outside of a `mutate` environment. (3) Your function references `custom.spf` but it is never defined (internally) nor passed to it. Since you have it defined in your global environment, your function is ill-defined and will be very difficult to troubleshoot or pass to others. – r2evans Mar 11 '22 at 02:08
  • FYI, `is.data.frame(.) == TRUE` is extraneous, just `is.data.frame(.)` returns true or false; similarly, replace `is.data.frame(.) == FALSE` with `!is.data.frame(.)`. – r2evans Mar 11 '22 at 02:10
  • (4) It appears your function either requires `data=` or `Base_Past`, yet in your two pipes you pass neither. (5) In the `else` clause inside your function, you define `a=` in two situations but there is no default, so you have a logical-pathway problem where if neither of the `Base_Past` values are found, `a` will not be defined. – r2evans Mar 11 '22 at 02:11
  • (6) It seems as if you might expect `Base_Past` (in the absence of `data=` being passed) would be a whole vector of data, in which case `if (Base_Past=="HSM-RUR4-KABC")` is a problem, since `if` requires length-1, and that is almost certainly longer than 1 (if/when you actually pass appropriate data in the function arguments). – r2evans Mar 11 '22 at 02:14
  • @r2evans comments addressed. please review – cn838 Mar 11 '22 at 05:58

1 Answers1

0

The function and your use of it have several problems. Notable on the list of problems since my first batch of comments:

  • You call it within a rowwise pipe but then pass data=data, which means that it is ignoring the data coming in the pipe and instead looking at the whole thing. You might instead use data=cur_data() (since it is inside of a mutate, this works, as cur_data() is defined by dplyr for situations something like this).

  • Your helper_function is ill-defined by assuming that custom.spf is defined and available. Having a function rely on the presence of external variables not explicitly passed to it makes it fragile and can be rather difficult to troubleshoot. If for instance custom.spf were not defined in the calling environment, then this function will fail with object 'custom.spf' not found. Instead, I think you could use:

    helper_function <- function(..., override=NA) {
      if (isTRUE(is.na(override)) && exists("custom.spf")) {
        message("found 'custom.spf', using it as 'override'")
        override <- custom.spf
      }
      ...
    }
    

    I'm not totally thrilled with this still, but at least it won't fail too quickly, and is verbose in what it is doing.

  • Using 1:nrow(.) can be a little risky if used programmatically. That is, if for some reason one of the inputs has 0 rows (perhaps custom.spf has nothing to override), then 1:nrow(.) should logically do nothing but instead will iterate twice over rows that do not exist. That is, if nrow(.) is 0, then note that 1:0 returns c(1, 0), which is clearly not "do nothing". Instead, use seq_len(nrow(.)), as seq_len(0) returns integer(0), which is what we would want.

  • There is no reason to use rowwise() here, and its use should be avoided whenever possible. (It does what it does very well, and when it is truly necessary, it works great. But the performance penalty for iterating one row at a time can be significant, especially for larger data.)

Some of what you are trying to do can be simplified by learning about merge/join methods. Two really good references for merge/join are: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.

Further, it seems as if a significant portion of your effort is to assign a reasonable value to a for your equation. Your inner code (looking for "-KABCO" and "-KABC") looks like it really should be yet another frame of default values.

Here's a suggested helper_function that changes things slightly. It takes as mandatory arguments Base_Past, AADTMAJ, and L, and then zero or more frames to merge/join in order to find an appropriate value for a in the equation.

helper_function <- function(Base_Past, AADTMAJ, L, ...) {
  stopifnot(
    length(Base_Past) == length(AADTMAJ),
    length(Base_Past) == length(L)
  )
  defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"), a = c(-0.101, -0.143))
  frames <- c(list(defaults), list(...))
  a <- rep(NA, length(Base_Past))
  tmpdat <- data.frame(row = seq_along(Base_Past), Base_Past = Base_Past, a = a)
  for (frame in frames) {
    tmpdat <- merge(tmpdat, frame, by = "Base_Past", suffixes = c("", ".y"),
                    all.x = TRUE, sort = FALSE)
    tmpdat$a <- ifelse(is.na(tmpdat$a), tmpdat$a.y, tmpdat$a)
    tmpdat$a.y <- NULL
  }
  tmpdat <- tmpdat[order(tmpdat$row),]
  AADTMAJ * L * 365 * 10^(-6) * exp(tmpdat$a)
}

The premise is that you looking for "default" values of a in your function is really the same as looking them up in your override variable. I could have given you the override= argument for a single lookup dictionary, but it is sometimes useful to have a "one or more" type of argument: perhaps you have more than one frame with other values for a, and you may want to use them all at once. This will work as you desired for a single, but if you have multiple, perhaps custom.spf and custom.spf, this would work (by adding all of them after the L argument when called).

I chose to keep the internals of the function simple base R for a few reasons, nothing that stands out as critical. The portion that could be dplyr-ized is within the for (frame in frames) loop.

data %>%
  mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf))
#         Base_Past AADTMAJ L          a
# 1 HSM-RUR2U-KABCO     100 1 0.02671733
# 2 HSM-RUR2U-KABCO     100 1 0.02671733
# 3   HSM-RUR4-KABC     100 1 0.03163652
# 4  HSM-RUR4-KABCO     100 1 0.03299356

The function should operate cleanly within grouping (group_by or rowwise) if you desire, but it is certainly not necessary to do what you asked originally.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • i need to rename parameter Base_Past to base_condition. however, i am receiving an error with 'by' – cn838 Mar 11 '22 at 18:09
  • My guess: change `by="Base_Past"` to `by="base_condition"`, and make sure the `defaults` frame has the correct name as well. But that's just a guess, since you didn't include the actual error. – r2evans Mar 11 '22 at 18:11
  • you are correct, the naming conventions used for 'Base_Past' column in 'custom.spf', 'data', and the 'helper_function' needed to be changed. – cn838 Mar 11 '22 at 18:22
  • Okay. Does that fix the error? – r2evans Mar 11 '22 at 18:30
  • yes. i will try variable masking – cn838 Mar 11 '22 at 19:14
  • will the helper_function operate if dots are passed to an if statement (i.e. is.data.frame(...) or is.data.frame(custom.spf))? It is important because the user may feel more comfortable defining their own regression coefficients. – cn838 Mar 11 '22 at 19:45
  • The potential mutability is vast. In my experience, in a situation like this (predefine coefficients for various "things") it can be good to pass a `data.frame`, or to pass arbitrary named-arguments such as `helper_function(..., "HSM-RUR4-KABCO"=0.3, "HSM-RUR33-KABCO"=0.212)`. There is utility in both, you need to decide which to implement (or both, but ... polymorphism begets ambiguity begets bugs and complications ... beware). – r2evans Mar 11 '22 at 20:26