
In the example code below I create a function, createBucket, that reads through a vector (dfVector) and a list (dfList) made up of two sublists, "DFOne" and "DFTwo". For each dfList sublist in which it finds the element "Boy", the function adds a dummy dataframe to a new list. This example code works as intended.

This is a simplification of the code I am working on. In the actual code, the equivalents of dfVector and dfList are reactive, expanding and contracting depending on Shiny inputs. There are other lists that the function reads, and there are other conditionals imposed as the vectors and lists are read through by the function. There are also calculations that feed from one sublist to another, instead of filling the sublist dataframes with zeroes as this example does for the sake of simplicity.

Given how much is going on with this function, is using lapply() or another apply family function advisable? Speed is important, but the ultimate dataframe generated by this and related functions won't qualify as "big data" (120 rows by 100+ columns). How could I use lapply() in the code below? I could run speed tests with the for-loop versus lapply().

Code:

dfVector <- function(){c("DF One","DF Two")}

dfList <- list(DFOne = c("Boy","Cat","Dog"),DFTwo = c("Boy","Rat","Bat"))

createBucket <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  buckets <- list()
  
  for (i in seq_along(series)) {
    series_name <- series[i]
    dfListOrder <- dfList[[series_name]]
    
    if ("Boy" %in% dfListOrder) {
      df_name <- paste0("bucket", gsub("\\s+", "", series_name))
      bucket <- data.frame(
        A = rep(0, nbr_rows),
        B = rep(0, nbr_rows),
        check.names = FALSE
      )
      buckets[[df_name]] <- bucket
    }
  }
  # Return the list of buckets, or NULL when no series matched
  if (length(buckets) > 0) buckets else NULL
}

result <- createBucket(10)
result
  • 3
    The accepted answer to [this question](https://stackoverflow.com/questions/42393658/what-are-the-performance-differences-between-for-loops-and-the-apply-family-of-f) starts by saying that [First of all, it is an already long debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.](https://stackoverflow.com/a/42440872/8245406) – Rui Barradas Jun 08 '23 at 17:17
  • 2
    Given Rui's debunking comment (which I was about to say without the link, thanks Rui), the `for` loop could easily be converted into an `lapply` without added complexity; it could also be simplified a little with `Map`, though very little gained there (none of it _speed_ or efficiency). If this code is working, I don't see any particular reason to replace it unless you're refactoring for other reasons. (Again, speed is not something you'll gain.) – r2evans Jun 08 '23 at 17:22

1 Answer


One approach:

createBucket2 <- function(nbr_rows){
  series <- gsub("\\s+", "", dfVector())
  series |>
    lapply(FUN = \(series_name){
      if('Boy' %in% dfList[[series_name]]){
        ## here's the actual performance boost:
        as.data.frame(matrix(0, nbr_rows, 2)) |>
          setNames(nm = c('A', 'B'))
      }
    }) |>
    setNames(nm = paste0('bucket', series)) |>
    (\(.) list(NULL, .)[[1 + (length(.) > 0)]])()
}
> identical(createBucket(10), createBucket2(10))
[1] TRUE 
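
One thing to watch when adapting this to the real code: `lapply()` returns one element per input, so a series whose sublist lacks "Boy" leaves a `NULL` entry in the result, whereas the `for` loop simply skips it. With the question's data every series matches, so the two functions agree. Below is a minimal sketch of a variant that filters first and then builds the buckets. The name `createBucket3` and the third series "DF Three" are hypothetical, added only to show the non-matching case.

```r
# Inputs as in the question, plus a hypothetical third series without "Boy".
dfVector <- function() c("DF One", "DF Two", "DF Three")
dfList <- list(
  DFOne   = c("Boy", "Cat", "Dog"),
  DFTwo   = c("Boy", "Rat", "Bat"),
  DFThree = c("Cat", "Rat")  # hypothetical: no "Boy" here
)

createBucket3 <- function(nbr_rows) {
  series <- gsub("\\s+", "", dfVector())
  # Flag the series whose sublist contains "Boy"
  keep <- vapply(series, function(s) "Boy" %in% dfList[[s]], logical(1))
  hits <- series[keep]
  if (length(hits) == 0) return(NULL)
  # Build one zero-filled dataframe per matching series
  buckets <- lapply(hits, function(s) {
    setNames(as.data.frame(matrix(0, nbr_rows, 2)), c("A", "B"))
  })
  setNames(buckets, paste0("bucket", hits))
}

names(createBucket3(10))
```

Only the two matching series should show up in the names; "DF Three" is dropped rather than kept as a `NULL` entry.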

Edit: As for speed differences, the lapply variant would be about 10% faster than the loop variant (not shown), but the real boost in performance, three times as fast, comes from creating the bucket dataframe via as.data.frame(matrix(...)) rather than via data.frame(...).

loop variant: 314.8 µs

lapply variant: 77.2 µs

(in microseconds, median of 5000 runs using {microbenchmark})
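
To check the dataframe-construction difference on your own machine, something like the following could be used. It is a rough sketch in base R with `system.time()` rather than {microbenchmark}, so it reports a mean rather than a median. The helper names `via_data_frame`, `via_matrix` and `time_it` are made up for illustration, and the numbers will vary by machine.

```r
nbr_rows <- 10
n_reps <- 5000

# Construction used in the for-loop version
via_data_frame <- function() {
  data.frame(A = rep(0, nbr_rows), B = rep(0, nbr_rows), check.names = FALSE)
}

# Construction used in the lapply version
via_matrix <- function() {
  setNames(as.data.frame(matrix(0, nbr_rows, 2)), c("A", "B"))
}

# Both constructions should give the same dataframe
stopifnot(identical(via_data_frame(), via_matrix()))

# Mean time per call in microseconds over n_reps calls
time_it <- function(f) {
  elapsed <- system.time(for (i in seq_len(n_reps)) f())[["elapsed"]]
  elapsed / n_reps * 1e6
}

c(data.frame = time_it(via_data_frame), matrix = time_it(via_matrix))
```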

I_O
  • Very sleek. What does `(\(.) list(NULL, .)[[1 + (length(.) > 0)]])()` in the last line of the function mean and do? – Curious Jorge - user9788072 Jun 08 '23 at 18:28
  • 2
    It checks whether the length of the incoming list (I called it `.`) is greater than zero. When a number is added to TRUE or FALSE, the logical is converted to integer 1 or 0, so adding 1 gives 2 or 1, which is the index of the list item (NULL or `.`) I want to pull. `(\(foo) bar)` defines an anonymous throw-away function doing bar to the argument foo, and the trailing `()` executes this function with the incoming data (passed in by the preceding `|>`). As Rui noted, your `for` loop does what it should do, so I guess it's largely a matter of personal preference (it's "functions-on-a-string" for me :-) ). – I_O Jun 08 '23 at 18:45
  • 2
    Plz see edit: I changed the code of the lapply-variant, and it got thrice as fast as the loop version. – I_O Jun 08 '23 at 19:41
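
The indexing trick discussed in the comments above can be unpacked roughly as follows. This is only an illustrative sketch; the names `null_if_empty` and `null_if_empty2` are hypothetical and do not appear in the answer.

```r
# Logicals are coerced to 1/0 in arithmetic, so the index is 2 or 1
idx_nonempty <- 1 + (length(list(a = 1)) > 0)  # 1 + TRUE
idx_empty    <- 1 + (length(list()) > 0)       # 1 + FALSE

# Named version of the anonymous function from the answer:
# element 1 of list(NULL, x) is NULL, element 2 is x itself
null_if_empty <- function(x) list(NULL, x)[[1 + (length(x) > 0)]]

# A more conventional spelling of the same logic
null_if_empty2 <- function(x) if (length(x) > 0) x else NULL

null_if_empty(list())
null_if_empty(list(a = 1))
```

Both helpers return `NULL` for an empty list and the list itself otherwise, which is what the last line of `createBucket` does with its `if`/`else`.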