1

I am making a simple R function to write files on disk consistently in the same way, but in different folders:

library(magrittr)

main_path <- here::here()

write_to_disk <- function(data, folder, name){
     data %>%
     vroom::vroom_write(
          file.path(main_path, folder, paste0(name, ".tsv"))
     )
}

I know I don't necessarily need to return anything in R functions, but if I were to, what would be the appropriate return() statement here?

Thanks a lot

r2evans
  • 141,215
  • 6
  • 77
  • 149
Giulio Centorame
  • 678
  • 4
  • 19
  • 1
    `return(invisible(NULL))` or simply `invisible(NULL)` on the last line of the body of the function – Aurèle Nov 02 '21 at 15:00
  • 1
    Alternatives, in my order of preference: (1) return the output generated by the process that writes the file; (2) return the data itself (which is what `vroom` does, btw); (3) return the filename generated, assuming all was written without error/warning; (4) wrap your inner call to save the file in `try` or `tryCatch`, and return `TRUE` if no errors were generated and `FALSE` otherwise ... though adding an attribute to `FALSE` to indicate the failure reason would be useful, too. – r2evans Nov 02 '21 at 15:20
  • 2
    *Every* function in R returns something; you don’t need to call the `return()` function [and generally shouldn’t](https://stackoverflow.com/a/59090751/1968). In your case, just forwarding whatever `vroom_write` (not `vroom`, surely?) returns seems fine. In other words: I’d leave the code as-is. – Konrad Rudolph Nov 02 '21 at 15:46
  • Yup, that was vroom_write, thanks for catching the mistake – Giulio Centorame Nov 02 '21 at 15:47

1 Answers1

6

This is prone to opinion and context, to be honest, but some thoughts:

  1. Return the original data. This spirit is conveyed in most of the tidyverse verb-functions and many other packages (and some in base R). If you're using either the %>% or |> pipes, doing this allows processing on the data after your function, which might be very convenient.

    write_to_disk <- function(data, folder, name){
      vroom::vroom_write(
        data,
        file.path(folder, paste0(name, ".tsv"))
      )
      data
    }
    

    (Your function is implicitly doing this already, since the call to vroom::vroom_write is the last expression in your function body.)

  2. Return the output from the file-writing call. I don't like this as much, frankly, because if you ever change which function is being used by your wrapper, then the return value of your function may very well change. I don't know the life-cycle expectancy of your wrapper function, but imagine if you choose to switch from vroom::vroom to another function; vroom returns a subset of the data based on col_select, and perhaps the newer function will return the whole data, which may break assumptions of downstream processing.

    write_to_disk <- function(data, folder, name){
      out <- vroom::vroom_write(
        data,
        file.path(folder, paste0(name, ".tsv"))
      )
      out
    }
    

    Note: I explicitly chose to capture into out and return it in case you ever add code between vroom::vroom_write and subsequent out. Your original function unchanged is in effect doing the same thing, but if you choose to do anything post-vroom_write, then this extra step would be necessary.

    Otherwise, your function is implicitly doing this already, since vroom::vroom_write returns the data.

  3. Return the filename. This is only useful if the filename is not necessarily know a priori. For instance, if your wrapper takes care to not overwrite same-named files, it might add a counter (pre-extension) so that overwriting is never going to happen. In that case, the calling environment does not know what the chosen filename is, so it has value (sometimes).

    write_to_disk <- function(data, folder, name){
      # file numbering
      re <- paste0("^", name, "_?([0-9]+)?\\.tsv$")
      existfiles <- list.files(folder, pattern = re, full.names = TRUE)
      nextnum <- max(0L, suppressWarnings(as.integer(gsub(re, "\\1", basename(existfiles)))), na.rm = TRUE)
      if (nextnum > 0) {
        name <- sprintf("%s_%03i", name, nextnum + 1L)
      }
      filename <- file.path(folder, paste0(name, ".tsv"))
      vroom::vroom_write(
        data,
        filename
      )
      filename
    }
    

    (The "file numbering" code offered only as an example of why I think returning the filename might make sense.)

  4. Return the success of the writing function. This would likely require using try or tryCatch (or any of the tidyverse equivalents), catching errors, and reacting accordingly.

    write_to_disk <- function(data, folder, name){
      res <- tryCatch(
        vroom::vroom_write(
          data,
          file.path(folder, paste0(name, ".tsv"))
        ),
        error = function(e) e
      )
      out <- !inherits(res, "error")
      if (!out) {
        attr(out, "error") <- conditionMessage(res)
      }
      out
    }
    
  5. Return nothing. This is the easiest, certainly. You'd need to do it explicitly so that you don't inadvertently return the return-value from the file-writing function.

    write_to_disk <- function(data, folder, name){
      vroom::vroom_write(
        data,
        file.path(folder, paste0(name, ".tsv"))
      )
      NULL
    }
    

Notes:

  1. Your use of main_path is counter to functional programming, since the function behaves differently given identical inputs based on the presence of something outside of its immediate scope. I argue it's better to pass write_to_dist(x, file.path(main_path, folder), "somename") since main_path is defined in that environment (not within the function), and your function would be general enough to not require that variable be defined correctly.

    I've updated all of the code above to reflect this good practice. If you feel strongly enough against this, feel free to add back in main_path in your preferred locations.

  2. It might be useful for any of the above to be returned invisibly, so that (for instance) saving a large data.frame without capturing its return value does not flood the console with the data. This is easy enough to do with invisible(data) and changes nothing of the return value (other than that it is not printed on the console by default).

  3. FYI: Konrad and I have gone back-and-forth in the comments about whether return(.) is a good idea or not. I don't disagree with most of the claims, and argue regardless that it can be as much about style and opinion than much else. Regardless, since most of my arguments for return are moot in all of the above code, I removed it for succinctness.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    OP wouldn’t have to change their code to perform the identical options (1)/(2): `vroom_write` *already* invisibly returns the data. – Konrad Rudolph Nov 02 '21 at 15:47
  • Imperfectly, yes, but `vroom::vroom` does subset the data if `col_select=` is used (not strictly relevant here, but then again it might be a reduced-complexity wrapper function for the sake of a question on SO). And yes, I know `vroom` returns it, I said that in #2: *"original function unchanged is in effect doing the same thing"*. – r2evans Nov 02 '21 at 15:50
  • 1
    No, OP made a mistake in their question, they’re using `vroom_write`, not `vroom`. – Konrad Rudolph Nov 02 '21 at 15:51
  • Okay, yes I see the edit now. That changes my previous *comment* but not the answer, which is still correct. Thanks. – r2evans Nov 02 '21 at 15:53
  • 1
    OK, then let me adjust my comment: since *every* function in R returns a value, and this is well known as a *fundamental property* of R, I disagree (fairly strongly) with your claims about using `return` being more explicit. It’s more *verbose*, but not even a tiny bit more explicit. And for this reason I recommend against this practice. – Konrad Rudolph Nov 02 '21 at 16:00
  • BTW, *generally shouldn't use `return`* is *opinion* and *style*. I wholly respect that it's your opinion. There are clear exceptions to your linked answer, and the answer makes no effort to make any qualification. For instance: (1) early func-exit, such as `if (length(x) < 2) return(x)`; this is not contrived, it happens a *lot*, and a really-long-`if`-statement is not a better alternative; (2) relying on "last expression" can and has broken code in the past when somebody inadvertently adds (cleanup) code after the pivotal call. – r2evans Nov 02 '21 at 16:05
  • 1
    I admit that many of my functions use the *implicit* last-expression-returned mindset, but the declaration that one "shouldn't" with zero exceptions is bordering on hyperbolic. The "generally" starts this discussion, but really there are clear examples (again, *in my opinion*) that are much improved (readable, maintainable, less error-prone) with an explicit `return` statement. (Yes, this is opinion, but I think stating it as an "R Best Practice" (not your words) is over the top.) – r2evans Nov 02 '21 at 16:05
  • 1
    Bottom line, @KonradRudolph, I think an explicit call to `return` is unnecessary in perhaps 80% of the cases ... but the overhead induced by the primitive function call is *trivial*, and it just might make a function-writer's intentions clearer in some of those cases. Thanks for the discussion :-) You aren't wrong in your opinion :-) – r2evans Nov 02 '21 at 16:06
  • Apologies for derailing the topic, but I find the “it’s a style” a tired and overused argument in programming discussions. I claim that there are good technical reasons why one style is superior (which I’ve detailed, along with the exceptions, in the answer I linked under the question — and performance is *not* amongst the reasons). They’re not equally good styles. By the way, I appreciate your courtesy of not putting words in my mouth but in this case I’m actually happy with calling my preference best practice since it naturally emerges from R’s execution model. – Konrad Rudolph Nov 02 '21 at 16:10
  • Your arguments are strong, certainly, and perhaps I'll adjust how I frame things in this regard. Thanks. (I updated my answer, btw, hope you don't mind the mention @KonradRudolph.) – r2evans Nov 02 '21 at 16:16
  • 1
    Re `main_path`: my script imports a user-specified JSON file at the beginning with all the relative file paths where to save things. Those have to be separated from the script as it's running on a cluster and I can't save the output in any relative folder. The thing I *really* want to avoid is to specify `file.path()` too many times in my script. I guess a better way would be something like `path_folder <- file.path(here::here(), "folder"); a <- function(data, folder, "name"){...}`? – Giulio Centorame Nov 02 '21 at 16:48
  • 1
    Yes, I think your last code here makes the most sense to me. I understand that this function is only used within this script, so the risk is certainly low, but it's too easy for that kind of code/thinking to blur into non-single-script uses, where its non-*reproducibility* is a not-insignificant liability. – r2evans Nov 02 '21 at 16:50