Here issue in R scripts

I am trying to understand how here() can work in a portable way. Update: see what works under "Final answer (TL;DR)" below. The bottom line is that here() is not very useful when running a script.R from the command line.

As I understand it, with help from JBGruber: here() looks for the root directory of a project (e.g., an RStudio project, Git project or other project defined with a .here file), starting at the current working directory and moving up until it finds one. If it finds nothing, it falls back to the working directory itself, which for a script run by cron is my home directory. One could, of course, pass the directory as a parameter in the cron command, but that is rather cumbersome. The answers below provide good explanations, and I have summarised what I found most immediately useful under "Final answer (TL;DR)". But make no mistake, nicola's answer is very good and helpful too.
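
To make the search order concrete, here is a minimal sketch in base R of the kind of upward search described above. It is a simplification, not the here package's actual implementation: it only checks for a .here file, and the name find_root_sketch() and its arguments are made up for illustration.

```r
# Simplified sketch of an upward root search; NOT the here package's real code.
# find_root_sketch(), `start` and `marker` are hypothetical names.
find_root_sketch <- function(start = getwd(), marker = ".here") {
  start <- normalizePath(start, mustWork = TRUE)
  dir <- start
  repeat {
    if (file.exists(file.path(dir, marker))) {
      return(dir)                 # marker found: treat this as the project root
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      return(start)               # reached the filesystem root: fall back to start
    }
    dir <- parent
  }
}
```

Called from a subdirectory such as myGoodScripts/bkp, this would return the myGoodScripts directory if a .here file exists there. Called from a directory with no .here file anywhere above it, it returns the starting directory unchanged, which is the cron situation described above.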

Original objective: write a set of R scripts, including R Markdown .Rmd files, so that I can zip the directory, send it to someone else, and have it run on their computer, potentially a very low-end one such as a Raspberry Pi or old hardware running Linux.

Conditions:

  • can be run from commandline via Rscript
  • as above but scheduled via cron
  • main method for setting up the working directory is set_here(), executed once from the console; after that the folder is portable because the .here file is included in the zipped directory.
  • does not need Rstudio - hence do not want to do R-projects
  • can also be run interactively from Rstudio (development)
  • can be executed from shiny (I assume that will be OK if the above conditions are met)

I specifically do not want to create RStudio projects because, in my view, that requires installing and using RStudio, and I want my scripts to be as portable as possible and to run on low-resource, headless platforms.

Sample code:

Let us assume the working directory will be myGoodScripts as follows:

/Users/john/src/myGoodScripts/

When starting development I would go to the above directory with setwd() and execute set_here() to create the .here file. The directory then contains two files, dataFetcherMailer.R and dataFetcher.Rmd, and a subdirectory bkp:

dataFetcherMailer.R

library(here)
library(knitr)

basedir <- here()
# this is where here should give path to .here file

rmarkdown::render(paste0(basedir,"/dataFetcher.Rmd"))

# email the created report
# email_routine_with_gmailr(paste0(basedir,"/dataFetcher.pdf"))
# now substituted with verification that a pdf report was created
file.exists(paste0(basedir,"/dataFetcher.pdf"))

dataFetcher.Rmd

---
title: "Data collection control report"
author: "HAL"
date: "`r Sys.Date()`"
output: pdf_document
---

```{r setup, include=FALSE}
library(knitr)
library(here)

basedir <- here()

# in actual program this reads data from a changing online data source
df.main <- mtcars

# data backup
datestamp <- format(Sys.time(),format="%Y-%m-%d_%H-%M")
backupName <- paste0(basedir,"/bkp/dataBackup_",datestamp,".csv.gz")
write.csv(df.main, gzfile(backupName))
```

# This is data collection report

Yesterday's data total records: `r nrow(df.main)`. 

The basedir was `r basedir`

The current directory is `r getwd()`

The here path is `r here()`

The last 3 lines in the report would be matching, I guess. Even if getwd() does not match the other two, it should not matter, because here() would ensure an absolute basepath.

Errors

Of course - the above does not work. It only works if I execute Rscript ./dataFetcherMailer.R from the same myGoodScripts/ directory.

My aim is to understand how to execute the scripts so that relative paths are resolved relative to the script's location and the script can be run from the command line independent of the current working directory. At the moment I can run this from bash only if I have first done cd to the directory containing the script. If I schedule cron to execute the script, the default working directory is /home/user and the script fails. My naive assumption was that basedir <- here() would give a fixed point in the filesystem from which relative paths could be resolved, regardless of the shell's current working directory. That does not work.

From Rstudio without prior setwd()

here() starts at /home/user
Error in abs_path(input) : 
The file '/home/user/dataFetcher.Rmd' does not exist.

From bash with Rscript if cwd not set to the script directory.

$ cd /home/user/src
$ Rscript ./myGoodScripts/dataFetcherMailer.R 
here() starts at /home/user/src
Error in abs_path(input) : 
The file '/home/user/src/dataFetcher.Rmd' does not exist.
Calls: <Anonymous> -> setwd -> dirname -> abs_path

If someone could help me understand and resolve this problem, that would be fantastic. If another reliable method to set basepath without here() exists, I would love to know. Ultimately executing script from Rstudio matters a lot less than understanding how to execute such scripts from commandline/cron.

Update since JBGruber answer:

I modified the function a little so that it could return either filename or directory for the file. I am currently trying to modify it so that it would work when .Rmd file is knitted from Rstudio and equally run via R file.

here2 <- function(type = 'dir') {
  args <- commandArgs(trailingOnly = FALSE)
  if ("RStudio" %in% args) {
    filepath <- rstudioapi::getActiveDocumentContext()$path
  } else if ("interactive" %in% args) {
    file_arg <- "--file="
    filepath <- sub(file_arg, "", grep(file_arg, args, value = TRUE))
  } else if ("--slave" %in% args) {
    string <- args[6]
    mBtwSquotes <- "(?<=')[^']*[^']*(?=')"
    filepath <- regmatches(string,regexpr(mBtwSquotes,string,perl = T))
  } else if (any(startsWith(args, "--file="))) {
    file_arg <- "--file="
    filepath <- sub(file_arg, "", grep(file_arg, args, value = TRUE))
  } else {
    if (type == 'dir') {
      filepath <- '.'
      return(filepath)
    } else {
      filepath <- "error"
      return(filepath)
    }
  }
  if (type == 'dir') {
    filepath <- dirname(filepath)
  }  
  return(filepath)
}

I discovered, however, that commandArgs() are inherited from the R script, i.e. they remain the same for the .Rmd document when it is knitted from a script.R. Therefore only the base path of the script.R location can be used universally, not the file name. In other words, this function, when placed in a .Rmd file, will point to the path of the calling script.R, not to the path of the .Rmd file.
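
A minimal sketch of why this happens, assuming the script is started with Rscript: the --file= argument is fixed when the R process starts, so any code running in that process, including a .Rmd rendered from the script, sees the same value. The helper name script_path_from_args() and the example argument vector are illustrative assumptions; the exact arguments vary by R version and platform.

```r
# Hypothetical helper: extract the script path from the "--file=" argument
# that Rscript places in commandArgs(). Returns NA if no such argument exists.
script_path_from_args <- function(args = commandArgs(trailingOnly = FALSE)) {
  file_arg <- "--file="
  hit <- args[startsWith(args, file_arg)]
  if (length(hit) == 0) {
    return(NA_character_)          # R was not started as `Rscript <file>`
  }
  substring(hit[1], nchar(file_arg) + 1)
}

# Illustrative argument vector (an assumption, not captured from a real run),
# roughly what `Rscript ./myGoodScripts/dataFetcherMailer.R` might produce.
example_args <- c("/usr/lib/R/bin/exec/R", "--no-echo", "--no-restore",
                  "--file=./myGoodScripts/dataFetcherMailer.R")

script_dir <- dirname(script_path_from_args(example_args))
```

Because the argument vector belongs to the whole R process, calling the same helper inside a .Rmd rendered by dataFetcherMailer.R would still yield the path of dataFetcherMailer.R.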

Final answer (TL;DR)

The shorter version of this function will therefore be more useful:

here2 <- function() {
  args <- commandArgs(trailingOnly = FALSE)
  if ("RStudio" %in% args) {
    # R script called from Rstudio with "source file button"
    filepath <- rstudioapi::getActiveDocumentContext()$path
  } else if ("--slave" %in% args) {
    # Rmd file called from Rstudio with "knit button"  
    # (if we placed this function in a .Rmd file)
    file_arg <- "rmarkdown::render"
    string <- grep(file_arg, args, value = TRUE)
    mBtwQuotes <- "(?<=')[^']*[^']*(?=')"
    filepath <- regmatches(string,regexpr(mBtwQuotes,string,perl = T))
  } else if ((sum(grepl("--file=" ,args))) >0) {
    # called in some other way that passes --file= argument
    # R script called via cron or commandline using Rscript
    file_arg <- "--file="
    filepath <- sub(file_arg, "", grep(file_arg, args, value = TRUE))
  } else if (sum(grepl("rmarkdown::render" ,args)) >0 ) {
    # Rmd file called to render from commandline with 
    # Rscript -e 'rmarkdown::render("RmdFileName")'
    file_arg <- "rmarkdown::render"
    string <- grep(file_arg, args, value = TRUE)
    mBtwQuotes <- "(?<=\")[^\"]*[^\"]*(?=\")"
    filepath <- regmatches(string,regexpr(mBtwQuotes,string,perl = T))
  } else {
    # we do not know what is happening; taking a chance; could have  error later
    filepath <- normalizePath(".")
    return(filepath)
  }
  filepath <- dirname(filepath)
  return(filepath)
}

NB: from within .Rmd file to get to the containing directory of the file it is enough to call normalizePath(".") - which works whether you call the .Rmd file from a script, commandline or from Rstudio.

r0berts
  • It's not clear to me what the problem is. You can ship the folder and whoever receives it can run it the same way you run it on your PC. Maybe you want to launch it from somewhere else? Please show what does not work. – nicola Dec 26 '20 at 07:44
  • Thanks, I added the explanation and error messages. – r0berts Dec 26 '20 at 09:23
  • This turned out quite nicely! I didn't check all the different cases you cover but I'm happy you finished what I started :) I will copy it to my answer to signal that this is the final answer. – JBGruber Jan 02 '21 at 19:02

2 Answers

what you asked for

The behaviour of here() isn't really what you want here, I think. Instead, what you are looking for is to determine the path of the source file aka the .R file. I extended the here() command a little to behave the way you expect:

here2 <- function() {
  args <- commandArgs(trailingOnly = FALSE)
  if ("RStudio" %in% args) {
    dirname(rstudioapi::getActiveDocumentContext()$path)
  } else {
    file_arg <- "--file="
    filepath <- sub(file_arg, "", grep(file_arg, args, value = TRUE))
    dirname(filepath)
  }
}

The idea for the case when the script is not run in RStudio comes from this answer. I tried this by pasting the function definition at the beginning of your dataFetcherMailer.R file. You could also think about putting this in another file in your home directory and call it with, e.g., source("here2.R") instead of library(here) or you could write a small R package for this purpose.

final version by r0berts (op)

here2 <- function() {
  args <- commandArgs(trailingOnly = FALSE)
  if ("RStudio" %in% args) {
    # R script called from Rstudio with "source file button"
    filepath <- rstudioapi::getActiveDocumentContext()$path
  } else if ("--slave" %in% args) {
    # Rmd file called from Rstudio with "knit button"  
    # (if we placed this function in a .Rmd file)
    file_arg <- "rmarkdown::render"
    string <- grep(file_arg, args, value = TRUE)
    mBtwQuotes <- "(?<=')[^']*[^']*(?=')"
    filepath <- regmatches(string,regexpr(mBtwQuotes,string,perl = T))
  } else if ((sum(grepl("--file=" ,args))) >0) {
    # called in some other way that passes --file= argument
    # R script called via cron or commandline using Rscript
    file_arg <- "--file="
    filepath <- sub(file_arg, "", grep(file_arg, args, value = TRUE))
  } else if (sum(grepl("rmarkdown::render" ,args)) >0 ) {
    # Rmd file called to render from commandline with 
    # Rscript -e 'rmarkdown::render("RmdFileName")'
    file_arg <- "rmarkdown::render"
    string <- grep(file_arg, args, value = TRUE)
    mBtwQuotes <- "(?<=\")[^\"]*[^\"]*(?=\")"
    filepath <- regmatches(string,regexpr(mBtwQuotes,string,perl = T))
  } else {
    # we do not know what is happening; taking a chance; could have  error later
    filepath <- normalizePath(".")
    return(filepath)
  }
  filepath <- dirname(filepath)
  return(filepath)
}

what I think most people actually need

I found this way a while ago but then actually changed my workflow entirely to only use R Markdown files (and RStudio projects). One of the advantages of this is that the working directory of Rmd files is always the location of the file. So instead of bothering with setting a working directory, you can just write all paths in your script relative to the Rmd file location.

---
title: "Data collection control report"
author: "HAL"
date: "`r Sys.Date()`"
output: pdf_document
---

```{r setup, include=FALSE}
library(knitr)

# in actual program this reads data from a changing online data source
df.main <- mtcars

# data backup
datestamp <- format(Sys.time(),format="%Y-%m-%d_%H-%M")

# create bkp folder if it doesn't exist
if (!dir.exists(paste0("./bkp/"))) dir.create("./bkp/")

backupName <- paste0("./bkp/dataBackup_", datestamp, ".csv.gz")
write.csv(df.main, gzfile(backupName))
```

# This is data collection report

Yesterday's data total records: `r nrow(df.main)`. 

The current directory is `r getwd()`

Note that paths starting with ./ mean to start in the folder of the Rmd file. ../ means you go one level up. ../../ you go two levels up and so on. So if your Rmd file is in a folder called "scripts" in your root folder, and you want to save your data in a folder called "data" in your root folder, you write saveRDS(data, "../data/dat.RDS").

You can run the Rmd file from command line/cron with Rscript -e 'rmarkdown::render("/home/johannes/Desktop/myGoodScripts/dataFetcher.Rmd")'.

JBGruber
  • This looks great and might just solve it. I will test it and get back in an hour (when I get to my laptop) – r0berts Dec 26 '20 at 12:22
  • I have tested and it works very nicely. Thanks to your function and explanation I even understand how it works. Being a non-programmer I only wonder what would be the use of `variables` argument/parameter to the `function()` declaration. But I should be able to replicate this in my more complex scripts OK. – r0berts Dec 26 '20 at 14:55
  • Sorry the `variables` was just a leftover from the function template I used. I corrected it! – JBGruber Dec 26 '20 at 15:13
  • I thought about it a little more and added an alternative answer that probably meets your need better. – JBGruber Dec 28 '20 at 11:15
  • Thanks Johannes, that is very helpful. I was aware of relative paths when knitting `.Rmd` but the elegance of your function is that it can be just added to the file and it works. I will need to give these scripts to colleagues who are often not aware of what a _path_ is. At the moment I am trying to modify your function so that it would also work when knitting Rmd from Rstudio. I have pasted the modified function in my original question - based on the fact that the second arg is `--slave` when knitting from Rstudio. – r0berts Dec 28 '20 at 12:39
  • Out of interest, @JBGruber, how would you define behaviour of `here()` in a concise way? – r0berts Jan 06 '21 at 10:58
  • I might be wrong but: `here()` looks for the root directory of a project (e.g., an RStudio project, Git project or other project defined with a `.here` file) starting at **the current working directory** and moving up until it finds any project. If it doesn't find anything it falls back to using the full working directory. I was continually confused by this behaviour since I expected the search to start at the file location rather than the working directory. I'm happy that `.Rmd` files don't have that issue, which makes them more useful for me. – JBGruber Jan 06 '21 at 11:34
  • Thanks, this is precisely what I suspected, but I somehow could not twig easily from the user documentation. It is very reassuring to know things in a simple way. Another reason for confusion is that this is often inconsistent. E.g. load `here`, issue `setwd()` in the console, confirm it with `getwd()`, and issue `set_here()` for that specific directory; `here()` and `dr_here()` both still point towards the initial working directory (which defaults to the home directory). I have to restart the R session for `here()` to notice. I suspect this relates to the previous `setwd()` prior to the R session restart. Confusing. – r0berts Jan 06 '21 at 11:44
  • It's actually said here: https://github.com/jennybc/here_here#the-fine-print but I read that first sentence about 20 times before realising it says working directory. – JBGruber Jan 06 '21 at 15:12
  • Yes, it does say so, but it was not obvious to me too when I was frantically looking for a reliable solution. The introduction on gitHub sounds a bit like here() is the solution to all things to do with paths, but, as I see it, it really is relevant more to projects run from Rstudio or other interface, not plain command line. – r0berts Jan 06 '21 at 16:18

Although your question requires the usage of the here package, I propose a solution without the need of it. I think that it's much cleaner and equally portable.

If my understanding is correct, you want your scripts to be aware of their own location. This is fine, but in most cases unnecessary, because the caller of your script must know where the script is located in order to call it, and you can exploit this knowledge. So:

  • get rid of all the here calls;
  • don't try to determine in your scripts the file location, but just write every path as relative to the root of your folder (just as you do in development).

Next, a couple of options.

The first, minimal, option is not to register the bare Rscript /path/to/yourfolder/yourscript.R command with cron, but rather to create a shell script as follows (let's call it script.sh):

#!/bin/sh
cd /path/to/yourfolder
Rscript yourscript.R

and register this script in crontab. You can add a README file to your folder in which you instruct users to do the above (something like: "extract the folder wherever you want, take note of the path, build a script.sh file and crontab it"). Of course, with RStudio you can open and run the file in the usual way (setwd and then run it; you document this in the README).

The second is to write an "installer" (you can choose whether a makefile, a simple R script, bash file or whatever), that does the above automatically. It just performs these steps.

  1. Creates a folder under the homedir, something like .robertsProject (notice the dot to be more likely that the directory does not exist).
  2. Copies all the files and directories from your folder to this newly created folder.
  3. Creates a .sh file just as the one above (notice that you know where you are moving the files and their location, so you can write the correct path in the script).
  4. Registers the .sh file to crontab.
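
Steps 1 to 3 could be sketched in R roughly as follows. This is only an illustration under assumptions: install_project() and its arguments are hypothetical names, it targets a Unix-like system, and the crontab registration in step 4 is deliberately left as a manual step.

```r
# Hypothetical installer sketch; install_project() is not part of any package.
install_project <- function(src_dir, target_dir, script = "yourscript.R") {
  # Step 1: create the target folder (e.g. ~/.robertsProject)
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)

  # Step 2: copy all files and subdirectories, including hidden ones
  entries <- list.files(src_dir, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  file.copy(entries, target_dir, recursive = TRUE)

  # Step 3: write a wrapper that changes to the folder before running the script
  sh_path <- file.path(target_dir, "script.sh")
  writeLines(c("#!/bin/sh",
               paste("cd", shQuote(normalizePath(target_dir))),
               paste("Rscript", shQuote(script))),
             sh_path)
  Sys.chmod(sh_path, mode = "0755")   # make the wrapper executable

  # Step 4 (registering script.sh in crontab) is not done here
  invisible(sh_path)
}
```

A call might look like install_project("/path/to/yourfolder", file.path(path.expand("~"), ".robertsProject")), after which the generated script.sh is the file to register in crontab.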

Done! Whoever receives the files will just have to run this installer once (you will document how to do it in the README) and they can use your tool.

nicola
  • Thank you, that is very useful. Maybe I will implement this in the future. However for now I am needing something that is simple enough for other users to open in Rstudio and run - and function without change when called via `cron`. With installer I would encounter the obstacle that most of the colleagues will use windows 10, will not know what _path_ is and will not have admin rights to their computers. So the main aim is for end users not to get flustered if they run it on their machines; if it comes to placing the script on a linux headless server, I will have to do that. – r0berts Dec 28 '20 at 16:48
  • I can't see how the above solution is more complicated in any way. If you have to set up a cron job, the installer is much easier for yourself and for everybody that want to use it. If the user wants to run the script just with Rstudio, the only thing to do before "run script" would be "set working directory" from the menu, hardly a complicated task. The point is that the user for both running and/or setting the working directory must know where your files are. – nicola Dec 31 '20 at 14:07
  • Thanks Nicola, I think you are completely right in general and I much appreciate your advice. I will definitely try to implement this process later. However the reason for my question was my need to understand what the differences are for "location" of the `.R/.Rmd` file as executed either from command line or Rstudio. Which I now understand having gone through both answers and updating my question. The trouble with installer is - other people who may need to use the script would be using Windows, hence `cron` and `bash` would confuse them and appear near impossible. – r0berts Jan 01 '21 at 18:12
  • Ok, but I was under the assumption that a Windows user would not be interested in `cron` and so they can safely skip the "installer" part. They just have to open Rstudio and set the working directory. The difference with your request is that they should go directly to "run script", but I don't see the added complication. Consider that having a "home" directory is something that basically *every* application has and is set during installation. There aren't many examples of programs that rely on the path in which they are located at runtime. – nicola Jan 01 '21 at 21:47