
How should I call external programs from sub-instances of parallelized R? The problem could also occur in other contexts, but I am using library(foreach) and library(doFuture) on a Slurm-based HPC. As an example, I have created a hello.txt that contains "hello world", and in my R script I have the following lines both just before and within the %dopar% {}:

message(getwd())
system("echo 'hello directly'")
system("cat hello.txt")

The result in the .out file of the sbatch run looks like this, after I have asked for two %dopar% iterations:

/lustre/scratch/myuser
hello directly
hello world
/lustre/scratch/myuser
/lustre/scratch/myuser
Error in { : task 2 failed - "cannot open the connection"
Calls: %dopar% -> <Anonymous>

Thus, the main R instance on the login node and the sub-instances on the computing nodes seem to share the same working directory, and working with the same files has not been a problem before with native R functions. However, executing system() on the computing nodes fails for some reason. Any help?

Imsa
  • Thanks @KonradRudolph for pointing it out. I think it was a relic from earlier experimenting. For the sake of clarity I removed the command. – Imsa Aug 15 '23 at 12:36
  • Is `hello.txt` created immediately before your `%dopar%` loop, or has the file existed for a while (> 10 seconds?) before this script runs? I work on an HPC with a Lustre filesystem as well, and I find the lag in file visibility between nodes on the cluster can be upwards of 10 seconds (it's a large cluster). – r2evans Aug 15 '23 at 12:59
  • @r2evans Oh, that is an important observation, thanks! Maybe not critical for this particular case, but good to know for later purposes. I have manually created the hello.txt well before trying to access it in this example. – Imsa Aug 15 '23 at 13:07
  • Is the failure in the `system(..)` call or in `cat`? For instance, can you instead do `cat(readLines("hello.txt"), "\n")` without failure? My initial thought was a problem with file access, but perhaps it's with `system(.)` itself. – r2evans Aug 15 '23 at 13:13
  • I would call this a feature. You are trying to write to the same file with parallel connections. I don't think that can work. I recommend writing to separate files and combining those in the end. – Roland Aug 15 '23 at 13:22
  • @Roland, the file is created before, and neither of the `system(..)` calls writes to a file. Did I misinterpret the OP? – r2evans Aug 15 '23 at 13:27
  • @r2evans Oh, you are right. It's a read connection. – Roland Aug 15 '23 at 13:30
  • @r2evans I tried, and cat(readLines("hello.txt"), "\n") produces the expected "hello world" three times: once outside and twice within the %dopar%. – Imsa Aug 15 '23 at 13:44
  • @Roland Thanks for the input, that is actually what I am doing in my main project. Notice how even the mere system("echo 'hello directly'") from the computing nodes fails, strongly implying that the problem is in the system() function. That said, considering possible file and connection conflicts is also an important topic. – Imsa Aug 15 '23 at 13:44
  • It sounds like the culprit is `system` itself, not the filesystem access. I've heard of this before; it might be a limitation of the docker/container/lxd/virtualized environment in which your job is being executed. (This may not be horrible ... `system` and `system2` are terrible functions, and I tend to use the `processx` package for anything that _might_ do something "crazy" like include a space or special character in a command-line argument.) Try `processx::run("echo", "hello directly")$stdout` and `processx::run("cat", "hello.txt")$stdout` instead (see the diagnostic sketch after these comments). – r2evans Aug 15 '23 at 13:56
  • @r2evans I can try. But before installing new packages, I will explain my intentions a bit more. My approach is maybe not the most orthodox one, but I would actually like to run `singularity exec` from the R sessions running on the computing nodes. One way is to use commands similar to system(), but if you happen to know other and possibly better ideas, let me know. – Imsa Aug 15 '23 at 14:21
  • @r2evans And the Singularity version I am using is unfortunately too old (1.1.9-1.el7), so I am not able to use the instances/services feature introduced in Singularity 2.4. – Imsa Aug 15 '23 at 14:27
  • @r2evans Unfortunately the processx commands did not help; here is the relevant part of the .out: – Imsa Aug 16 '23 at 04:29
  • /lustre/scratch/myuser
    [1] "hello directly\n"
    [1] "hello world\n"
    /lustre/scratch/myuser
    /lustre/scratch/myuser
    Error in { : task 2 failed - "cannot open the connection" – Imsa Aug 16 '23 at 04:30
  • Other things I have tried: I started to wonder whether the computing nodes have enough resources to make `system()` calls and eventually open the container with Singularity. Thus, in the `plan(future.batchtools::batchtools_slurm...` I put `cpuspertask=8` and `mempercpu=10000`, and in the `batchtools.slurm.tmpl` I also put `#SBATCH --ntasks=8`, but these modifications did not help either. I'm stuck. – Imsa Aug 16 '23 at 05:07
  • The issue is likely to be the system call (that is, OS-level syscall, not R's `system`). R's `base::system` calls the internal `do_system` (C) that calls `R_system` (C) that calls [`system`](https://man7.org/linux/man-pages/man3/system.3.html). Since that is failing, it is possible `system` or `fork` (syscalls) or pipes are disabled in the singularity environment. Seems a bit harsh ... sorry, I don't know a way to resolve that kind of draconian restriction. – r2evans Aug 16 '23 at 12:57
  • @r2evans What is your estimate: are these restrictions baked into Slurm or the other software/packages used, or would it be worth asking the (busy) system administrators of the HPC to remove the assumed restrictions? I am not sure I understood the concept of the singularity environment, because the mere system() call with echo also fails even without any singularity calls. – Imsa Aug 17 '23 at 06:13
  • I don't use Singularity, so I really don't know if/how that configuration can be done (or even "if"). Other questions: (1) How are you defining the "image" for singularity? I know that it can be given a docker image (in which case it unpacks it to a filesystem for use). Do you have control over the docker image itself? (2) Do you know the OS and perhaps even kernel version of the nodes or at least your gateway host(s)? – r2evans Aug 17 '23 at 12:17
  • @r2evans thank you very much for your help. I eventually gave up and tried using array jobs, with a lot of new issues as discussed here: https://stackoverflow.com/questions/76971048/why-singularity-containers-behave-differently-on-login-vs-computing-nodes-on-slu – Imsa Aug 24 '23 at 15:57
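
Based on the suggestions in the comments above, a combined diagnostic sketch such as the following could be dropped into the %dopar% block to narrow down which mechanism fails on the computing nodes; the specific checks are illustrative assumptions, not something from the original thread:

res <- foreach(i = 1:2) %dopar% {
  # Can the worker find a shell at all?
  message("sh found at: ", Sys.which("sh"))

  # With intern = FALSE, system() returns the command's exit status;
  # 127 typically means the command/shell could not be started.
  status <- try(system("echo 'hello directly'", intern = FALSE))
  message("system() status: ", paste(status, collapse = " "))

  # processx runs the program directly (no shell), which can behave
  # differently from system() in restricted environments.
  out <- try(processx::run("echo", "hello directly")$stdout)
  message("processx stdout: ", paste(out, collapse = " "))

  # Native R connections (no external process) as a baseline.
  message("readLines(): ", paste(readLines("hello.txt"), collapse = " "))
}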

0 Answers