6

I use future_lapply() to parallelize my code on a Linux machine. If I terminate the process early, only one worker is freed and the remaining parallel processes persist. I know I can run tools::pskill(PID) to end each individual process, but this is tedious since I run on 26 cores.

Is there a way to make a system call to Linux, from R, to get all the active PIDs?

I set up future_lapply as such:

library(future)
library(future.apply)

# set number of workers
works <- 26
plan(multiprocess, workers = works)
future_lapply(datas, function(data) {
  # do some long processes
})

If I terminate the process and run top, I still see multiple rsession processes [screenshot of top output omitted], as my parallel sessions are still running.
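One hedged possibility for the "system call from R" part (an illustrative sketch, not from the original post): shell out to pgrep. The "rsession" pattern is an assumption based on the top output above, and it will also match your main RStudio session, so filter out Sys.getpid() before killing anything:

```r
# Hypothetical sketch: list candidate PIDs via a system call to pgrep.
# "rsession" matches the COMMAND column in top, including the current
# session, which we exclude.
pids <- suppressWarnings(system("pgrep rsession", intern = TRUE))
pids <- setdiff(pids, as.character(Sys.getpid()))
pids  # character vector of leftover worker PIDs, possibly empty
```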

Update with session information:
version.string R version 3.6.2 (2019-12-12)
future 1.12.0
future.apply 1.2.0

Mxblsdl
  • (author of future here) First of all, let's establish a few more things about your setup. Specifically, what versions of R, future, and future.apply are you running? Also, `top`'s `COMMAND` column lists `rsession`, which would suggest you're running this inside the RStudio Console - is that correct? Because of this, I suspect that you've also re-enabled _forked_ processing in RStudio, e.g. you've set the R option `future.fork.enable=TRUE`. Is that correct? Knowing all this will help you in the next step. – HenrikB Jan 08 '20 at 22:48
  • @HenrikB I am running future from RStudio, although I was not aware of the option `future.fork.enable`. I was under the impression that multiple rsessions indicate multiple instances of R running outside of RStudio. Is there a better way to implement this? – Mxblsdl Jan 08 '20 at 23:34
  • Thanks for adding the session info details. Now I see that you're running rather old versions of future and future.apply, which complicates the discussion and explanation. Is there a reason why you're not updating those? PS. When you update to future (>= 1.14.0), you _will_ see that `future.fork.enable` will effectively become `FALSE` when you run in the RStudio Console, resulting in `plan(multisession)` and not `plan(multicore)` workers. (I actually suspected you were running an old version of future because of this + what `top` outputted.) – HenrikB Jan 09 '20 at 00:09
  • Interesting, there was no reason I was using the older version, other than failure to update. I'll update packages and see if I can recreate the issue. – Mxblsdl Jan 09 '20 at 00:17
  • So, after you've updated, you'll have to set `options(future.fork.enable = TRUE)` at the top of your script/example, in order to reproduce your previous behavior. Then, I recommend that you are explicit about using `plan(multicore, workers = works)` [sic!] instead of `plan(multiprocess, workers = works)` which will use `multicore` or `multisession` depending on environment and above option. (I wish I never introduced `multiprocess` in the first place because of ambiguities like this) – HenrikB Jan 09 '20 at 00:24
  • @HenrikB So I updated the packages and re-ran the tests. I see the difference in setting `options(future.fork.enable = T/F)`. To get back to the original questions, if `future.fork.enable = F` there will be one rsession that uses greater than 100% CPU. I can kill this process by just hitting stop from the console. Would you recommend this method for parallelization? Are there efficiency differences between the two methods? – Mxblsdl Jan 09 '20 at 16:29
  • Since future 1.14.0, `multicore` futures are disabled in RStudio Console because [they are considered unstable there](https://github.com/rstudio/rstudio/issues/2597#issuecomment-482187011). The future package produces an informative warning about this with more details. So, when you use `plan(multicore)`, future will force that to become `plan(sequential)`, **unless you set `options(future.fork.enable = TRUE)`**, when `plan(multicore)` is allowed again. Now, I, and AFAIU, the RStudio folks, recommend against using multicore processing in RStudio, so you are better off using `plan(multisession)`. – HenrikB Jan 09 '20 at 17:08
  • The reason for the above "detour" is that I believe it helps you to understand this important difference between multicore and multisession (and the RStudio recommendations against multicore) before understanding what your (limited) options are when it comes to terminating futures. FYI, multicore is _forked_ parallel processing, so this also affects `parallel::mclapply()`, foreach with `registerDoMC()`, and `registerDoParallel(cores=n)`. – HenrikB Jan 09 '20 at 17:10
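The multisession setup recommended in these comments can be sketched as follows (a minimal example with toy data, not code from the thread; 2 workers stand in for the question's 26):

```r
library(future)
library(future.apply)

# Recommended for RStudio: multisession (background R sessions) instead of
# multiprocess/multicore (forked processing)
plan(multisession, workers = 2)
res <- future_lapply(1:4, function(x) Sys.getpid())

# Switching back to sequential shuts the background workers down cleanly,
# so no stray processes are left behind
plan(sequential)
```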

1 Answer

5

I hope this helps.

library(future)
library(future.apply)
library(listenv)

works <- 26
plan(multiprocess, workers = works)
future_lapply(datas, function(data) {
  # do some long processes
})

# get the PID of each worker by resolving one future per worker
v <- listenv()
for (ii in 1:works) {
  v[[ii]] %<-% {
    Sys.getpid()
  }
}

for (i in 1:works) {
  if (.Platform$OS.type == "windows") {
    # Windows
    system(sprintf("taskkill /F /PID %s", v[[i]]))
  } else {
    # Linux
    system(sprintf("kill -9 %s", v[[i]]))
  }
}

Have a great day.

Revanth Nemani
  • Yes this does the trick. I did have to replace `system(sprintf("kill -9 %s", v[[i]]))` with `tools::pskill(v[[i]])`. May have to do with OS, running Ubuntu 16.04.6. – Mxblsdl Jan 08 '20 at 21:26
  • Note: please don't use `kill -9` (SIGKILL) as a first choice. Start by killing nicely (SIGTERM or SIGQUIT). – wildplasser Jan 08 '20 at 23:54
  • As an update, I noticed that some types of futures will block `v <- listenv::listenv(); for (ii in 1:works) { v[[ii]] %<-% { Sys.getpid() } }` from getting called, since it is a future itself. I looked into the `ps` library, which has a function ps() that returns a tibble of the `top` output with all of the PID numbers. – Mxblsdl Jan 29 '20 at 21:59
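The ps() route mentioned in the last comment can be sketched like this (an illustrative assumption, not code from the thread: it presumes the `ps` package is installed and that the stranded workers show up as `rsession`, and it sends SIGTERM first, per the advice above):

```r
library(ps)

procs <- ps::ps()  # one row per running process, with pid and name columns
# keep processes named "rsession", excluding the current session itself
stray <- procs[procs$name == "rsession" & procs$pid != Sys.getpid(), ]
# terminate politely with SIGTERM rather than SIGKILL
for (p in stray$pid) tools::pskill(p, tools::SIGTERM)
```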