3

I am temporarily using an M1 Mac with R, but my code runs on only one core. I have edited my question to include a reproducible example. I would like to run functions on big data using multiple cores.

This is the reproducible example:

library(eulerr)
library(microbiome)
#devtools::install_github('microsud/microbiomeutilities')
library(microbiomeutilities)

data("zackular2014")
pseq <- zackular2014

table(meta(pseq)$DiseaseState, useNA = "always")
pseq.rel <- microbiome::transform(pseq, "compositional")
# Make a list of DiseaseStates.

disease_states <- unique(as.character(meta(pseq.rel)$DiseaseState))
print(disease_states)

list_core <- c() # an empty object to store information

for (n in disease_states){ # for each variable n in DiseaseState
    #print(paste0("Identifying Core Taxa for ", n))
    
    ps.sub <- subset_samples(pseq.rel, DiseaseState == n) # Choose sample from DiseaseState by n
    
    core_m <- core_members(ps.sub, # ps.sub is the phyloseq object holding only samples from n
                           detection = 0.001, # relative abundance detection threshold
                           prevalence = 0.75) # present in at least 75% of samples
    print(paste0("No. of core taxa in ", n, " : ", length(core_m))) # print core taxa identified in each DiseaseState.
    list_core[[n]] <- core_m
}

mycols <- c(nonCRC="#d6e2e9", CRC="#cbf3f0", H="#fcf5c7") 
plot(venn(list_core),
     fills = mycols)

I would like to run the first for loop using multiple cores, but I don't know how to write the function using mclapply. Thanks.

i.b
  • What commands/functions are you using in your analyses? If you're using data.table syntax, and you have an intel processor, you can compile data.table with openMP enabled to use multiple cores e.g. https://stackoverflow.com/a/65930969/12957340 – jared_mamrot Oct 23 '22 at 23:20
  • I have edited my question to include a reproducible example to get a practical solution on the function I want to run. – i.b Oct 24 '22 at 19:48
  • FWIW, two of the below answers were for the original question (https://stackoverflow.com/revisions/74174426/1), which was quite differently phrased from the updated question – HenrikB Nov 02 '22 at 03:23

3 Answers

3

Answer: No. It's not possible to magically make R run in parallel everywhere.

To run code in parallel, you have to orchestrate the parallelization yourself using some of the parallel-processing frameworks, cf. https://cran.r-project.org/web/views/HighPerformanceComputing.html.

Some packages provide functions that can run in parallel. For many of them, parallelization is controlled via a function argument; for others, via an R option or another setting. You need to read the documentation for those functions to find out exactly how.
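For example (a minimal sketch; the exact argument names vary from package to package), `boot::boot()` exposes parallelization directly through its `parallel` and `ncpus` arguments:

```r
library(boot)

# Bootstrap statistic: ratio of mean 1930 to mean 1920 population (boot's built-in `city` data)
city_ratio <- function(d, i) with(d[i, ], mean(x) / mean(u))

# Sequential (the default)
b1 <- boot(city, city_ratio, R = 999)

# Parallel: "multicore" forks processes on macOS/Linux; use "snow" on Windows
b2 <- boot(city, city_ratio, R = 999,
           parallel = "multicore", ncpus = 2)
```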

HenrikB
3

Unfortunately, that's not how parallel computation works in R. R is not intrinsically multithreaded; in general, functions from packages like parallel and foreach are used within a particular workflow to specify that particular chunks of the computation should be run in parallel. If you're writing your own workflow, you would typically use foreach (or tools from one of the other parallel-computation packages, e.g. parallel, future, or furrr) to tell R to run chunks of it in parallel; the "using foreach" vignette that comes with the package shows how.
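A minimal sketch of that pattern, assuming the doParallel backend package is installed:

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)     # start 2 worker R processes
registerDoParallel(cl)   # register them as the foreach backend

# %dopar% sends each iteration to a worker; results come back as a list
res <- foreach(i = 1:4) %dopar% {
  i^2
}

stopCluster(cl)          # shut the workers down when done
```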

As @HenrikB and @jared_mamrot say, some packages/functions have parallelization built in, e.g. the vroom package for loading data will automatically use all of the cores on your machine (this is not terribly well documented, e.g. see here); if you're using data.table, you can get it to use multiple cores (although the details may be complex depending on your OS/chipset).
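In data.table's case the thread count can be inspected and set directly (a small sketch):

```r
library(data.table)

getDTthreads()   # how many threads data.table will currently use
setDTthreads(4)  # cap (or raise) the thread count for its internal OpenMP code
```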

The other level at which R can relatively easily do parallel computation is if the linear algebra libraries (BLAS/LAPACK) are configured for multithreaded computation; e.g. see here. If your workflow uses a lot of (the right kind of) linear algebra, this can make a lot of difference.
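You can check which BLAS/LAPACK libraries your R is linked against, and get a rough sense of their effect, with something like this (a sketch; the reported paths depend on your installation):

```r
# Which BLAS/LAPACK is R using? (e.g. reference BLAS, OpenBLAS, or Apple's Accelerate)
sessionInfo()$BLAS
sessionInfo()$LAPACK

# A crude timing test: a multithreaded BLAS speeds this up considerably
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% m)
```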

Ben Bolker
1

One potential solution is to convert your for-loop into a function and use future_lapply() from the future.apply package for each disease_state, e.g.

#install.packages("eulerr")
library(eulerr)
#BiocManager::install("microbiome")
library(microbiome)
#devtools::install_github("microsud/microbiomeutilities")
library(microbiomeutilities)
library(future.apply)
#> Loading required package: future
library(profvis)

data("zackular2014")
pseq <- zackular2014

table(meta(pseq)$DiseaseState, useNA = "always")
#> 
#>    CRC      H nonCRC   <NA> 
#>     30     30     28      0
pseq.rel <- microbiome::transform(pseq, "compositional")

disease_states <- unique(as.character(meta(pseq.rel)$DiseaseState))

# Original for loop method:
list_core <- c() # an empty object to store information
for (n in disease_states){ # for each variable n in DiseaseState
  #print(paste0("Identifying Core Taxa for ", n))
  
  ps.sub <- subset_samples(pseq.rel, DiseaseState == n) # Choose sample from DiseaseState by n
  
  core_m <- core_members(ps.sub, # ps.sub is the phyloseq object holding only samples from n
                         detection = 0.001, # relative abundance detection threshold
                         prevalence = 0.75) # present in at least 75% of samples
  print(paste0("No. of core taxa in ", n, " : ", length(core_m))) # print core taxa identified in each DiseaseState.
  list_core[[n]] <- core_m
}
#> [1] "No. of core taxa in nonCRC : 11"
#> [1] "No. of core taxa in CRC : 8"
#> [1] "No. of core taxa in H : 14"

mycols <- c(nonCRC="#d6e2e9", CRC="#cbf3f0", H="#fcf5c7") 
plot(venn(list_core),
     fills = mycols)


## multicore future_lapply method:
# make your loop into a function
run_each_disease_state <- function(disease_state) {
  # workaround for subset_samples()'s non-standard evaluation: make the
  # variable visible in the global environment so the condition can find it
  assign("disease_state", disease_state, envir = globalenv())
  ps.sub <- subset_samples(pseq.rel,
                           DiseaseState == disease_state)
  core_m <- core_members(ps.sub,
                         detection = 0.001,
                         prevalence = 0.75)
  print(paste0("No. of core taxa in ", disease_state,
               " : ", length(core_m)))
  return(core_m)
}

plan(multisession) # without setting a plan, future_lapply() falls back to sequential execution
list_core <- future_lapply(disease_states, run_each_disease_state)
#> [1] "No. of core taxa in nonCRC : 11"
#> [1] "No. of core taxa in CRC : 8"
#> [1] "No. of core taxa in H : 14"
names(list_core) <- disease_states

mycols <- c(nonCRC="#d6e2e9", CRC="#cbf3f0", H="#fcf5c7") 
plot(venn(list_core),
     fills = mycols)


## profiling both methods
p1 <- profvis({
list_core <- c() # an empty object to store information
for (n in disease_states){ # for each variable n in DiseaseState
  #print(paste0("Identifying Core Taxa for ", n))
  
  ps.sub <- subset_samples(pseq.rel, DiseaseState == n) # Choose sample from DiseaseState by n
  
  core_m <- core_members(ps.sub, # ps.sub is phyloseq selected with only samples from g 
                         detection = 0.001, # 0.001 in atleast 90% samples 
                         prevalence = 0.75)
  print(paste0("No. of core taxa in ", n, " : ", length(core_m))) # print core taxa identified in each DiseaseState.
  list_core[[n]] <- core_m
}

mycols <- c(nonCRC="#d6e2e9", CRC="#cbf3f0", H="#fcf5c7") 
plot(venn(list_core),
     fills = mycols)
})
#> [1] "No. of core taxa in nonCRC : 11"
#> [1] "No. of core taxa in CRC : 8"
#> [1] "No. of core taxa in H : 14"
htmlwidgets::saveWidget(p1, "singlecore.html")
browseURL("singlecore.html")


p2 <- profvis({
# make your loop into a function
run_each_disease_state <- function(disease_state) {
  # workaround for subset_samples()'s non-standard evaluation (see above)
  assign("disease_state", disease_state, envir = globalenv())
  ps.sub <- subset_samples(pseq.rel,
                           DiseaseState == disease_state)
  core_m <- core_members(ps.sub,
                         detection = 0.001,
                         prevalence = 0.75)
  print(paste0("No. of core taxa in ", disease_state,
               " : ", length(core_m)))
  return(core_m)
}

list_core <- future_lapply(disease_states, run_each_disease_state)
names(list_core) <- disease_states

mycols <- c(nonCRC="#d6e2e9", CRC="#cbf3f0", H="#fcf5c7") 
plot(venn(list_core),
     fills = mycols)
})
#> [1] "No. of core taxa in nonCRC : 11"
#> [1] "No. of core taxa in CRC : 8"
#> [1] "No. of core taxa in H : 14"
htmlwidgets::saveWidget(p2, "multicore.html")
browseURL("multicore.html")

Created on 2022-10-25 by the reprex package (v2.0.1)

This cuts your execution time from ~130ms to ~60ms for this example (singlecore.html vs multicore.html), but the time-saving depends on the actual code you're running and the amount of time required to serialise and unserialise the chunks of your data.
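Since the question asks about mclapply specifically: on macOS/Linux a roughly equivalent call uses forked processes via `parallel::mclapply()` (a sketch, reusing the `run_each_disease_state()` function above; note that on Windows `mc.cores` must be 1, and the global-environment workaround applies to each forked child's copy of the workspace):

```r
library(parallel)

# Fork one worker per disease state, leaving one core free for the OS
list_core <- mclapply(disease_states, run_each_disease_state,
                      mc.cores = max(1, detectCores() - 1))
names(list_core) <- disease_states
```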

jared_mamrot