
I am new to the concept of parallel computing. I am trying to apply it to a script in which a loop builds about 1,000 regression models and makes predictions each time from those models' coefficients. The data sets in each case are large, and the models involve dummy codes and weights, which slow the process down even further. Hence, I am trying to replace the 'for' loop with foreach.

I am trying to use the doParallel and foreach libraries and set the number of cores with registerDoParallel(). I have a Windows 10 machine. My understanding is that calls like detectCores() and Sys.getenv('NUMBER_OF_PROCESSORS') will return the number of "logical processors" rather than cores:

> detectCores()
  [1] 4
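
(As far as I can tell, detectCores() also accepts a logical argument; setting logical = FALSE should return the physical-core count instead, at least on Windows:)

library(parallel)

detectCores(logical = TRUE)   # logical processors; 4 on this machine
detectCores(logical = FALSE)  # physical cores; 2 on this machine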

My Task Manager shows these specifications:

[Task Manager screenshot: 2 physical cores, 4 logical processors]

I tried to experiment a bit with what the "right"(?) number of cores to set with registerDoParallel() is, and realised that it will accept any number. I experimented a bit further and found that the number does make a difference. I adapted the script below from the vignette by the creators of these two libraries (p. 3) to compare serial with parallel execution at various numbers of cores.

x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]
trials <- 10000

library(foreach)
library(doParallel)

# detectCores()
# Sys.getenv('NUMBER_OF_PROCESSORS')
registerDoParallel(cores = 4)  # varied across runs: 2, 4, 12, 24
getDoParWorkers()

stimes <- numeric(15)
ptimes <- numeric(15)

# Sequential execution (%do%)
for (i in 1:15) {
  stimes[i] <- system.time({
    r <- foreach(icount(trials), .combine = cbind) %do% {
      ind <- sample(100, 100, replace = TRUE)
      result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
      coefficients(result1)
    }
  })[3]
}

# Parallel execution (%dopar%)
for (i in 1:15) {
  ptimes[i] <- system.time({
    r <- foreach(icount(trials), .combine = cbind) %dopar% {
      ind <- sample(100, 100, replace = TRUE)
      result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
      coefficients(result1)
    }
  })[3]
}
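
To avoid editing the script for every worker count, the same benchmark can be wrapped in a small helper. This is just a sketch (bench_par is a made-up name), using an explicit cluster so the workers can be stopped cleanly between runs:

library(foreach)
library(doParallel)
library(iterators)

# Run the bootstrap once and return the elapsed time for a given worker count.
bench_par <- function(n_workers, trials, x) {
  cl <- makeCluster(n_workers)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))
  system.time({
    foreach(icount(trials), .combine = cbind) %dopar% {
      ind <- sample(100, 100, replace = TRUE)
      coefficients(glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit)))
    }
  })[3]
}

# Elapsed seconds for each candidate worker count
sapply(c(2, 4, 12, 24), bench_par, trials = trials, x = x)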

Here are the results, measured as the mean time in seconds for one iteration. There seems to be a sweet spot at 12 "cores".

process       mean   sd
sequential    53.8   5.4
"2-core"      32.3   1.9
"4-core"      28.7   2.6
"12-core"     22.9   0.5
"24-core"     27.5   1.9

I even compared the mean performance of, say, "2-core" and "12-core" with t-tests, and the differences are not due to chance.
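
(The comparison itself is just a two-sample t-test on the stored timing vectors; as a sketch, with ptimes_2 and ptimes_12 as hypothetical names for the ptimes vectors saved from the "2-core" and "12-core" runs:)

t.test(ptimes_2, ptimes_12)  # Welch two-sample t-test on per-run timings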

My questions are:

1. Based on the above, is it good practice to run my scripts in "12-core mode" whenever the code can be parallelised?

2. I want to use a higher-performance computer to run my script; do I need to repeat this process to find the optimal (= fastest) number of cores?

  • What if I told you that using an optimized distribution like Revolution R could quadruple performance without using any workers? On a quad machine, `svd` on a large array runs 7 times faster because the function itself uses SIMD commands and Intel's math libraries. The code is a lot cleaner too – Panagiotis Kanavos Jan 09 '17 at 15:38
  • 1
    What if I told you that that was *not* unique to Revolution R? You have been able to combine the Intel MKL with R for at least a decade. And please explain which code is cleaner. – Dirk Eddelbuettel Jan 09 '17 at 15:39
  • As for your specific question, CPUs have a lot of tricks apart from SIMD, like caching, prefetching data and hyper-threading. If anything, your timings show that your code doesn't take proper advantage of even two cores – Panagiotis Kanavos Jan 09 '17 at 15:44
  • @DirkEddelbuettel didn't say it is. I said that an optimized distribution could quadruple performance. As for cleaner - calling an optimized function like `svd` is definitely cleaner than trying to rewrite the same function to use workers. The combination of SIMD and multicore makes `svd` run 7 times faster on a quad – Panagiotis Kanavos Jan 09 '17 at 15:48
  • That is the beauty of it all --- for unchanged R code you get _the same_ performance gains. BLAS and LAPACK are interfaces, and the (better, but commercial) MKL implementation works with _any_ proper R build. The standard R, Radford Neal's pqR, ... you name it. – Dirk Eddelbuettel Jan 09 '17 at 15:50
  • @PanagiotisKanavos It is not my code. It is the code used by the creators of 'doParallel' and 'foreach' libraries to show a "good example of bootstrapping. Let’s see how long it takes to run 10,000 bootstrap iterations in parallel on 2 cores". [link](https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf) – Tony Jan 09 '17 at 15:56
  • @Tony `doParallel` and `parallel` use multiple worker *processes* which is an inefficient way to parallelize code. If anything the timings show inefficient parallelization even with 2 cores (~66%). There are many reasons for this - interprocess communication is always more expensive than thread-to-thread, plus worker processes may end up competing for the *same* CPU. Increasing the number of workers can cover up these inefficiencies up to a point. Using larger data chunks can also hide the overhead. The only way to get a definite answer though is to use a profiler – Panagiotis Kanavos Jan 09 '17 at 16:12

1 Answer


In practice, it is best to use the same number of computing threads as hardware (physical) cores, which is 2 in your example.


More details:

If your workload is compute-intensive, running more threads than there are hardware cores makes them compete for resources and degrades performance. However, in some cases, such as your example, the workload involves a lot of memory access per computation, so extra threads can help hide memory latency. (CPUs are latency-oriented and hide some of this latency automatically.) In your case, more than 2 threads yields further improvement, but not by much.

Therefore, rather than re-tuning the thread count (how many threads should be used?) on every different system for each run, it is better to simply use the number of hardware cores in your parallel computing program.
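
In code, that recommendation looks like the following sketch, using the standard parallel/doParallel API:

library(parallel)
library(doParallel)

# One worker per physical core rather than per logical processor
registerDoParallel(cores = detectCores(logical = FALSE))
getDoParWorkers()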

A good introduction to parallel computing with R can be found here.

  • Thank you for your answer. Just to clarify and make sure I am following your point, would you use the term "computing threads" interchangeably with the term "logical processors" ? – Tony Jan 26 '17 at 15:15
  • 1
    @Tony, actually, computing threads is the concepts in software level so it means how many threads/procedures you set such as 2/4/12/24 you have tried. On the other hand, "logical/physical processors" refers to hardware resource which is fixed for a machine, for example, 2 physical cores and 4 logical processors in your machine. Then, we consider how to map computing threads into hardware cores. In here, I recommend the strategy of ONE computing thread to ONE physical core. btw, the similar one in [here](http://stackoverflow.com/questions/28829300/doparallel-cluster-vs-cores/34717363#34717363) – Patric Jan 27 '17 at 00:35