Scala Futures are slow with many cores

Question

For a study project I have written a Scala application that uses a bunch of futures to do a parallel computation. I noticed that on my local machine (4 cores) the code runs faster than on the many-core server of our computer science institute (64 cores). Now I want to know why this is.

Task in Detail

The task was to create random boolean k-CNF formulas with n different variables randomly distributed over m clauses and then see how at which m/n combination the probability that a formula is solvable drops below 50% for diffent random distributions. For this I have implemented a probabilistic k-SAT algorithm, a clause generator and some other code. The core is a function that takes n and m es well as the generator function, runs 100 futures and waits for the result. The function looks like this:

Code in question

def avgNonvalidClauses(n: Int, m: Int)(implicit clauseGenerator: ClauseGenerator) = {

    val startTime = System.nanoTime

    /** how man iteration to build the average **/
    val TRIES = 100

    // do TRIES iterations in parallel 
    val tasks = for (i <- 0 until TRIES) yield future[Option[Config]] {
        val clause = clauseGenerator(m, n)
        val solution = CNFSolver.probKSat(clause)
        solution
    }

    /* wait for all threads to finish and collect the results. we will only wait
     * at most TRIES * 100ms (note: flatten filters out all
     * None's) */
    val results = awaitAll(100 * TRIES, tasks: _*).asInstanceOf[List[Option[Option[Config]]]].flatten

    val millis = Duration(System.nanoTime - startTime, NANOSECONDS).toMillis
    val avg = (results count (_.isDefined)) /  results.length.toFloat

    println(s"n=$n, m=$m => $avg ($millis ms)")

    avg
  }

Problem

On my local machine I get these results

[info] Running Main 
n=20, m=120 => 0.0 (8885 ms)
n=21, m=121 => 0.0 (9115 ms)
n=22, m=122 => 0.0 (8724 ms)
n=23, m=123 => 0.0 (8433 ms)
n=24, m=124 => 0.0 (8544 ms)
n=25, m=125 => 0.0 (8858 ms)
[success] Total time: 53 s, completed Jan 9, 2013 8:21:30 PM

On the 64-core server I get:

[info] Running Main 
n=20, m=120 => 0.0 (43200 ms)
n=21, m=121 => 0.0 (38826 ms)
n=22, m=122 => 0.0 (38728 ms)
n=23, m=123 => 0.0 (32737 ms)
n=24, m=124 => 0.0 (41196 ms)
n=25, m=125 => 0.0 (42323 ms)
[success] Total time: 245 s, completed 09.01.2013 20:28:22

However, I the full load on both machines (the server averages around at a load of 60 to 65) so there are running enough threads. Why is this? Am I doing something completely wrong?

My local machine has an "AMD Phenom(tm) II X4 955 Processor" CPU the server is uses "AMD Opteron(TM) Processor 6272". The local CPU has 6800 bogomips, the servers 4200. So, while the local CPU is a 1/3 faster, there are 12 times more cors on the server.

Additional

If have a trimmed down example of my code pushed to github so you can try for yourselve if you are intereste: https://github.com/Blattlaus/algodemo (It's an sbt project using Scala 2.10).

Updates

I've eleminated any randomness by seeding the random number generators with 42. This changes nothing
I've changed the testset. Now the results are even more astonishing (the server is 5 times slower!) Note: all outputs for the average percentage of not solvable clauses are zeor because of the input. This is normal and expected.
Added info about CPUs
I've noticed that calls to Random.nextInt() are a factor of 10 slower on the Server. I have wrapped all calls in a helper that measures the runtime a prints is to the console if they are slower then 10ms. On my local machine i get a few, and the typically are araound 10-20ms. On the server I get much mure calls and they tend to be above 100ms. Could this be the issue???

How many other tasks is that 64 core server performing while it tries to execute yours? — Robert Harvey, Jan 09 '13 at 18:45
Your second set isn't returning any results (all `NaN`). Maybe try timing it too see how long it actaully takes to produce results. Also, how fast are the cores compared to your local machine? — Luigi Plinge, Jan 09 '13 at 18:54
@RobertHarvey afaik the default executor for Scala creates 2 threads for each core, so there should be 128 threads and 100 running (each for one loop) — Martin Thurau, Jan 09 '13 at 19:25
To find out whether each core is slow or whether there is outrageous overhead, try wrapping each computation with `val t0 = System.nanoTime` and `val t1 = System.nanoTime` and returning `(t1-t0)*1e-9` along with your answer in the result. — Rex Kerr, Jan 09 '13 at 20:08
I don't know which Futures library you're using, but it would be a good idea to switch to Akka's Future (default for Scala 2.10) if you're using one of the other libraries. — Randall Schulz, Jan 09 '13 at 20:38
Out of curiosity, try using scala.concurrent.Future instead of scala.actors.Future. — Marius Danila, Jan 09 '13 at 21:24
might be a stupid suggestion, but are you sure the server is not limiting amount of cores your process can use? Also please provide JVM and OS versions. — Denis Tulskiy, Jan 10 '13 at 17:44
i wonder if you could use cpusets ( http://man7.org/linux/man-pages/man7/cpuset.7.html ) to run a controlled experiment with different numbers of available cores on a given machine. — Rob Starling, Mar 15 '14 at 18:19
Could the server be running low on entropy from other processes? I don't know if Random is implemented to use the system entropy source or not.. — Daenyth, May 19 '15 at 14:23

score 2 · Answer 1 · edited May 23 '17 at 12:24

You have already figured out the answer in that the problem is Random.nextInt() which uses a AtomicLong(). If this is being accessed frequently from different threads then you will get cache thrashing, which will be worse on your 64 core computer because the caches will be further apart (electrically) and hence it will take longer to get the necessary cache line locks.

See this stackoverflow answer for more details, and the solution on how to avoid this problem (which is basically to use a thread local random number generator): Contention in concurrent use of java.util.Random

score 1 · Answer 2 · edited May 23 '17 at 12:15

1

Operations on denormalized floating numbers, could take an order of magnitude longer on x86 architecture. See:

Why does changing 0.1f to 0 slow down performance by 10x?

Haven't examined your code, but given that you return NaN that might be the case. Try removing randomness from your test to verify that hypothesis.

edited May 23 '17 at 12:15

Community

1
1

answered Jan 09 '13 at 19:09

Jakozaur

1,957
3
18
20

1

I don't think that this is the problem. The only floating point operation is the calculation of the average. All other numbers are either Int or Long (and some bools). – Martin Thurau Jan 09 '13 at 19:34