
I ran into a performance problem that puzzles me. For my studies, I am working on a diffusion-limited aggregation simulation (basically a fancy random walker). To speed things up, I parallelized my code using a parallel stream's map, so that multiple walkers are spawned and walk independently until they hit something, then return their position. Performance scaled quite well on my laptop using 1-7 threads.
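
For illustration, here is a minimal sketch of the pattern described above. It is not the actual simulation code: the class name `WalkerSketch`, the small `boolean` lattice (standing in for the real array), the wrap-around boundaries, and the per-walker seeding are all assumptions made for the example.

```java
import java.util.List;
import java.util.SplittableRandom;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class WalkerSketch {
    static final int SIZE = 64;                        // hypothetical lattice edge length
    static final boolean[][] occupied = new boolean[SIZE][SIZE];
    static { occupied[SIZE / 2][SIZE / 2] = true; }    // single seed particle in the centre

    // True if any of the four neighbours (with wrap-around) is occupied.
    static boolean touchesCluster(int x, int y) {
        return occupied[(x + 1) % SIZE][y] || occupied[(x + SIZE - 1) % SIZE][y]
            || occupied[x][(y + 1) % SIZE] || occupied[x][(y + SIZE - 1) % SIZE];
    }

    // One walker: start on a random free site and step randomly until it touches the cluster.
    static int[] walk(long seed) {
        SplittableRandom rnd = new SplittableRandom(seed); // one generator per walker, nothing shared
        int x, y;
        do { x = rnd.nextInt(SIZE); y = rnd.nextInt(SIZE); } while (occupied[x][y]);
        while (!touchesCluster(x, y)) {
            switch (rnd.nextInt(4)) {
                case 0:  x = (x + 1) % SIZE; break;
                case 1:  x = (x + SIZE - 1) % SIZE; break;
                case 2:  y = (y + 1) % SIZE; break;
                default: y = (y + SIZE - 1) % SIZE; break;
            }
        }
        return new int[] {x, y};
    }

    // Spawn n independent walkers through a parallel stream map and collect where they stuck.
    static List<int[]> runWalkers(int n) {
        return IntStream.range(0, n).parallel()
                .mapToObj(i -> walk(i))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<int[]> hits = runWalkers(8);
        // The lattice is only read during the batch and only written here, after the batch.
        for (int[] p : hits) occupied[p[0]][p[1]] = true;
        System.out.println("walkers stuck: " + hits.size());
    }
}
```

In this sketch the walkers only read the shared lattice while a batch is running; the single synchronization point is the sequential update after each batch.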

Now I wanted to run bigger simulations, so I got myself a bigger machine. The result was a massive performance decrease. I compared both systems: my laptop with an Intel i7-4712HQ (8 threads, Geekbench 12k) was three times faster than my server with 4x Intel Xeon E7-4870 (80 threads, Geekbench 35k).

I checked the load with htop during the run: the laptop showed an average load of 8 versus 70 on the server, so the cores are being utilized and are not idling.

Can that actually be right? Both machines are running Ubuntu and Oracle Java 8. I would greatly appreciate any suggestion on where to look for a mistake.

Bests

PS: I can post the code or provide more details if needed.

shmosel
SiOx
  • Is your server **only** running your application, or is it running a large number of applications? – Elliott Frisch Oct 24 '18 at 20:31
  • One core can only deal with a certain number of threads at a time (your CPU has 4 cores and each core can deal with 8 threads), so if you reach that limit your application cannot get any faster, since the CPU will rotate between the threads. I don't know, though, if that's the source of the problem. – Glains Oct 24 '18 at 20:39
  • @Elliott No, the server is **only** running my application, nothing else. – SiOx Oct 24 '18 at 20:39
  • @Glains the threads are spawned according to the core count: No_threads = (Cores x 2 - 1) – SiOx Oct 24 '18 at 20:41
  • Depends on the dataset and the code. If the dataset is being processed fast enough, then 8 threads on the same CPU with a much faster cache can process it in less time than it takes all 70 threads to start up and run with a possibly slower multiprocessor cache. – tsolakp Oct 24 '18 at 20:48
  • @tsolakp can you elaborate on that? Do you mean the caching from RAM -> CPU or anywhere -> RAM? Just for clarification, all threads are working on the same 12 GB Nd4j array in RAM; there is virtually no external data involved. – SiOx Oct 24 '18 at 21:21
  • @SiOx I meant L1 and L2 cache on CPU. – tsolakp Oct 24 '18 at 21:23
  • Parallelism is _not_ universally a speedup. It is _often_, even _usually_, a slowdown. – Louis Wasserman Oct 24 '18 at 21:28
  • @tsolakp Alright. Sounds interesting. Do you have an idea how to measure or verify this? – SiOx Oct 24 '18 at 21:36
  • @LouisWasserman yes, I know, but this problem is perfect for parallelization, as the threads are independent and rarely have to be synchronized. For this reason the speed-up on the laptop scaled nearly perfectly: 2 threads : 1 thread = (nearly) doubled perf; 4 : 2 = again (nearly) doubled perf, and so on. – SiOx Oct 24 '18 at 21:38
  • @SiOx. This might be helpful: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – tsolakp Oct 24 '18 at 21:49
  • @SiOx: Try with 8 threads on the 80-thread platform? – Ry- Oct 24 '18 at 22:37
  • In the laptop all threads share the same cache, whilst in the other machine there's going to be cache-page collisions that'll require going back to RAM. – Perdi Estaquel Oct 25 '18 at 02:04
  • @all just a quick update. First of all - thank you guys for the help! I am now in the process of benchmarking the code for different thread counts and will update the question asap. At the weekend I will also find some time to get started with the Java Microbenchmark Harness to, hopefully, provide some more details to get to the bottom of this. – SiOx Oct 25 '18 at 20:28
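
The thread-count comparison suggested in the comments could be sketched roughly as follows. This is only an assumption-laden illustration: `ThreadSweepSketch` and `work` are hypothetical stand-ins for the real walker code, the wall-clock timing is crude and no replacement for a proper JMH benchmark, and running a parallel stream inside a dedicated `ForkJoinPool` to limit its parallelism relies on commonly observed implementation behaviour rather than on anything the stream API guarantees.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class ThreadSweepSketch {
    // Hypothetical CPU-bound stand-in for one walker (a simple xorshift loop).
    static long work(long seed) {
        long x = seed * 0x9E3779B97F4A7C15L + 1;
        for (int i = 0; i < 200_000; i++) {
            x ^= x << 13;
            x ^= x >>> 7;
            x ^= x << 17;
        }
        return x;
    }

    // Runs `tasks` work items on a pool with `threads` workers and returns the elapsed milliseconds.
    static long timeWithThreads(int threads, int tasks) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            Callable<Long> job = () ->
                    LongStream.range(0, tasks).parallel().map(ThreadSweepSketch::work).sum();
            long start = System.nanoTime();
            pool.submit(job).join();   // the stream is started from inside the dedicated pool
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int logicalCpus = Runtime.getRuntime().availableProcessors();
        // Double the thread count each round, up to the number of logical CPUs.
        for (int threads = 1; threads <= logicalCpus; threads *= 2) {
            System.out.println(threads + " threads: " + timeWithThreads(threads, 2_000) + " ms");
        }
    }
}
```

If the times stop improving, or get worse, beyond some thread count on the server but not on the laptop, that would at least narrow down where the scaling breaks.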

0 Answers