
I just finished building a desktop computer based on an AMD Ryzen 2700X with 32GB RAM (running Ubuntu 18.04). At work, I have a 3-year-old laptop workstation with an Intel i7-6820HQ and 16GB RAM (running Windows 10).

I installed Anaconda on both platforms and ran a custom Python script which relies heavily on basic numpy matrix operations. The code does not involve any GPU-specific computation (my work laptop doesn't have a GPU anyway). The Ryzen runs at 3.7GHz, the laptop i7 at 3.6GHz. Both systems are fully updated.

To my surprise, the code runs in 5 minutes on my work laptop, while it requires 10 minutes on the Ryzen desktop!

The latest Ryzen 2700X is supposed to be much faster than a 3-year-old high-end laptop Intel processor, so why is it 2x slower?

  • Is it due to Ubuntu being sub-optimal in some way as opposed to Windows 10 for the Ryzen?

  • Is it due to Intel being better suited than AMD for Python numerical work?

  • Anything else?

Thanks for your help in understanding what is going on.

PythonistL
  • This arguably does not belong on stackoverflow. Try superuser (another stackexchange site). In any case, benchmarking is impossible without code against which to benchmark. If you want help, you'll need to provide a reproducible example. – Him Nov 26 '18 at 23:23
  • Thank you for your reply. I will try to post to superuser: once done, the post on stackoverflow will be suppressed to avoid creating a duplicate. I cannot share the code as I use it for my work, but I'll try to find the time to create a simpler test script for sharing. – PythonistL Nov 26 '18 at 23:27
  • @Scott if the answer is dependent on the code you've written, I'd argue it belongs here. It's only unfortunate that the code isn't shareable, a simple benchmark test that shows the difference would be very helpful. The lack of code is the only knock I have against this question, and the single answer that exists as I write this is illuminating. – Mark Ransom Nov 27 '18 at 03:05
  • @MarkRansom Does it depend on the code? The top answer suggests that this is actually a hardware question. [See this meta](https://meta.stackexchange.com/questions/57998/hardware-questions-and-stack-exchange) – Him Nov 27 '18 at 04:08
  • @Scott Normally Ryzen and i7 are pretty evenly matched, it requires very specific code to produce a 2x difference. – Mark Ransom Nov 27 '18 at 04:13
  • @MarkRansom ¯\\_(ツ)_/¯ – Him Nov 27 '18 at 04:18
  • Also have a look on the blas package you are using in both cases. This can also have a high impact on performance. When running on a MKL backend the Ryzen may not see any AVX2/FMA code at all. https://github.com/fo40225/Anaconda-Windows-AMD – max9111 Dec 04 '18 at 11:50

2 Answers


It's a software issue: by default, Anaconda ships with Intel's MKL as the BLAS backend, which purposefully cripples performance on AMD CPUs. You can instead install the non-MKL variant, which uses OpenBLAS, and you'll see a huge performance boost. You don't need to reinstall Anaconda: just uninstall numpy and mkl, then install a numpy build linked against OpenBLAS.
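A quick way to check which BLAS backend a given environment is actually linked against (a minimal sketch; interpreting the output is up to you):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this numpy build was compiled against;
# look for "mkl" vs "openblas" in the output.
np.show_config()
```

Anaconda also provides a `nomkl` metapackage (e.g. `conda install nomkl numpy scipy`) as a documented route to the OpenBLAS-based builds.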

anymous.asker
  • Another hint: when using MKL with a Ryzen CPU, you can also set an environment variable `export MKL_DEBUG_CPU_TYPE=5`, then MKL will run a lot faster as it will use optimized code paths that work well on Ryzen, and it ends up running faster than OpenBLAS (as of 0.3.8). I don't remember exactly where I read it, but I had it in my `.profile` for some time and it worked very well. – anymous.asker Feb 15 '20 at 17:15

> numpy matrix operations

Intel Skylake has significantly better FMA throughput (two 256-bit vector FMAs per clock) than Ryzen (two 128-bit, or one 256-bit, vector FMAs per clock). See https://agner.org/optimize/ for x86 microarchitecture details, and the Q&A "FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2" for a summary that includes Ryzen.

With data hot in cache (which a well-optimized matmul achieves through cache-blocking), matmul can bottleneck on FMA execution-unit throughput.

Or on L1d SIMD load/store bandwidth, where Skylake has more than twice Ryzen's: Skylake can sustain close to 2x 256-bit loads + 1x 256-bit store per clock, while Ryzen can sustain 2x 128-bit cache accesses per clock, at most one of which can be a store.

So it's totally reasonable for Intel's single-threaded / per-core matmul and FMA throughput to be twice that of a Ryzen core.
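To put a number on per-core throughput yourself, here is a minimal benchmark sketch (assuming the installed BLAS honors the standard thread-count environment variables, which OpenBLAS and MKL do):

```python
import os
# Pin BLAS to one thread so we measure per-core throughput
# (must be set before numpy is imported).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import time
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)

a @ b  # warm-up so the timed run excludes first-call overhead

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# An n x n matmul performs ~2*n^3 floating-point operations.
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s single-threaded")
```

Running this on both machines would show directly whether the gap is in raw per-core FMA throughput.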


Are you multi-threading to take advantage of all the cores in each machine? The 2700X is an 8-core CPU, while the 6820HQ is a 4-core chip.

If your workload can / is taking advantage of multiple cores, then maybe it's an L3 cache bandwidth limitation that's making the difference, assuming they're both configured correctly and actually running at 3.6 / 3.7 GHz. Or maybe there's something creating a 4x per-core perf difference.

Peter Cordes
  • Thank you very much for this clear reply. My code does not involve multi-threading, so it is obviously due to the FMA throughput. I was really not aware of this (neither were the websites I checked to compare Intel and AMD before buying), too bad I made the wrong choice – PythonistL Nov 26 '18 at 23:42
  • @PythonistL: numpy can potentially use multiple threads on its own. You should check with `top` or the Windows equivalent how many cores are in use when your code is running. And if it's only 1, you might have big speedups possible from configuring numpy to multi-thread, if the matrices are big enough. – Peter Cordes Nov 26 '18 at 23:47
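As the last comment suggests, it's worth checking whether numpy's matmul is already multi-threaded. Besides watching `top`, a quick in-process check is to compare process CPU time against wall-clock time during a matmul (a sketch; the exact ratio depends on the BLAS and matrix size):

```python
import time
import numpy as np

a = np.random.rand(2048, 2048)

wall0, cpu0 = time.perf_counter(), time.process_time()
a @ a
wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0

# A cpu/wall ratio near 1 means the BLAS ran single-threaded;
# a ratio well above 1 means multiple cores were already in use.
print(f"wall {wall:.3f}s, cpu {cpu:.3f}s, ratio {cpu / wall:.1f}")
```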