
The following minimal benchmark is compiled single-threaded with -O3 -march=native on each machine. It multiplies matrices that are either square or highly non-square (one dimension = 2).

#include <Eigen/Core>

#include <chrono>
#include <iomanip>
#include <iostream>
#include <string> // for std::string and std::to_string

std::string show_shape(const Eigen::MatrixXf& m)
{
    return "(" + std::to_string(m.rows()) + ", " + std::to_string(m.cols()) + ")";
}

void measure_gemm(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b)
{
    // steady_clock is monotonic; duration_cast avoids assuming the
    // clock's tick period is nanoseconds.
    typedef std::chrono::steady_clock clock;
    const auto start_time = clock::now();
    const std::size_t runs = 10;
    for (std::size_t i = 0; i < runs; ++i)
    {
        Eigen::MatrixXf c = a * b;
    }
    const auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        clock::now() - start_time).count();
    std::cout << std::setw(5) << elapsed_ms <<
        " ms <- " << show_shape(a) << " * " << show_shape(b) << std::endl;
}

int main()
{
    measure_gemm(Eigen::MatrixXf::Zero(2, 4096), Eigen::MatrixXf::Zero(4096, 16384));
    measure_gemm(Eigen::MatrixXf::Zero(1536, 1536), Eigen::MatrixXf::Zero(1536, 1536));
    measure_gemm(Eigen::MatrixXf::Zero(16384, 4096), Eigen::MatrixXf::Zero(4096, 2));
}

The benchmark can easily be run with this Dockerfile:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y build-essential wget cmake git lshw

RUN git clone -b '3.3.7' --single-branch --depth 1 https://github.com/eigenteam/eigen-git-mirror && cd eigen-git-mirror && mkdir -p build && cd build && cmake .. && make && make install && ln -s /usr/local/include/eigen3/Eigen /usr/local/include/Eigen

#ADD wide_vs_tall.cpp .
RUN wget https://gist.githubusercontent.com/Dobiasd/78b32fd4aa2fc83d8da3935d690c623a/raw/5626198a533473157d6a19a824f20ebe8678e9cf/wide_vs_tall.cpp
RUN g++ -std=c++14 -O3 -march=native wide_vs_tall.cpp -o main

ADD "https://www.random.org/cgi-bin/randbyte?nbytes=10&format=h" skipcache

RUN lscpu
RUN lshw -short -C memory

RUN ./main

To run it, fetch the Dockerfile and build the image:

wget https://gist.githubusercontent.com/Dobiasd/8e27e5a96989fa8e4f942900fe609998/raw/8a07fee1a015c8c8e47066a7ac92891850b70a14/Dockerfile
docker build --rm .

This produces the following results:

Tobias' workstation (Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz)

  359 ms <- (2, 4096) * (4096, 16384)
  761 ms <- (1536, 1536) * (1536, 1536)
  597 ms <- (16384, 4096) * (4096, 2)

sysbench --cpu-max-prime=20000 --num-threads=1 cpu run

CPU speed:
    events per second:   491.14

Keith's workstation (Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz)

  437 ms <- (2, 4096) * (4096, 16384)
  724 ms <- (1536, 1536) * (1536, 1536)
  789 ms <- (16384, 4096) * (4096, 2)

sysbench --cpu-max-prime=20000 --num-threads=1 cpu run

CPU speed:
    events per second:   591.58

Why is Tobias' workstation faster in 2 of 3 GEMMs compared to Keith's workstation, despite Keith's workstation showing better sysbench results? I'd expect the i9-9960X to be much faster because its -march=native includes AVX512, and the single-core clock speed is higher.

  • Are the tests running on otherwise idle systems? Are they running the same operating systems? Has the code been profiled on either platform (something like toplev might be interesting)? – Stephen Newell May 16 '20 at 05:17
  • You're only using 1 CPU core, and the client chip has better single-core bandwidth than the 16-core "server" microarchitecture. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020). Presumably that's the bottleneck when one dimension is tiny? Also, correct me if I'm wrong, but you built a binary with `-march=native` on the AVX2 machine and then ran that binary on both? Not taking any advantage of AVX512 on the Skylake-X chip. The SKX i9 does have a max turbo of 4.4GHz, vs. max turbo of 3.9GHz on the SKL i5. – Peter Cordes May 16 '20 at 05:19
  • @StephenNewell Yes, the machines were idle when running the test. I've also repeated them several times and got consistent results. Both machines are Linux based, and the results in Docker are similar to direct runs. I've not yet used toplev, but will have a look into it. Thanks. – Tobias Hermann May 16 '20 at 08:29
  • @PeterCordes Thanks. Yes, I'm only using 1 CPU core on purpose (no OpenMP). Sorry, I should have mentioned that. - I've compiled on both machines individually with `-march=native`. The memory-bandwidth argument makes sense. – Tobias Hermann May 16 '20 at 08:29

1 Answer


As suggested by Peter Cordes in his comment, it seems to boil down to memory throughput.

The results of `mbw 1000` confirm it:

i5-6600:

AVG Method: MEMCPY  Elapsed: 0.13856    MiB: 1000.00000 Copy: 7217.059 MiB/s
AVG Method: DUMB    Elapsed: 0.09008    MiB: 1000.00000 Copy: 11101.625 MiB/s

i9-9960X:

AVG Method: MEMCPY  Elapsed: 0.14682    MiB: 1000.00000 Copy: 6811.131 MiB/s
AVG Method: DUMB    Elapsed: 0.10475    MiB: 1000.00000 Copy: 9546.631 MiB/s