
I wrote some test cases on the server and found that Eigen::Matrix is much slower than std::vector. I do not know why.

The server's configuration is listed below:

cat /proc/cpuinfo

Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 ida arat

The compile command:

g++ -DEIGEN_FFTW_DEFAULT -isystem /toolchain/library/gtest/1.10.0/include -isystem /toolchain/library/glog/0.4.0/include -isystem /toolchain/library/eigen/3.3.7/include/eigen3 -isystem /toolchain/library/eigen/3.3.7/include/eigen3/unsupported -isystem /toolchain/library/boost/1.72.0/include -isystem /toolchain/library/fftw/3.3.8/include -isystem /toolchain/library/opencv/2.4.13.6/include -isystem /toolchain/library/nlopt/2.6.2/include -Wno-unused-local-typedefs -Werror -Wall -std=c++0x -fPIC -march=native -Ofast -DNDEBUG -std=gnu++11 ......

The test cases are listed below:

TEST(ComplexMul, MatrixFloat2) {  // test for Matrix * complex_value
    using test_type = float;
    const complex<test_type> kC1(3.4, 4.3);
    Matrix<complex<test_type>, Dynamic, Dynamic, RowMajor> m1, m3(kMatRowNum, kMatColNum);
    m1 = Matrix<complex<test_type>, Dynamic, Dynamic, RowMajor>::Random(kMatRowNum, kMatColNum);

    bpt::ptime tm_begin1 = bpt::microsec_clock::local_time();

    m3 = m1 * kC1;

    bpt::ptime tm_end1 = bpt::microsec_clock::local_time();
    bpt::time_duration dur1 = tm_end1 - tm_begin1;

    ostream_color::Modifier_C red(ostream_color::FG_GREEN);
    ostream_color::Modifier_C def(ostream_color::FG_DEFAULT);
    cout << red << "ComplexMul.MatrixFloat2 duration: " << dur1.total_milliseconds() << " ms";
    cout << def << endl;

    cout << m3.block(0, 0, 3, 3) << endl;
}

TEST(ComplexMul, VectorFloat) {  // test for std::vector * complex_value
    using test_type = float;
    const complex<test_type> kC1(3.4, 4.3);
    Matrix<complex<test_type>, Dynamic, Dynamic, RowMajor> m1 = Matrix<complex<test_type>, Dynamic, Dynamic, RowMajor>::Random(kMatRowNum, kMatColNum);
    std::vector<std::complex<test_type>> vec1(m1.data(), m1.data() + m1.rows() * m1.cols()), vec3(m1.rows() * m1.cols());

    bpt::ptime tm_begin1 = bpt::microsec_clock::local_time();

    for (size_t i = 0; i < vec1.size(); i++) {
        vec3[i] = vec1[i] * kC1;
    }

    bpt::ptime tm_end1 = bpt::microsec_clock::local_time();
    bpt::time_duration dur1 = tm_end1 - tm_begin1;

    ostream_color::Modifier_C red(ostream_color::FG_GREEN);
    ostream_color::Modifier_C def(ostream_color::FG_DEFAULT);
    cout << red << "ComplexMul.VectorFloat duration: " << dur1.total_milliseconds() << " ms";
    cout << def << endl;

    cout << vec3[0] << endl;
}

The test results (a screenshot in the original post) show the Eigen::Matrix case running much slower than the std::vector case.

  • Did you compile eigen with the same optimisation flags? – Alan Birtles Dec 17 '20 at 07:53
  • Thanks, Alan. Eigen is header-only; it is compiled within the test project, so no library linking is needed. – roderick Dec 17 '20 at 08:43
  • Did you try the master branch of Eigen? (Eigen 3.3.x does not fully support AVX512) – chtz Dec 17 '20 at 12:50
  • Using column-major storage fixes this: https://godbolt.org/z/87afh6 Anyway, Eigen should perform similarly using row-major. – dtell Dec 17 '20 at 22:32
  • @chtz Thanks for your information. Actually, the source project is compiled with SSE, not AVX512, because all matrices/vectors are created without 32/64-byte alignment in the source project and the other libraries it uses. – roderick Feb 03 '21 at 06:51

1 Answer


I've written and rewritten (and rewritten) a linear algebra library several times, and I used Blaze, Eigen, and ETL as example code bases. There are a few things to check:

  1. Do you have the relevant external libraries linked into your test and configured properly? For instance, almost every smart expression-template library is able to use Intel's MKL under the hood where appropriate, and Eigen is no different (a sketch follows this list): https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html

  2. Do you have SIMD enabled? On GCC/Clang, you can check whether SSE/AVX/etc. are enabled via the __AVX__ macro and company (see the macro-check sketch after this list). You may also need to compile with special compiler flags to enable them. You'll need to be especially crafty if you are using msvc, since it exposes only coarse /arch flags and few of the corresponding feature macros; msvc does support SIMD intrinsics, though. You can see what your CPU supports using the cpuid instruction on both Intel and AMD. There are some additional considerations to keep in mind aside from CPU support, but I'll let you read the rest here: How to check if a CPU supports the SSE3 instruction set?. After typing this out, I noticed your compiler flags and see you have access to avx512 and are compiling with that flag. This should mean it's enabled, but I thought I'd leave this so you can verify it.

  3. Is this result repeatable with a statically sized matrix? The reason I ask is that perhaps there is a cache-miss problem in the Eigen library. Using statically sized matrices would guarantee contiguous, aligned storage, so you aren't just looking at a bunch of cache misses from scattered storage (see the static-size sketch after this list).

  4. Eigen's abstractions are not zero-cost, and quite honestly aren't amazing at small matrix sizes. My benchmarks on my machine show blaze and etl to be light-years better (with etl even gaining a lot of gpu support). That said, Eigen is better than most. Try larger matrices than what you're using, but small enough to fit in cache. Small matrices barely account for a single AVX512 load instruction, and you're likely to be paying more for using SIMD due to the overhead associated with the logic around it. Larger matrix sizes will really let the SIMD shine.

  5. Do you have multithreading enabled? Eigen is able to use multithreading, but with small matrices this will likely hurt you more than help (see the threading sketch after this list).

  6. How repeatable is this? What's the mean and standard deviation over, say, 10,000 iterations? I'm unfamiliar with your test API, but the timing loop should look something more like this (sketched here with std::chrono):

     // Make sure the cache lines are hot before timing.
     C = B * A;
     const auto start = std::chrono::steady_clock::now();
     for (int i = 0; i < 10000; ++i)
         C = B * A;
     const auto stop = std::chrono::steady_clock::now();
     // Mean time per iteration in milliseconds; <chrono> must be included.
     const double mean_ms =
         std::chrono::duration<double, std::milli>(stop - start).count() / 10000;

You can rerun the above several times to compute a standard deviation of the mean if you need it.
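
For point 1, here is a minimal sketch of requesting Eigen's MKL backend. It assumes MKL is installed and on the include and link paths (e.g. linking against mkl_rt); the 512x512 size is an arbitrary choice:

    // EIGEN_USE_MKL_ALL must be defined before any Eigen header is included.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>

    int main() {
        Eigen::MatrixXf a = Eigen::MatrixXf::Random(512, 512);
        Eigen::MatrixXf b = Eigen::MatrixXf::Random(512, 512);
        Eigen::MatrixXf c = a * b;     // large GEMM, forwarded to MKL where supported
        return c(0, 0) > 0.f ? 0 : 1;  // use the result so it isn't optimized away
    }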
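
For point 2, here is a minimal sketch that prints which SIMD feature macros the compiler defines under your current flags (GCC/Clang macro names; msvc uses different ones):

    #include <iostream>

    int main() {
    #ifdef __SSE2__
        std::cout << "SSE2 enabled\n";
    #endif
    #ifdef __AVX__
        std::cout << "AVX enabled\n";
    #endif
    #ifdef __AVX2__
        std::cout << "AVX2 enabled\n";
    #endif
    #ifdef __AVX512F__
        std::cout << "AVX512F enabled\n";
    #endif
    }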
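
For point 3, here is a minimal sketch of the same scalar multiply with a statically sized matrix; 32x32 is an arbitrary size chosen to fit comfortably in cache:

    #include <Eigen/Dense>
    #include <complex>
    #include <iostream>

    int main() {
        using Mat = Eigen::Matrix<std::complex<float>, 32, 32, Eigen::RowMajor>;
        const std::complex<float> k(3.4f, 4.3f);
        Mat m1 = Mat::Random();         // fixed-size matrices live on the stack
        Mat m3 = m1 * k;                // same coefficient-wise scalar multiply
        std::cout << m3(0, 0) << "\n";  // use the result so it isn't optimized away
    }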
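
For point 5, here is a minimal sketch that pins Eigen to a single thread to rule threading overhead in or out (this only matters if Eigen was built with OpenMP enabled):

    #include <Eigen/Core>
    #include <iostream>

    int main() {
        Eigen::setNbThreads(1);  // force single-threaded execution
        std::cout << "Eigen threads: " << Eigen::nbThreads() << "\n";
    }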

I was going to run an alternative benchmark, but the results would be absolutely meaningless since my machine is so drastically different from yours. I do not have the same instruction set, operating system, STL implementation, or external dependencies; I do not even use the same compiler. Normally I would say it's fine to somewhat disregard this, but these libraries take advantage of everything they can, and as a consequence, the results you see will in no way reflect what I (or most users) would necessarily see.

EDIT: I ran a benchmark with my setup. My setup currently uses 16 threads on an Intel i9, the MSVC compiler, c++latest, and avx512. One thing to point out is that your matrix size is much, much larger than the cache size of the processor. Therefore, the behavior I expected was that Eigen should converge to roughly the same performance as other libraries, since loading from RAM becomes the main bottleneck, not the calculation speed.

That said, blaze was performing significantly better in my setup for your particular test case. Even my own library was doing better in this region (Eigen generally does better than my library in most areas, so this is surprising behavior to me). I am seeing a roughly 2x performance degradation on my machine compared to the raw for loop. I did some digging and came across this: https://eigen.tuxfamily.org/bz/show_bug.cgi?id=1765

It seems that my particular setup is plagued by a bug that's specific to c++17 and later on msvc. I cannot rectify this easily, since my test setup requires c++20 support. Once you've verified that the optimizations are in place and your problem still persists, you might try reaching out to the Eigen guys and filing a bug report. If I understand the problem correctly, msvc doesn't have a way to forcefully inline assembly and intrinsics; the best you can do is strongly suggest that the compiler inline the code. GCC/Clang should not have this problem, though, and I see you're compiling with C++11. Perhaps they understood the bug incorrectly, or it is unrelated to the msvc bug. Either way, the Eigen team could likely benefit from your feedback.

  • @Christopher, thanks for your comprehensive answer; I learned a lot from what you wrote. 1, Intel's MKL is included, but I'm not sure it is configured properly; I will check it later. 2, Some information was missing from the question: the target Eigen::Matrix is 8192*8192, and the std::vector has the same 8192*8192 element count. 3, The Intel Xeon server runs CentOS, and g++ 5.4 is used with C++11. I have checked that __SSE__/.../__AVX512__ are defined by adding the compile flag -march=native. – roderick Dec 17 '20 at 11:21
  • 4, You said "Eigen's abstractions are not zero-cost"; that is exactly what I want to know: how much do they cost? 5, The test cases are repeatable, and I have run them several times. – roderick Dec 17 '20 at 11:21
  • What's more: 1, When compiled with -O2 -DNDEBUG, multiplication on std::vector is slower than Eigen::Matrix because of a soft math function call (__mulsc3). 2, Changing compile flags has no effect on Eigen::Matrix. – roderick Dec 17 '20 at 11:25
  • One way I've used to debug things like this in my own library is to look at the assembly output. How does the assembly between your two tests compare? Perhaps that could give you a clue. Maybe std::vector is generating SIMD. Or maybe Eigen is not branching into the MKL appropriately. Or perhaps Eigen just has suboptimal std::complex support. I can verify the latter in my test setup later, but heed my warning above regarding apples and oranges. – Christopher Mauer Dec 17 '20 at 14:03
  • @roderick, see my edit for further suggestions. I am leaning toward possibly confirming your problem is internal to the library, but as I said, it is apples and oranges with this stuff sometimes. – Christopher Mauer Dec 17 '20 at 18:51
  • Sorry for the late reply; I doubted this could be solved :( so I stopped the task. – roderick Feb 03 '21 at 07:06
  • Not to reopen the issue, but can you use another high performance library? I personally like blaze and ETL if you're open to suggestions. – Christopher Mauer Feb 03 '21 at 23:40
  • Thanks for your expert suggestion! I'll try blaze and ETL. And I will close this topic. – roderick Feb 04 '21 at 02:39