I'm using a SIMD implementation that replicates the STL's std::transform. All the vectors I'm using are aligned.

When using 3 separate vectors for the transform, the performance of the SIMD transform (which uses _mm512_and_si512) is identical to that of std::transform. However, if instead I use one vector for the two input ranges, I get a 1.33x speedup using SIMD. The speedup is ~3x when I use the same vector for all transform arguments.

std::transform(first1, last1, first2, d_first, 
    [](const auto& a, const auto& b) {return a & b;}); // Identical performance using SIMD

std::transform(first1, last1, first1 + 1, d_first, 
    [](const auto& a, const auto& b) {return a & b;}); // 1.33x speedup using SIMD

std::transform(first1, last1, first1 + 1, first1, 
    [](const auto& a, const auto& b) {return a & b;}); // ~3x speedup using SIMD

Why is there no performance difference with 3 separate vectors? Is it just a coincidence that the SIMD version performs identically to the non-SIMD one in that case?
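
Before comparing timings, it may help to confirm that the overlapping in-place call gives the same results whether elements are processed one at a time or in vector-sized chunks. The following is only a sketch: `chunked_and` and `overlapping_case_matches` are hypothetical names, the chunk width of 8 and the `std::uint64_t` element type are assumptions (the question does not state the element type), and the chunked loop merely imitates a load/AND/store sequence without using any intrinsics.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a SIMD transform: it reads a whole chunk of W
// elements from both inputs before writing W results, the way one vector
// load / AND / store sequence would.
template <std::size_t W>
void chunked_and(const std::uint64_t* a, const std::uint64_t* a_end,
                 const std::uint64_t* b, std::uint64_t* out) {
    const std::size_t n = static_cast<std::size_t>(a_end - a);
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        std::uint64_t x[W], y[W];
        std::copy(a + i, a + i + W, x);  // "vector load" of the first input
        std::copy(b + i, b + i + W, y);  // "vector load" of the second input
        for (std::size_t k = 0; k < W; ++k) out[i + k] = x[k] & y[k];
    }
    for (; i < n; ++i) out[i] = a[i] & b[i];  // scalar tail
}

// Runs the third call pattern from the question (second input is
// first1 + 1, output is first1) both ways and reports whether they agree.
bool overlapping_case_matches(std::size_t n) {
    std::vector<std::uint64_t> ref(n + 1);  // one extra element for first1 + 1
    for (std::size_t i = 0; i <= n; ++i) ref[i] = 0x9E3779B97F4A7C15ull * (i + 1);
    std::vector<std::uint64_t> chunked = ref;

    std::transform(ref.data(), ref.data() + n, ref.data() + 1, ref.data(),
                   [](const auto& a, const auto& b) { return a & b; });
    chunked_and<8>(chunked.data(), chunked.data() + n,
                   chunked.data() + 1, chunked.data());
    return ref == chunked;
}
```

In this particular pattern every store lands at an index below anything that is still to be loaded, so a forward chunked loop reads only original values. A real SIMD implementation that iterates differently would need its own check.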

Peter Cordes
Throckmorton
  • Compilers can optimize code to use SIMD instructions. Have you checked the assembly to see if that is what the *non-SIMD code* is doing? – NathanOliver Jun 25 '20 at 14:36
  • Also, did you check if the results are the same in both cases? Your SIMD implementation might assume that no aliasing happens. – chtz Jun 25 '20 at 15:15
  • Did you check alignment? AVX512 does better when input vectors are 64-byte aligned. What compiler and options are you using? GCC's default for `-march=skylake-avx512` is `-mprefer-vector-width=256` (to avoid turbo clock-speed penalties and other effects). How big are your vectors? Do they fit in L1d cache? Are you sure you did warm-up runs so you're not having page faults inside one of the timed regions ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes Jun 25 '20 at 21:01
  • @PeterCordes The vectors are indeed aligned, and the Google Benchmark run went through thousands of iterations, so I'm not sure how much page faults at the beginning would affect this. The vectors were ~1MB – Throckmorton Jun 25 '20 at 21:08
  • Ok, then probably warm-up effects weren't a problem. But you didn't say anything about your testing methodology (or hardware or compiler) in your question. Experience on Stack Overflow has shown that a good fraction of performance questions are explained by basic errors like that, or compiler quirks, so it's important to rule those out by including details in your question. – Peter Cordes Jun 25 '20 at 21:14
  • Looking at this again, probably the 3rd case blocked auto-vectorization because of overlap. Also, the `first1 + 1` inevitably means not all your loads can be aligned; depending on compiler options, your compiler might have picked a different strategy. You haven't shown element sizes (byte, uint64_t?) so IDK whether 3x is close to the 4x 64-bit elements per vector, or whether you have 32x 8-bit elements. – Peter Cordes Jun 25 '20 at 22:44
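
The element counts and the alignment point in the last comment can be made concrete with a small sketch. The helper names `lanes`, `offset_in_64_bytes`, and `misalignment_after_plus_one` are hypothetical, and the buffer is a stand-in for the question's vectors, assumed to be 64-byte aligned. The counts of 4 and 32 elements correspond to 256-bit vectors; a full 512-bit vector holds twice as many.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one vector of VectorBytes bytes.
template <typename T, std::size_t VectorBytes>
constexpr std::size_t lanes() { return VectorBytes / sizeof(T); }

static_assert(lanes<std::uint64_t, 32>() == 4,  "256-bit vector, 64-bit elements");
static_assert(lanes<std::uint8_t, 32>() == 32,  "256-bit vector, 8-bit elements");
static_assert(lanes<std::uint64_t, 64>() == 8,  "512-bit vector, 64-bit elements");
static_assert(lanes<std::uint8_t, 64>() == 64,  "512-bit vector, 8-bit elements");

// Byte offset of p from the previous 64-byte boundary (0 means aligned).
inline std::size_t offset_in_64_bytes(const void* p) {
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 64);
}

// For a 64-byte-aligned buffer of T, returns how far `first1 + 1` sits
// from a 64-byte boundary: always sizeof(T) when sizeof(T) < 64.
template <typename T>
std::size_t misalignment_after_plus_one() {
    alignas(64) static T buf[64] = {};  // stand-in for an aligned vector
    return offset_in_64_bytes(buf + 1);
}
```

So even with perfectly aligned storage, the second input range starting at `first1 + 1` can never be loaded with aligned 64-byte loads, whatever the element size.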