I am trying to vectorize the following function with clang according to this clang reference. It takes a vector of bytes and applies a mask to it as described in this RFC.

static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) {
  #pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < payload.size(); i++) {
    payload[i] = payload[i] ^ masking_key[i % 4];
  }
}

The following flags are passed to clang:

-O3
-Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize

However, the vectorization fails with the following error:

WebSocket.cpp:5:
WebSocket.h:14:
In file included from boost/asio/io_service.hpp:767:
In file included from boost/asio/impl/io_service.hpp:19:
In file included from boost/asio/detail/service_registry.hpp:143:
In file included from boost/asio/detail/impl/service_registry.ipp:19:
c++/v1/vector:1498:18: remark: loop not vectorized: could not determine number
      of loop iterations [-Rpass-analysis]
    return this->__begin_[__n];
                 ^
c++/v1/vector:1498:18: error: loop not vectorized: failed explicitly specified
      loop vectorization [-Werror,-Wpass-failed]

How do I vectorize this for loop?

  • What happens if you capture the size of the vector outside the loop and use that for the loop condition? – NathanOliver May 20 '16 at 16:14
  • Tried that already. Does not help! – rahul May 20 '16 at 16:16
  • This loop looks trivial to vectorize. Have you checked whether the compiler does it implicitly with plain `-O3`? – Baum mit Augen May 20 '16 at 16:20
  • I did and checked with the -Rpass-analysis=loop-vectorize flag. It does not vectorize implicitly, which is why I added the explicit #pragma. – rahul May 20 '16 at 16:24
  • I wonder if it's an aliasing issue - can you try applying `restrict` (and/or `const`) to `uint8_t (&masking_key)[4]`? – Paul R May 20 '16 at 16:30 (see the sketch after these comments)
  • @PaulR `const` will probably not help as one can have `const&` to non `const` data. `restrict` is worth a shot though. – Baum mit Augen May 20 '16 at 16:32
  • If `restrict` doesn't help then the next thing I would try is unrolling the loop by a factor of 4 and eliminate the `% 4`. – Paul R May 20 '16 at 16:35
  • Using an `std::array` passed by value for the key would also eliminate all potential aliasing problems. – Baum mit Augen May 20 '16 at 16:37
  • @BaummitAugen The unrolling of the loop worked. I will now submit an answer. – rahul May 20 '16 at 16:53
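
For reference, here is a minimal sketch of the `restrict` idea from the comments above, using clang's `__restrict` extension in C++; the function name is illustrative and not from the original post. Per the follow-up comments, this alone still did not get clang to vectorize the loop:

#include <cstddef>
#include <cstdint>
#include <vector>

static void apply_mask_restrict(std::vector<uint8_t> &payload,
                                const uint8_t (&masking_key)[4]) {
  uint8_t *__restrict p = payload.data();  // hoist the data pointer out of operator[]
  const size_t n = payload.size();         // hoist the size so the trip count is a plain local
  #pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < n; i++) {
    p[i] = p[i] ^ masking_key[i % 4];
  }
}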

1 Answer

Thanks to @PaulR and @PeterCordes. Unrolling the loop by a factor of 4 works.

void apply_mask(vector<uint8_t> &payload, const uint8_t (&masking_key)[4]) {
  const size_t size = payload.size();
  const size_t size4 = size / 4;
  size_t i = 0;
  uint8_t *p = &payload[0];
  uint32_t *p32 = reinterpret_cast<uint32_t *>(p);
  const uint32_t m = *reinterpret_cast<const uint32_t *>(&masking_key[0]);

#pragma clang loop vectorize(enable) interleave(enable)
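  // Main loop: XOR the payload four bytes at a time as one 32-bit word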
  for (i = 0; i < size4; i++) {
    p32[i] = p32[i] ^ m;
  }

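  // Scalar tail: mask the remaining 0-3 bytes that don't fill a full word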
  for (i = (size4*4); i < size; i++) {
    p[i] = p[i] ^ masking_key[i % 4];
  }
}

gcc.godbolt code
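
For completeness, here is a sketch of the memcpy-based variant Peter Cordes describes in the comments below; `memcpy` sidesteps the strict-aliasing and alignment questions raised by the `reinterpret_cast`, and the fixed-size copies optimize away completely on x86. The function name is illustrative and not from the original post:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static void apply_mask_memcpy(std::vector<uint8_t> &payload,
                              const uint8_t (&masking_key)[4]) {
  uint32_t m;
  std::memcpy(&m, masking_key, sizeof(m));    // pack the 4 key bytes into one word

  uint8_t *p = payload.data();
  const size_t size = payload.size();
  const size_t size4 = size / 4;

#pragma clang loop vectorize(enable) interleave(enable)
  for (size_t i = 0; i < size4; i++) {
    uint32_t w;
    std::memcpy(&w, p + 4 * i, sizeof(w));    // alignment-safe 4-byte load
    w ^= m;
    std::memcpy(p + 4 * i, &w, sizeof(w));    // store the masked word back
  }

  for (size_t i = size4 * 4; i < size; i++) { // scalar tail: the last 0-3 bytes
    p[i] = p[i] ^ masking_key[i % 4];
  }
}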

  • That was [Paul's idea](https://stackoverflow.com/questions/37351236/vectorize-a-function-in-clang#comment62218869_37351236) actually. :) – Baum mit Augen May 20 '16 at 16:56
  • Unrolling is the only way to get it vectorized; I tried all the compilers on gcc.godbolt and none of them was able to vectorize it without unrolling. – CoffeDeveloper May 20 '16 at 16:58
  • Can you link to http://gcc.godbolt.org/ with vectorized output? [The only "vectorization" I'm ever getting with clang is a bunch of `vpaddq` (purpose unknown) and 32x `vpinsrb` / `vpxor` / 32x `vpextrb`.](https://godbolt.org/g/RHrrg9) This is obviously worse than scalar. Even when avoiding aliasing by getting the masking keys and `payload.data()` into locals, it still doesn't want to autovectorize (but at least it avoids reloading the pointer from payload after every byte store). – Peter Cordes May 20 '16 at 17:48
  • That's what I found too; I had to put all 4 keys into a `uint32_t`. (But I used memcpy instead of a reinterpret cast; either way optimizes away completely on x86.) Your godbolt link doesn't enable optimizations, and has `static` on the function so we can't see the asm for the function itself. Anyway, [here's what you should have posted for a godbolt link](https://godbolt.org/g/hz2dqW). Don't forget to update the code in your answer, since the code in the code block is still useless. – Peter Cordes May 20 '16 at 18:24
  • @PeterCordes Thanks a lot. This really helps. BTW, do you see any way to optimize this further? Profiling still shows this as the bottleneck. – rahul May 20 '16 at 18:33
  • @rahul: if you have a Haswell or newer, build with -march=native to let it use AVX2. The asm in the link I posted looks pretty much optimal for SSE, and should sustain one 16B vector per clock *if the input buffer is 16B-aligned*. With `-march=haswell`, it does the same but with 32B vectors. With `-march=sandybridge`, it avoids unaligned 32B loads/stores, instead doing them in 16B halves. (The `vextractf128` to a register instead of directly to memory looks like a bad idea. Same number of fused-domain uops, but competes for port5 with `vxorps` :/) – Peter Cordes May 20 '16 at 18:40
  • @PeterCordes Thanks. I missed the issue with alignment. – rahul May 20 '16 at 18:44
  • It looks like you both unrolled, and took a raw pointer instead of calling `std::vector::operator[]`. Are you sure it was the unrolling, and not the dereferencing `this->__begin_` that solved your problem? – xaviersjs Nov 14 '18 at 01:10