
I have a vector of ints and I need to find elements with a specific value and replace them with another value; both the value to search for and the replacement value are single fixed ints.
For example: replace 4 with 8 for all elements.

I'm accessing the memory directly in a loop in C++, but it is still too slow for me.

Update:
I'm working with an OpenCV Mat object on x86:

for (int i = 0; i < labels.rows; ++i) {
    for (int j = 0; j < labels.cols; ++j) {
        int& label = labels.at<int>(i, j);
        if (label == oldValue) {
            label = newValue;
        }
    }
}

In release mode, the Mat::at() function just returns a reference to the element:

template<typename _Tp> inline
_Tp& Mat::at(int i0, int i1)
{
    CV_DbgAssert(dims <= 2);
    CV_DbgAssert(data);
    CV_DbgAssert((unsigned)i0 < (unsigned)size.p[0]);
    CV_DbgAssert((unsigned)(i1 * DataType<_Tp>::channels) < (unsigned)(size.p[1] * channels()));
    CV_DbgAssert(CV_ELEM_SIZE1(traits::Depth<_Tp>::value) == elemSize1());
    return ((_Tp*)(data + step.p[0] * i0))[i1];
}
victor1234
  • What does your current code look like? – 1201ProgramAlarm Jan 15 '18 at 00:41
  • Please [edit] your question to show [the code you have so far](http://whathaveyoutried.com). You should include at least an outline (but preferably a [mcve]) of the code that you are having problems with, then we can try to help with the specific problem. You should also read [ask]. – Toby Speight Jan 15 '18 at 12:52
  • @1201ProgramAlarm Thanks. Updated – victor1234 Jan 15 '18 at 17:45

2 Answers


You didn't mention what architecture you're developing for, so it's impossible to tell you which intrinsics to use. Luckily your compiler should be able to auto-vectorize something like

for (int i = 0 ; i < N ; i++)
  foo[i] = (foo[i] == 4) ? 8 : foo[i];

Assuming your data is sufficiently aligned, with -mavx2 -O3 GCC will use vpcmpeqd and vpblendvb.
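
Applied to the labels Mat from the question, a minimal sketch of the same idea (assuming a single-channel CV_32S matrix and using cv::Mat::ptr to walk each row; oldValue/newValue as in the question) would be:

for (int i = 0; i < labels.rows; ++i) {
    int* row = labels.ptr<int>(i);              // raw pointer to row i
    for (int j = 0; j < labels.cols; ++j) {
        // Always write the element back so the compiler is free to
        // vectorize this as load + compare + blend + store.
        row[j] = (row[j] == oldValue) ? newValue : row[j];
    }
}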

nemequ
  • One issue I can see with the code as shown is that you're now writing back values that are unchanged (or rather, a vectorised optimisation may well be doing so). This is going to trigger a cache writeback (see https://stackoverflow.com/questions/47417481/what-specifically-marks-an-x86-cache-line-as-dirty-any-write-or-is-an-explici), which is going to consume memory bandwidth and, by consuming a generally limiting resource, will often slow things down. If I were to vectorise this with intrinsics I'd only actually write back if I knew that one of the values in the register was being modified. – Tim Jan 15 '18 at 16:47
  • I have a hard time believing that checking to see if any values actually changed (i.e., making the `vpblendvb` conditional) would be a performance win, but it's worth a try if you want to use intrinsics instead of relying on the compiler to auto-vectorize. The answer may depend on the frequency of matches… – nemequ Jan 15 '18 at 20:51
  • If your machine supports AVX-512 (unlikely) you could use that; it supports vectors which are 512 bits instead of 256 bits (like AVX2), which is already twice as large as the 128 bits available in SSE2 (which the compiler should generate for any x86 target). Depending on how much data you have, the best way to speed it up further would probably be to use multiple threads to operate on different slices of the data simultaneously. With OpenMP this would be almost trivial, but even using pthreads/Win32/C11/etc. it is simple enough that it wouldn't be too difficult. – nemequ Jan 15 '18 at 20:56
  • AVX512 is really nice for this: with a masked store you can avoid dirtying the cache line in the no-modification case, without branching. Actually, you could do the same thing with [AVX2 `vpmaskmovd`](https://github.com/HJLebbink/asm-dude/wiki/VPMASKMOV), or AVX1 `vmaskmovps` (@Tim). Just don't use SSE `maskmovdqu`; it's very slow (and implicitly NT so it flushes the cache line). Modern CPUs (like Haswell / Skylake) run `vmaskmov` efficiently, especially if masked stores aren't forwarding to loads right away. – Peter Cordes Jan 16 '18 at 11:07
  • @Tim: wrote this up into an answer. – Peter Cordes Jan 16 '18 at 16:17

The key to letting the compiler auto-vectorize is to always assign to the element, even if you assign it to itself. (The ternary operator is good here, see @nemequ's answer). This lets the compiler do a read / rewrite of unchanged values, so it can vectorize with a load + compare and blend + store.

The compiler can't invent writes to memory locations that the C++ source doesn't write to, because that could step on writes from another thread. It's not a data race for different threads to read/write adjacent array elements. If another function the compiler doesn't know about was also using a vector-load / blend / store loop with a different search/replace value, their stores would step on each other. So this vectorization strategy only works if the source writes all the elements. The compiler is free to optimize that away (e.g. if it doesn't vectorize).
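
As a concrete illustration of that constraint, here is a minimal sketch (the function names are just for illustration) of the two forms:

// Branchy form: elements that don't match are never written, so the
// compiler can't legally turn this into unconditional vector stores.
void replace_branchy(int* foo, int n) {
    for (int i = 0; i < n; ++i)
        if (foo[i] == 4) foo[i] = 8;
}

// Always-store form: every element is written, so a load / compare /
// blend / store loop is a legal transformation.
void replace_always_store(int* foo, int n) {
    for (int i = 0; i < n; ++i)
        foo[i] = (foo[i] == 4) ? 8 : foo[i];
}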


Comments on the other answer point out the down-side of unconditionally storing: it dirties the cache even if the data doesn't change. If search hits are rare, it could be worth branching to skip the store and save memory bandwidth, especially if multiple threads will be running this over large blocks of memory. That includes multiple instances of the program running on the same machine, and especially applies in a shared-memory situation.

AVX introduced masked-store instructions which solve this problem. AVX2 vpmaskmovd and AVX1 vmaskmovps both have 32-bit granularity, so you can use them directly for int data. For narrower elements, you could compare+blend with byte or word granularity, then check for changes with dword granularity to generate a mask.

I think the implementation of vpmaskmovd (in Skylake at least) really does avoid dirtying the cache line when the mask is all-0. Intel's optimization manual (section 11.9, CONDITIONAL SIMD PACKED LOADS AND STORES) says of a masked store followed by any reload: if the mask is all 0, the loads do not depend on the masked store. So the store queue knows that an all-zero mask makes the store a no-op.

I haven't tested, but I expect it avoids dirtying the cache line in this case, at least on Skylake (including Skylake-client which doesn't support AVX512; but it does have the microarchitectural features that AVX512 needs, like efficient masked stores). Masked elements are even allowed to touch illegal addresses without faulting, and some CPUs can do that (at least for the all-zero-mask case) without trapping for a microcode assist. So that would mean they have a way to squash the store entirely.

So the asm you'd want the compiler to make (via intrinsics or auto-vectorization) is:

 ;; outside the loop:  ymm4 = set1_epi32(4);  ymm5 = set1_epi32(8);


vpcmpeqd    ymm0, ymm4, [rdi]     ; _mm256_cmpeq_epi32: ymm0 = -1 where data == 4
vpmaskmovd  [rdi], ymm0, ymm5     ; store 8 in elements where ymm0 is -1
add         rdi, 32

I haven't benchmarked this to see if it's actually faster (or at least equal when memory bandwidth isn't a bottleneck, which would be an easier microbenchmark to design).
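
With intrinsics, a sketch of a loop that should compile to roughly that sequence (AVX2; unaligned loads, the scalar tail, and the function name are just for illustration):

#include <immintrin.h>
#include <cstddef>

void replace_maskstore(int* p, std::size_t n, int oldVal, int newVal)
{
    const __m256i vold = _mm256_set1_epi32(oldVal);
    const __m256i vnew = _mm256_set1_epi32(newVal);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v    = _mm256_loadu_si256((const __m256i*)(p + i));
        __m256i mask = _mm256_cmpeq_epi32(v, vold);   // vpcmpeqd: -1 where p[i..] == oldVal
        _mm256_maskstore_epi32(p + i, mask, vnew);    // vpmaskmovd: write only matching lanes
    }
    for (; i < n; ++i)                                // scalar cleanup for the remainder
        if (p[i] == oldVal) p[i] = newVal;
}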

A vpmaskmovd store is only 3 uops on Skylake (p0 + store-address + store-data). It's 4 uops on Haswell.

According to Agner Fog's testing, vmaskmovps-store is 4 uops on Skylake. It's very strange that it doesn't match the integer instruction that behaves identically.

Using a conditional masked store means you don't need the original data, so it allows folding the load into the vpcmpeqd. A load + cmp+blend + store sequence would need 1 + 2 + 1 instructions, and vpblendvb is 2 uops (so is vblendvps). So masked stores are, in theory, faster.

vpblendvb on Haswell can only run on port 5, so that would limit you to processing 32 bytes every other clock, instead of one vector per 1.25 clocks (with an infinite unroll). Most of the time 32 bytes per 2 clocks is fine, but if your data is hot in L1D cache then it's a bottleneck.


With AVX512, you'd probably implement it the same way, but with AVX512BW you could use the same masked-store strategy for smaller granularity than 32-bit. Compare into k1, and vmovdqu8 [mem]{k1}, zmm8
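
For example, a byte-granularity sketch with intrinsics (AVX512BW; the function name, unaligned loads, and scalar tail are just for illustration):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void replace_bytes_avx512(std::uint8_t* p, std::size_t n,
                          std::uint8_t oldVal, std::uint8_t newVal)
{
    const __m512i vold = _mm512_set1_epi8((char)oldVal);
    const __m512i vnew = _mm512_set1_epi8((char)newVal);
    std::size_t i = 0;
    for (; i + 64 <= n; i += 64) {
        __m512i   v = _mm512_loadu_si512(p + i);
        __mmask64 k = _mm512_cmpeq_epi8_mask(v, vold);  // compare into a mask register (k1)
        _mm512_mask_storeu_epi8(p + i, k, vnew);        // vmovdqu8 [mem]{k}, zmm: only matching bytes are written
    }
    for (; i < n; ++i)                                   // scalar cleanup for the remainder
        if (p[i] == oldVal) p[i] = newVal;
}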


Without AVX: DO NOT USE SSE maskmovdqu; it's slow, and implicitly NT so it flushes the cache line, and all that. Use load+blend+store.

Peter Cordes