Find index of unaligned int or long in byte array using SIMD

Question

I have a byte sequence that I want to scan to find index of an integer (or long) value. It can be at any byte offset, not necessarily a multiple of the size. Specifically I am interested in first occurence but an example for all indexes will also be helpful.

If it's not possible I guess I need to convert long into Vector<byte> of 8 byte length than compare two.

Platform is X86. I can constrain app to run only x64 mode.
I need fastest possible way so a code snippet would be great.
I know its an easy question but couldn't find an example (in C# at least).

You haven't constrained the situation sufficiently. Does *any* occurrence of the byte sequence representing the long or int value suffice or should it match alignment constraints? What endianness considerations are appropriate here? — Damien_The_Unbeliever, Jul 30 '22 at 18:26
"i need fastest possible way" - no, you do not. That is almost never an actual requirement. Realistic systems have performance *goals* and are measured against those goals, and so long as you're meeting those goals, it shouldn't matter how you achieve them. Readability almost always trumps fast. — Damien_The_Unbeliever, Jul 30 '22 at 18:28
@Damien_The_Unbeliever it will be executed billions of times. and i almost always need this kind of stuff. having a utility method which is proven correct and encapsulates comlexity wont hurt nobody. — ömer hayyam, Jul 30 '22 at 18:29
"It will be executed billions of times" - well, that could be interesting. But over what timescale? Over the entire lifetime of the universe, that could mean years between executions. You need to realize that setting *specific* goals trumps any "as fast as possible" aspirations. — Damien_The_Unbeliever, Jul 30 '22 at 18:32
Damien_The_Unbeliever edited question. its x86. and what you mean by alignment? byte array itself is of course obviously aligned. the position of long obviosly can be anywhere within. — ömer hayyam, Jul 30 '22 at 18:32
@Damien_The_Unbeliever it will be running 7/24 and wanting reasonably fastest possible way shouldnt hurt anybody. — ömer hayyam, Jul 30 '22 at 18:34
By alignment, I mean should 3 rubbish bytes, then 4 bytes that match your required encoding be a match, or should we only be considering a set of 4 bytes that appear at 4 byte boundaries be considered. These are *basic* considerations. — Damien_The_Unbeliever, Jul 30 '22 at 18:35
@Damien_The_Unbeliever than i would consider searching Vector right? btw array length is not multiple of 8 but it can be truncated and manually searched i guess. — ömer hayyam, Jul 30 '22 at 18:37
This is basically a substring-search like C `strstr` or actually `memmem`, for a fixed-length substring. SSE4.2 `pcmpistri` might possibly be useful, although you might need to use the slower `pcmpestri` if you can't guarantee your array doesn't have any `0` bytes. (Or I guess also the key you're looking for.) https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 — Peter Cordes, Jul 30 '22 at 18:40
@Damien_The_Unbeliever: If someone's interested in manually vectorizing something with SIMD, it's not helpful to nitpick over wording like "fastest possible" vs. "fastest asm you can conveniently get the .NET JIT to make"; that's kind of implicit. This is a self-contained enough problem that's big enough to optimize on its own (instead of trying to do it on the fly as part of a larger loop), and interesting and non-trivial. And not something CPUs can do easily, so I'd expect some manual effort coming up with a clever strategy could pay off and save some CPU time. — Peter Cordes, Jul 30 '22 at 18:46
@PeterCordes - when they haven't nailed down the requirements yet, it is. Because they're not yet at the point of knowing it's a place worth optimizing.. I recognize we may disagree here but do you see where I'm coming from? — Damien_The_Unbeliever, Jul 30 '22 at 18:50
@Damien_The_Unbeliever: The request for clarification on alignment details was important and necessary, and we did eventually get that. It's an interesting Stack Overflow question without them having to justify that the effort is warranted in their application. We don't require SIMD questions to justify their existence with a performance requirement, especially not fairly general questions like this which could easily useful to future readers. — Peter Cordes, Jul 30 '22 at 18:57
[What is the fastest substring search algorithm?](https://stackoverflow.com/q/3183582) might be relevant i. Of course the most straightforward thing is just to do 4 loads each offset by 1 byte from each other, for 4x `vpcmpeqd`, then go forward 16 or 32 bytes and do it again. If you had an 8-byte needle to search for, that's 8 different offsets, so a lot more overlap, and maybe worth checking for the first 2 bytes appearing at either of 2 offsets as a quick early-out for most vectors. (With `vpcmpeqw`) — Peter Cordes, Jul 30 '22 at 19:00
@PeterCordes do you say long is not special and i should treat it like a 8byte array? — ömer hayyam, Jul 30 '22 at 19:09
Right, integers are just sequences of bytes. There are instructions for doing math on them as a whole, but that's not what you're doing. You're looking for a sequence of 8 bytes, anywhere in your byte array. Since you don't know it's stored at a multiple-of-8 start position. Of course 8 is a special length, there are instructions like SSE4.1 [`pcmpeqq`](https://www.felixcloutier.com/x86/pcmpeqq) that compare in 8-byte chunks, and similarly for 1, 2, and 4-byte lengths. But the fact that those 8 bytes represent an integer value instead of a string or double is irrelevant. — Peter Cordes, Jul 30 '22 at 19:14
@PeterCordes "instructions like SSE4.1 pcmpeqq" probably this is what im looking for. And a C# code snippet could probably save me hours because currently i have no time to learn even basics and jumping in the middle seems to be mess. i will search comparing 2 vectors and finding index though. — ömer hayyam, Jul 30 '22 at 19:32
Brute force with just `pcmpeqq` at all 8 possible alignments wouldn't be very efficient, though. That instruction isn't a substring-search, it's just two (or four with AVX2) separate 8-byte compares for exact equality between corresponding 8-byte elements of SIMD vectors. That's why I was suggesting looking for 2-byte match candidates before doing the rest of the work, since x86 can branch fairly efficiently on SIMD compare results. That's why this question isn't just a duplicate or trivially answerable. — Peter Cordes, Jul 30 '22 at 19:37
Do you have any information on which part of your input integers are most likely to be unique, to minimize partial-match candidates? Like are they usually small integers with interesting low bits? Or are they often things like `0x12340000` where the first 2 bytes are all zero, and might appear all over the place in your array? (You did say you want this to be as fast as possible, and 8-byte needles are long enough for simple brute-force not to be a great strategy. For 4-byte `int`, sure, but not 8-byte `long`) — Peter Cordes, Jul 30 '22 at 19:39
Nope integers are fixed value but could be anything. Maybe finding all positions of first 4 bytes and comparing rest manually would beat generic vector comparision. Problem is i have no understanding in this area and cant proceed and see myself. — ömer hayyam, Jul 30 '22 at 19:46
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246918/discussion-between-omer-hayyam-and-peter-cordes). — ömer hayyam, Jul 30 '22 at 19:51
`MemoryExtensions.IndexOf` is probably one of the fastest methods — Charlieface, Jul 31 '22 at 00:35
@Charlieface ive investigated source code, PeterCordes is right, if T is byte or char (1 or 2 bytes long) then they take special path and use vectorisation. otherwise they fallback to kind of brute force. as a sidenote, i guess by far biggest bottleneck is accessing memory so not worth optimizing more. — ömer hayyam, Jul 31 '22 at 17:11

ömer hayyam · Accepted Answer · 2022-08-31T21:37:50.283

1

After reading this blog post i've found a way to do what i want to achieve. I knew i will already have enough junk at the end so i omited to check rest but it can be modified to cover all possible inputs.

        static void Main(string[] args)
    {
        var input = "12345671asdasdasd1asdasdasd2asdasdasd3asdasdasd_12345678asdasdasd1asdasdasd2asdasdasd3asdasdasd_"u8;
        var needle = BitConverter.ToInt64("12345678"u8);
        var ix = IndexOf(input, needle);

    }

public unsafe static int IndexOf(ReadOnlySpan<byte> input, long needle)
    {
        fixed (byte* pInput = input)
        {
            int n = input.Length;
            var vecSearch = Avx2.BroadcastScalarToVector256(&needle);
            for (int i = 0; i <= n - 40; i += 32)
                for (int j = i; j < i + 8; j++)
                {
                    var vecInput = Avx2.LoadVector256((long*)(pInput + j));
                    var mask = Avx2.CompareEqual(vecInput, vecSearch); //
                    var imask = Avx2.MoveMask(mask.AsByte());
                    if (imask != 0)
                    {
                        return j + (int)Bmi1.TrailingZeroCount((uint)imask);
                    }
                }
        }
        return -1;
    }

Here is 32bit version:

public unsafe static int IndexOf32(ReadOnlySpan<byte> input, long needle)
{
    int n = input.Length;
    var vecSearch32 = Avx2.BroadcastScalarToVector256((int*)&needle);
    var vecSearch64 = Avx2.BroadcastScalarToVector256(&needle);
    fixed (byte* pInput = input)
    {

        for (int i = 0; i <= n - 44; i += 32)
            for (int j = i; j < i + 4; j++)
            {
                var vecInput = Avx.LoadVector256((int*)(pInput + j));
                var mask = Avx2.CompareEqual(vecInput, vecSearch32);
                var imask = Avx2.MoveMask(mask.AsByte());
                if (imask != 0)
                {
                    var mask1 = Avx2.CompareEqual(vecInput.AsInt64(), vecSearch64).AsInt32();
                    var mask2 = Avx2.CompareEqual(Avx.LoadVector256((long*)(pInput + j + 4)), vecSearch64).AsInt32();
                    var blend = Avx2.Blend(mask1, mask2, 0xaa);
                    imask = Avx2.MoveMask(blend.AsByte());
                    if (imask != 0)
                        return j + (int)Bmi1.TrailingZeroCount((uint)imask);
                }
            }
    }
    return -1;
}

Here is 16bit version which i couldnt complete:

public int AVXShort()
{
    int n = input.Length;
    fixed (byte* pInput = input)
    fixed (byte* pSearch16 = search16)
    fixed (byte* pSearch64 = search)
    {
        var vecSearch16 = Avx.LoadVector256((short*)pSearch16);
        var vecSearch64 = Avx.LoadVector256((long*)pSearch64);
        for (int i = 0; i <= n - 46; i += 32)
            for (int j = i; j < i + 2; j++)
            {
                var vecInput = Avx.LoadVector256((short*)(pInput + j));
                var mask = Avx2.CompareEqual(vecInput, vecSearch16);
                var imask = Avx2.MoveMask(mask.AsByte());
                if (imask != 0)
                {
                    var mask1 = Avx2.CompareEqual(vecInput.AsInt64(), vecSearch64).AsInt16();
                    var mask2 = Avx2.CompareEqual(Avx.LoadVector256((long*)(pInput + j + 2)), vecSearch64).AsInt16();
                    var mask3 = Avx2.CompareEqual(Avx.LoadVector256((long*)(pInput + j + 4)), vecSearch64).AsInt16();
                    var mask4 = Avx2.CompareEqual(Avx.LoadVector256((long*)(pInput + j + 6)), vecSearch64).AsInt16();
                    var res1 = Avx2.Blend(mask1, mask2, 0xa).AsInt16();
                    
                    var res2 = Avx2.Blend(mask3, mask4, 0xa0).AsInt16();
                    var res4 = Avx2.Blend(res1, res2, 0x0);
                    imask = Avx2.MoveMask(res1.AsByte());
                    if (imask != 0)
                        return j + (int)Bmi1.TrailingZeroCount((uint)imask);
                }
            }
    }
    return -1;
}

edited Aug 31 '22 at 21:37

answered Aug 30 '22 at 20:31

ömer hayyam

173
6

It would be faster to check for any matches in `mask` (with `vpmovmskb`, C++ `_mm256_movemask_epi8` and check for non-zero; I don't know the C# intrinsic for it, but there should be one in Avx.something or Avx2.something). The common case is probably no matches. You want to move on to the next vector quickly, not extract 64-bit integers from the compare mask one at a time. Also, a nested loop would probably be more efficient than an if()else to update `i` by an extra 32 every 8th iteration, unless the compiler unrolls that for you. – Peter Cordes Aug 30 '22 at 20:57
@PeterCordes is it enough to add Avx2.MoveMask(mask) > 0 before actual checking and extracting result? you could write best version in c++ and i can try to convert it to c# – ömer hayyam Aug 30 '22 at 21:25
Once you movemask, you just check the bits in it, like `mask & 1`, `mask & (1<<8)`, etc. Or bit-scan it with an intrinsic for `tzcnt`. But as long as the main loop is just checking the movemask result against zero, that's just optimizing the cleanup once you find a match. – Peter Cordes Aug 30 '22 at 21:32
i think mask & 1 == 1 doesnt tell anything because rest bits of the byte in question can be all zeroes so we have to check anyway. but if mask & 1 == 0 we can be sure that byte has no match and skip whole ulong test?. regarding to cleanup i didnt understan anything. 2min of c++ implementation will make me sleep better – ömer hayyam Aug 30 '22 at 21:44
i need to test mask > 0 first then imask & 0xFF000000 = 0xFF000000 && mask.AsUInt64().GetElement(0) == 0xFFFFFFFFFFFFFFFF ?? – ömer hayyam Aug 30 '22 at 21:49
No, I mean after checking the whole mask to keep looping if no bits are set. Like `if (mask) { if (mask & 1) return i; else if ( mask & 0x100 ) return i+8; etc. }`. Of course you still want to check the whole mask before checking which element had the first match, that's the biggest advantage of being able to do a quick check to keep looping. In C++, see [Searching for the key using SIMD](//stackoverflow.com/a/67228970) for an example, although I wouldn't recommend that many shuffles per movemask if you're looping on it. Without packing, the `__builtin_ctz` bit-scan would be scaled by 8. – Peter Cordes Aug 30 '22 at 21:49
Or [How can I get the compiler to output faster code for a string search loop, using SIMD vectorization and/or parallelization?](https://stackoverflow.com/q/66959237) shows how to use MSVC `_BitScanForward` or GNU `__builtin_ctz`. `_tzcnt_u32` would work, or C++20 `std::countr_zero`. (It's not searching with overlap, but it's doing some thing other than just searching for pairs of bytes. The relevant thing is checking for a match and if so bit-scanning it, otherwise keep looping. That part applies to your code.) – Peter Cordes Aug 30 '22 at 21:58
You don't need `(imask & 0x000000FF) == 0x000000FF`. If `(imask & 0x000000FF)` is non-zero, it will be equal to `0xFF` because it came from `movemask`. But the compiler probably doesn't know that, so you're making it do extra work for no reason. – Peter Cordes Aug 30 '22 at 22:27
But yes, that's the right general idea if you don't care about optimizing the loop exit to use `tzcnt`. – Peter Cordes Aug 30 '22 at 22:33
@PeterCordes "if (imask & 0x000000FF) is non-zero, it will be equal to 0xFF" that's not true in my experiment. i think you forgot we are comparing 8 bytes. ive tested tcz, it gives nonzero values for no matches too. – ömer hayyam Aug 30 '22 at 22:59
Oh, are you using `vpcmpeqb` instead of `vpcmpeqq`? Use 64-bit element size for your compares to avoid false positive matches! In C++ `_mm256_cmpeq_epi64` instead of `epi8`, but C# intrinsics do it by vector type with an overloaded compare function. – Peter Cordes Aug 30 '22 at 23:31
@PeterCordes You're right i've changed vector type to long. MoveMask doesnt accept anything except byte, float, double so ive casted but its probably noop after JITed. hope i dont miss anything to speed this up further. i think this is superb and thank you. it should be used everywhere to compare strings if operand is > 8bytes.. – ömer hayyam Aug 30 '22 at 23:49
There you go, that looks more efficient. Vector casts are free; in the asm they all use the same registers. There are only 3 sizes of movemask: 1-byte (`vpmovmskb` Byte), 4-byte (`vmovmskps` float), and 8-byte (`vmovmskpd` double). Your loop condition might be stopping short, e.g. for a target string of length 35, you wouldn't run the loop at all. I'm not sure about odd sizes like 41 bytes either, whether you could miss a match that ended at the end of the array. So you might either need a separate loop at the end, or use `min(i + 8, n)` as the inner loop bound. – Peter Cordes Aug 30 '22 at 23:55
As I suggested in comments under the question, this brute-force method might not be the best, although it's still better than doing the same algorithm with scalar. Depending on your data, scanning for a matching first 2 or 4 bytes could let you go through the array faster if there aren't a lot of partial matches, only checking 8-byte element size at that offset at a position where there was a 2-byte match. – Peter Cordes Aug 30 '22 at 23:57
@PeterCordes im searching html strings and neither what im looking for is at the end of string. i could search rest with traditional brute force if i ive needed. 2 byte match is highly possible, not only in html but in most regular texts unless its statistically low chance 2 chars to get nearby. dont you think so? – ömer hayyam Aug 31 '22 at 00:03
4 bytes could be better though, ill check when i have more energy. – ömer hayyam Aug 31 '22 at 00:10
Yeah, of course a false positive partial match is possible, but maybe rare enough to help, depending on what you're looking for. It's a tradeoff. In the worst case where every 32-byte vector contains a false positive, you're doing the 2-byte check *and* four 8-byte checks, or one if you take the time to figure out which offset the partial match happened at. But if only 1/10th of the 32-byte vectors contain a false positive 2-byte match, you'd doing about 1/4 the work (2 checks instead of 8) plus another 1/10th to check, plus branch mispredict costs. – Peter Cordes Aug 31 '22 at 00:20
Depends on your data; text may be too regular, leading to a false-positive rate that's too high to be worth it. Also of course, it's extra work on the true-positive case where you exit the loop. If a match within the first 64 bytes is normal, it's probably better to just brute force. – Peter Cordes Aug 31 '22 at 00:22
@PeterCordes im working on 4byte version. false positives make tzc useless and you have to do 8 tests instead. better idea? – ömer hayyam Aug 31 '22 at 00:48
also reading int32's from vector better than getting it directly from input? i think doing it for 2 bytes is totally different approach right? – ömer hayyam Aug 31 '22 at 00:50
If there is an 8-byte full match, it will start at the same place as the 4-byte partial match, assuming you used the first 4 bytes of the search "needle" to look for partial matches. So you could `tzcnt`, but that would introduce extra latency vs. just brute-forcing the two possible offsets when you find a 4-byte partial match. (Or the 4 offsets possible when you find a 2-byte partial match). One of the two vectors to check against is the one you already have, and the other should be loaded from 4 bytes later. You could `vpsrldq` vector-shift instead of another load, but that shifts in 0s. – Peter Cordes Aug 31 '22 at 01:01
@PeterCordes if i understand correctly i will implement above alghoritm just for the first part of the needle right? then there could be 8 false positives if mask is nonzero. (assuming needle can be like abcdabcd) ive found a way to use tzcnt but it requires extra loop and bit shifting and other arithmetics. wouldnt it be better to do 8 seperate traditional checks? – ömer hayyam Aug 31 '22 at 01:13
here is what ive came up with. not tested yet but you get the idea: var tcz = -4; do { tcz = (int)Bmi1.TrailingZeroCount((uint)imask >> (tcz + 4)); if (*(int*)(pInput + j + tcz + 4) == search32_2) return j + tcz; } while (tcz != 32); – ömer hayyam Aug 31 '22 at 01:15
After finding a partial match, you have to do more vector compares. The partial-match search vector should be `_mm256_set1_epi32(4 bytes from the string)`. So your inner loop only has to loop 4 times before moving on to the next 32 bytes, assuming the mask is all-zero at each offset, no partial match. – Peter Cordes Aug 31 '22 at 01:25
I have no idea what you're doing with a loop doing bit manipulation on some mask (from what? `_mm256_cmpeq_epi32` against the same `_mm256_set1_epi64` you're already using? That would be pointless, not ruling out any offsets any more quickly than `cmpeq_epi64`.) Especially unclear when you forgot to use backticks for code formatting, so the `*` characters became italics markdown. – Peter Cordes Aug 31 '22 at 01:27
@PeterGreen ofc im loading vectors with epi32 and loop 4 times.. problem is when doing 64bit, multiple matches was no problem. now i have find all matches to check for the rest of the needle. here again with backticks ``if (imask != 0) { var tcz = -4; do{ tcz = (int)Bmi1.TrailingZeroCount((uint)imask >> (tcz + 4)); if (*(int*)(pInput + j + tcz + 4) == search32_2) return j + tcz; } while (tcz != 32); }`` or i have to do 8 if checks like if(val & 0xF == 0xF and (val + 4)) == needle2 – ömer hayyam Aug 31 '22 at 01:37
Oh, you're doing scalar checks of the high half that follows each partial match? I wasn't suggesting that, that would be slower than doing two `_mm256_cmpeq_epi64` checks on vectors offset by 4 bytes, to see if there was a full match anywhere. And if so, then you know you're leaving the loop, if not you know you're not. You could `vpblendd` the two `_mm256_cmpeq_epi64` results together before `vpmovmskb` so there's only one movemask and branch. So each 4-bit chunk of the mask indicates whether an 8-byte match started there. – Peter Cordes Aug 31 '22 at 01:51
@PeterGreen ill look into that tomorrow. anything special with converting whole thing to 2bytes? also i have to movemask first _mm256_cmpeq_epi32 result before going further and movemask again after vplendd right? – ömer hayyam Aug 31 '22 at 02:12
Yes, two separate movemask checks, one early-out if there's not even a partial match. Nothing special about 2-byte partial-match checking; you could still `vpblendd` in pairs of `vpcmpeqq` results offset by 4 bytes, to reduce scalar branching work. – Peter Cordes Aug 31 '22 at 03:15
@PeterCordes afaiu after doing vpblendd ill have to check if match is in current offset or +4 offset (0x0F or 0xF0) after tzcnt. afterall first i have to check if im at the last index and can go 4bytes forward too (branch to slower path which adds to function size also).. not the mention i have to load 8byte needle for comparition. i think above loop is more efficient given partial matches are rare and first tzcnt will succeed. – ömer hayyam Aug 31 '22 at 10:41
I don't think any of those downsides are true if you're doing what I was picturing, but it's possible I'm missing something in my head. Checking for partial matches of length 4 lets you got twice as fast through parts of the array that don't have any matches. (4 inner iterations per 32 bytes instead of 8). If the first true match is usually farther out than 64 bytes or so, that's probably worth it. – Peter Cordes Aug 31 '22 at 11:20
@PeterCordes im not saying 4 byte match approach is worse. im talking about what to do after you have partial match. i think it will be hard for me to implement your vpblendd at least efficiently without bound checks etc. your code would help tremendously at this point. – ömer hayyam Aug 31 '22 at 11:44
The last location to check for a 4-byte partial match should be 8 bytes before the end of the whole destination array. So the last 32-byte vector ends 4 bytes before the end of the array, so if you find a partial match, there's room for a potential true match. As for `vpblendd`, that should interleave two `vpcmpeqq` results at 4-byte offsets relative to each other. Movemask on that gives you a bitmap of true-match positions which you can simply `tzcnt` if it's non-zero. You don't have to check anything further, you've already had `vpcmpeqq` produce a "true" element. – Peter Cordes Aug 31 '22 at 12:13
@PeterCordes limiting loop bound 4 bytes further is no problem if rest will be tested with scalar brute force etc. but i cant get vbplendd, the mask parameter shuold be 0xF0_0F_F0_0F.... right? then tzcnt should be tested and adjusted accordingly? i guess you dont have access to ide or online editor to write those yourself. it takes much less time this way. – ömer hayyam Aug 31 '22 at 12:26
`vpblendd` is a dword blend; you need `0b10101010`, aka `0xaa` – Peter Cordes Aug 31 '22 at 12:37
@PeterCordes ive added int32 version see my edit. but i really cant manage to complete int16 version. please help – ömer hayyam Aug 31 '22 at 19:54
I just noticed that `search` is 32 bytes long, not 8. I thought you were only looking for one 8-byte sequence. But you load it with `Avx2.LoadVector256`, not a broadcast load of just the first 8 bytes. You want `vpbroadcastq` for `vecSearch64`, and `vpbroadcastd` for `vecSearch32`. (And `vpbroadcastw` for `vecSearch16`). At least if you're implementing what I thought you were, `memmem(array, some_large_size, search, 8)` (https://man7.org/linux/man-pages/man3/memmem.3.html), for the special case of needle-length = 8. If you're looking for a 32-byte key, that's quite different. – Peter Cordes Aug 31 '22 at 20:04
@PeterCordes yes i was repeating needle 4 times myself, now ill convert it to use broadcast instead. actually im looking for both 2, 4, 8 bytes length needles but i use 8byte needles most. i think i can convert the one for 8byte to 4 and 2 easly . – ömer hayyam Aug 31 '22 at 20:16
@PeterCordes i arranged the code to use vpbroadcastq etc. accordingly. i still cant figure out 2bytes version to search 8bytes.. and how is searching for 4bytes needle different? – ömer hayyam Aug 31 '22 at 20:36
@PeterCordes afaics memmem implemented sclar. our version beat straightforward scalar search 4x for 64, 8x for 32bit version ,i want to see 16bit version but i cant proceed other than checking movemasks' 4 times (on average 2) – ömer hayyam Aug 31 '22 at 20:52
Yes, it seems glibc lacks a SIMD implementation of memmem specialized for small power-of-2 search sizes. The only hand-written asm implementation is for s390. I wasn't suggesting actually using it, just making sure I understood what search problem you were implementing. The 32-byte load for `search` was confusing (and the 32-byte string in the caller still is, for future readers, since you didn't change that.) – Peter Cordes Aug 31 '22 at 20:57
Searching for a 2-byte needle seems trivial; just two `vpcmpeqw` per 32 bytes, from even or odd offsets in the haystack. Plain brute force. The only question is whether you do one movemask from the OR or both compare results (and then sort out which of the two compare results actually had the match), or whether you just do 2x movemask and check them separately. – Peter Cordes Aug 31 '22 at 21:04
I guess it's non-trivial to efficiently decide whether the even mask or odd mask had the first match. The slow way is just `min(tzcnt(even_mask), 1+tzcnt(odd_mask))`. A byte-blend to interleave vectors could work (to get a 32-bit number you could bit-scan to find a byte index), but you'd want to avoid doing that for every not-found iteration. Oh right, the set bits would come in pairs, so we can do a scalar merge if `odd|even != 0`, like `tzcnt ( ((odd&0x55555555)<<1) | (even&0x55555555) )`. But that still requires two movemasks. Perhaps SIMD XOR as a NOT, then `vptest` as the loop exit? – Peter Cordes Aug 31 '22 at 21:06
@PeterCordes if i dont miss something you cant find exact match testing even and odd positions. suppose (maybe partial or full) match is at position 30. we have to look forward from 30 to 38 position or we will miss because next iteration starting from 32 wont look backwards. – ömer hayyam Aug 31 '22 at 21:42
If you've checked for a 2-byte needle in the 32 bytes starting at i and i+1, then the next vector to check is `i+=32`. Yes, that overlaps by 1 with the odd check from the previous iteration, just like how you have some overlap in your nested loops for an 8-byte needle. – Peter Cordes Aug 31 '22 at 22:19
@PeterCordes i think you forgot. needle is 8byte. we are testing first 2bytes first to speed things up (to make less iterations per 32). we do actual check if first mask is non zero. just like we did for 4byte search for 8byte needle.. – ömer hayyam Aug 31 '22 at 22:29
I thought we were talking about searching for needles of different sizes, like 2 bytes, since you mentioned that earlier. You keep saying "16-bit version" instead of "16-bit partial match check" so it's ambiguous what you're talking about. Anyway, looking for a 16-bit partial match should be the same; you still have to check every possible start point; I'm not seeing a problem with checking 2 vectors every 32 bytes for partial matches. It only differs in what you do after finding a (partial) match, e.g. some more `vpcmpeqq`. Merging vectors, or bitmaps after movemask, is still useful. – Peter Cordes Aug 31 '22 at 22:42
@PeterCordes i think you didnt look 3rd method above which i couldnt complete and asked for help. i already did what you suggest from start. problem is after finding partitial match. for 32bit it was easy, now i tossed rough rock. – ömer hayyam Aug 31 '22 at 22:51
@PeterCordes making more 4 vector loads, 4 vpcmpeqq's and blending masking possibly multiple times per partitial match seems wasteful to me. can we do better? we already have result of first mask. – ömer hayyam Aug 31 '22 at 23:01

Find index of unaligned int or long in byte array using SIMD

1 Answers1

Linked