Questions tagged [swar]

SWAR (short for SIMD Within A Register) is a technique for operating on data stored in an integer register in parallel

Further reading:

19 questions
80
votes
8 answers

Subtracting packed 8-bit integers in an 64-bit integer by 1 in parallel, SWAR without hardware SIMD

If I have a 64-bit integer that I'm interpreting as an array of packed 8-bit integers with 8 elements. I need to subtract the constant 1 from each packed integer while handling overflow without the result of one element affecting the result of…
cam-white
  • 710
  • 5
  • 9
23
votes
4 answers

How does this algorithm to count the number of set bits in a 32-bit integer work?

int SWAR(unsigned int i) { i = i - ((i >> 1) & 0x55555555); i = (i & 0x33333333) + ((i >> 2) & 0x33333333); return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; } I have seen this code that counts the number of bits equals to 1 in…
Gaith
  • 798
  • 8
  • 19
10
votes
1 answer

SIMD-within-a-register version of min/max

Suppose I have two uint16_t[4] arrays, a and b. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 aren't set. Then I have some code to find the minimum and maximum among a[i] and b[i] for each i: uint16_t min[4], max[4]; for…
swineone
  • 2,296
  • 1
  • 18
  • 32
8
votes
1 answer

How does this color blending trick that works on color components in parallel work?

I saw this Java code that does a perfect 50% blend between two RGB888 colors extremely efficiently: public static int blendRGB(int a, int b) { return (a + b - ((a ^ b) & 0x00010101)) >> 1; } That's apparently equivalent to extracting and…
Boann
  • 48,794
  • 16
  • 117
  • 146
6
votes
4 answers

How to implement SWAR unsigned less-than?

I'm trying to use uint64_t as if it was 8 lanes of uint8_ts; my goal is to implement a lane-by-lane less-than. This operation, given x and y, should produce a result with 0xFF in a lane if the value for the corresponding lane in x is less than the…
Koz Ross
  • 3,040
  • 2
  • 24
  • 44
5
votes
2 answers

Compare 64-bit integers by segments

I have two 64-bit integers x and y. Each of them represents 5 short unsigned integers: the first 10 bits represent the first integer, the next 13 bits represent the 2nd integer, the next 16 bits represent the 3rd integer, the next 14 bits represent…
user2961927
  • 1,290
  • 1
  • 14
  • 22
4
votes
4 answers

Multiplication of two packed signed integers in one

The Stockfish chess engine needs to store, for its evaluation, both an endgame score and a middlegame score. Instead of storing them separately, it packs them into one int. The middlegame score is stored in the lower 16 bits. The endgame score is…
Chayim Friedman
  • 47,971
  • 5
  • 48
  • 77
3
votes
1 answer

Performantly reverse the order of 16-bit quantities within a 64-bit word

I need to do a lexicographic comparison of a small number of small unsigned integers. If there are (for example) 8 8-bit integers, the obvious approach is to byteswap them and do an ordinary integer compare in a GPR. If there are 2 32-bit…
Moonchild
  • 483
  • 1
  • 4
  • 15
3
votes
2 answers

Can a register hold multiple values at a time?

In the case of a 64-bit x86 register, is it possible to hold more than one value at a time in the same register, if the size of an value is small enough such that multiple instructions could fit into a register? For example fitting two 32 bit ints…
Luke Davis
  • 143
  • 7
1
vote
2 answers

SWAR byte counting methods from 'Bit Twiddling Hacks' - why do they work?

Bit Twiddling Hacks contains the following macros, which count the number of bytes in a word x that are less than, or greater than, n: #define countless(x,n) \ (((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255) #define…
Koz Ross
  • 3,040
  • 2
  • 24
  • 44
1
vote
2 answers

Is there an efficient way of doing 32-bit bitwise rotation separately on the high and low 32-bit parts of a 64-bit number?

I'm currently working in C/C++, and I have a uint64_t. I need to do a bitwise rotation on the top 32 bits and the bottom 32 bits separately. So for example, if my input is | | | …
MNagy
  • 423
  • 7
  • 20
0
votes
1 answer

Speed up strlen using SWAR in x86-64 assembly

The asm function strlen receives the link to a string as a char - Array. To to so, the function may use SWAR on general purpose register, but without using xmm register or SSE instructions. The function checks with the bit manipulation: (v -…
HeapUnderStop
  • 378
  • 1
  • 9
0
votes
1 answer

How to check if a register contains a zero byte without SIMD instructions

Given a 64 Bit general purpose register (Not a xmm register) in x64 architecture, filled with one byte unsigned values. How can I check it for a zero value simultaneously without using SSE instructions? Is there a way to do so in a parallel way,…
HeapUnderStop
  • 378
  • 1
  • 9
0
votes
4 answers

Add two vectors (uint64_t type) with saturation for each int8_t element

I was recently faced with a given problem: There are 8 elements in the vector, each is represented by int8_t. Implement an algorithm in x86_64 that will add two vectors (uint64_t type). Adding elements should be done with saturation arithmetic…
0
votes
2 answers

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way I have a array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index…
user20746246
1
2