What's the most efficient way to multiply 4 floats by 4 floats using SSE?

Question

I currently have the following code:

float a[4] = { 10, 20, 30, 40 };
float b[4] = { 0.1, 0.1, 0.1, 0.1 };
asm volatile("movups (%0), %%xmm0\n\t"
             "mulps (%1), %%xmm0\n\t"             
             "movups %%xmm0, (%1)"             
             :: "r" (a), "r" (b));

First of all, I have a few questions:

(1) If I were to align the arrays on 16-byte boundaries, would it even work? Since the arrays are allocated on the stack, is it true that aligning them is near impossible?

See the selected answer for this post: Are stack variables aligned by the GCC __attribute__((aligned(x)))?

(2) Could the code be refactored at all to make it more efficient? What if I put both float arrays in registers rather than just one?

Thanks

score 7 · Answer 1 · answered Aug 04 '09 at 12:56

Write it in C and compile with

gcc -S -mssse3

if you have a fairly recent version of gcc.

answered Aug 04 '09 at 12:56

xcramps

what C code would compile to those sse instructions? do you have an example? – horseyguy Aug 04 '09 at 13:01
float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; void foo(void) { int i; for (i=0; i < 4; i++) a[i] *= b[i]; } Compile as shown and examine the .s file. – xcramps Aug 04 '09 at 13:10
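
The loop from the comment above can also be written as a standalone function. This is only a sketch: the name mul_arrays and the use of restrict are assumptions, not part of the original comment. GCC normally needs optimization enabled (for example -O3) before it auto-vectorizes a loop like this, so the generated .s file may only show mulps with such a flag.

```c
#include <stddef.h>

/* Hypothetical helper: multiplies dst[i] by src[i] in place for n elements.
   `restrict` tells the compiler the two arrays do not overlap, which
   makes the loop easier to auto-vectorize. */
void mul_arrays(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] *= src[i];
}
```

Calling mul_arrays(a, b, 4) on the two arrays from the question performs the same element-wise multiply as the inline asm.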

score 1 · Answer 2 · edited Aug 08 '09 at 16:16

Does GCC provide support for the __m128 data type? If so, that's your best option for guaranteeing a 16-byte-aligned data type. Alternatively, there is __attribute__((aligned(16))) for aligning things. Define your arrays as follows:

float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

and then use movaps instead :)
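
As a rough illustration of the aligned-array approach, the sketch below declares the arrays on the stack with the attribute, checks their addresses, and uses the aligned load/store intrinsics (which correspond to movaps). The helper names is_aligned16, mul4_aligned and demo_stack_aligned are made up for this example, and 0.5 replaces 0.1 only so that the results are exact.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Multiplies four floats by four floats in place.
   Both pointers must be 16-byte aligned, because _mm_load_ps and
   _mm_store_ps are the aligned (movaps-style) accesses. */
void mul4_aligned(float *dst, const float *src)
{
    __m128 x = _mm_load_ps(dst);
    __m128 y = _mm_load_ps(src);
    _mm_store_ps(dst, _mm_mul_ps(x, y));
}

/* Hypothetical demo: aligned arrays that live on the stack.
   Copies the products to out and returns 1 if both arrays were aligned. */
int demo_stack_aligned(float out[4])
{
    float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
    float b[4] __attribute__((aligned(16))) = { 0.5f, 0.5f, 0.5f, 0.5f };
    int ok = is_aligned16(a) && is_aligned16(b);

    if (ok)
        mul4_aligned(a, b);
    for (int i = 0; i < 4; i++)
        out[i] = a[i];
    return ok;
}
```

If demo_stack_aligned returns 1, the compiler did place both stack arrays on 16-byte boundaries.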

edited Aug 08 '09 at 16:16

Bastien Léonard

answered Aug 04 '09 at 12:38

Goz

thanks; but as stated in this article http://stackoverflow.com/questions/841433/gcc-attributealignedx-explanation it seems impossible to align arrays that are allocated on the stack? (as opposed to global arrays allocated in .data) – horseyguy Aug 04 '09 at 12:44
thanks for the fix Bastien :) Banister ... can you give it a try and see what happens? If that linked to explanation is right then it would be impossible to align things like double correctly, yet they DO get aligned. – Goz Aug 04 '09 at 12:55
yes i will soon...I have a feeling the linked explanation is wrong, as everyone in this question seems to imply. thanks everyone! :) – horseyguy Aug 04 '09 at 12:58

score 1 · Accepted Answer · answered Aug 04 '09 at 12:44

if i WAS to align the arrays on 16 byte boundaries, would it even work? Since the arrays are allocated on the stack is it true that aligning them is near impossible?

Alignment on the stack has to work; otherwise intrinsics would not work. I would guess the problem in the post you quoted had to do with the exorbitant alignment value the poster selected.

Regarding (2):

No, there shouldn't be a difference in performance. See this site for the instruction timings of several processors.

How alignment of stack variables works:

push    ebp
mov     ebp, esp
and     esp, -16                ; fffffff0H
sub     esp, 200                ; 000000c8H

The and instruction aligns the stack pointer to a 16-byte boundary.

score 1 · Answer 4 · answered Aug 04 '09 at 12:45

(1) if i WAS to align the arrays on 16 byte boundaries, would it even work? Since the arrays are allocated on the stack is it true that aligning them is near impossible?

No, it's quite simple to align the stack pointer using and:

and esp, 0xFFFFFFF0 ; aligned on a 16-byte boundary

But you should use what GCC provides, such as a 16-byte type, or __attribute__ to customize alignment.

answered Aug 04 '09 at 12:45

Bastien Léonard

thanks for your answer, would you be able to explain to me how you can use 'and' for alignment? i dont quite 'get' it :) – horseyguy Aug 05 '09 at 16:33
Recall that `some_bit and 0 = 0` and `a/16 = a>>4` if a is unsigned. Using `and` like this will set the four least significant bits to zero, and leave the others unchanged. What happens if you divide `esp` by 16, actually? It gets right-shifted by 4, and the four “lost” bits are the remainder. Thus those four bits should be 0, so that `esp` is divisible by 16. What really happens is that it subtracts *at most* 15, so that `esp` % 16 == 0. (Subtracting from `esp` means allocating more space on the stack). – Bastien Léonard Aug 05 '09 at 16:56
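
The masking described in the comment above can be tried out in plain C. This is a minimal sketch; align_down16 is a hypothetical name, and it operates on an ordinary integer rather than on the real stack pointer.

```c
#include <stdint.h>

/* Rounds x down to the previous multiple of 16 by clearing its four
   least significant bits, the same effect `and esp, -16` has on esp. */
uintptr_t align_down16(uintptr_t x)
{
    return x & ~(uintptr_t)15;
}
```

For example, align_down16(0x1237) gives 0x1230: the value dropped by 7, which is the remainder of 0x1237 divided by 16, and a value that is already a multiple of 16 is left unchanged.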

score 1 · Answer 5 · answered Nov 23 '12 at 08:42

Using intrinsics is much faster, especially with optimization. I wrote a simple test to compare both versions (asm and intrinsics):

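#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc and the SSE intrinsics (GCC/Clang) */

int main(void)
{
    unsigned long long time1;
    __m128 a1, b1;

    a1 = _mm_set_ps(10, 20, 30, 40);
    b1 = _mm_set_ps(0.1, 0.1, 0.1, 0.1);
    float a[4] = { 10, 20, 30, 40 };
    /* mulps with a memory operand needs a 16-byte-aligned address */
    float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

    /* intrinsic version */
    time1 = __rdtsc();
    a1 = _mm_mul_ps(a1, b1);
    time1 = __rdtsc() - time1;
    printf("Time: %llu\n", time1);

    /* inline asm version; xmm0 and memory are declared as clobbered */
    time1 = __rdtsc();
    asm volatile("movups (%0), %%xmm0\n\t"
                 "mulps (%1), %%xmm0\n\t"
                 "movups %%xmm0, (%1)"
                 :: "r" (a), "r" (b)
                 : "xmm0", "memory");
    time1 = __rdtsc() - time1;
    printf("Time: %llu\n", time1);

    return 0;
}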

Intrinsic version: 50-60 processor timestamps.
Asm version: ~1000 processor timestamps.

You can test it on your machine.
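
A single pair of __rdtsc() reads around one instruction mostly measures the overhead of the reads themselves, and a compiler may evaluate a multiply of constants at compile time, so the figures above are hard to interpret. As a rough sketch only (not the answerer's method), the intrinsic can be timed over many iterations. The name time_mul_ps and the iteration count are hypothetical, and the empty asm statement is a GCC/Clang-specific assumption used to keep the multiply inside the loop.

```c
#include <x86intrin.h>   /* __rdtsc and the SSE intrinsics (GCC/Clang) */

/* Multiplies the four floats at a by the four floats at b, iters times,
   writes the final products back to a, and returns the elapsed
   timestamp-counter ticks for the whole loop. */
unsigned long long time_mul_ps(float *a, const float *b, int iters)
{
    __m128 x = _mm_loadu_ps(a);   /* unaligned loads: any pointer works */
    __m128 y = _mm_loadu_ps(b);

    unsigned long long start = __rdtsc();
    for (int i = 0; i < iters; i++) {
        x = _mm_mul_ps(x, y);
        /* Empty asm with x as an in/out SSE operand: keeps the compiler
           from folding the multiplies or moving them out of the loop. */
        __asm__ volatile("" : "+x"(x));
    }
    unsigned long long elapsed = __rdtsc() - start;

    _mm_storeu_ps(a, x);
    return elapsed;
}
```

Dividing the returned value by iters gives an approximate per-multiply cost; the multiplies form a dependency chain, so this reflects latency rather than throughput.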

score 0 · Answer 6 · edited Nov 23 '12 at 08:06

About refactoring: you can use intrinsics. Example:

#include <emmintrin.h>

int main(void)
{
    __m128 a1,b1;

    a1=_mm_set_ps(10, 20,30,40);
    b1=_mm_set_ps(0.1, 0.1, 0.1, 0.1);

    a1=_mm_mul_ps(a1,b1);

    return 0;
}

With gcc optimization enabled (-O2, -O3), it may run faster than the asm version.
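
In the example above the product is never stored or used, so an optimizing compiler may remove the multiply entirely. A sketch that works on float arrays, as in the question, could look like the following. The names mul4 and set_order_demo are hypothetical. The second function illustrates that _mm_set_ps takes its arguments from the highest element down, so it is _mm_setr_ps(10, 20, 30, 40) that matches the memory layout of float a[4] = { 10, 20, 30, 40 }.

```c
#include <xmmintrin.h>

/* Multiplies four floats by four floats in place.
   Uses unaligned loads/stores, so the pointers need no special alignment. */
void mul4(float *dst, const float *src)
{
    __m128 x = _mm_loadu_ps(dst);
    __m128 y = _mm_loadu_ps(src);
    _mm_storeu_ps(dst, _mm_mul_ps(x, y));
}

/* Stores the same four constants built with _mm_set_ps and _mm_setr_ps
   so their element order in memory can be compared. */
void set_order_demo(float out_set[4], float out_setr[4])
{
    _mm_storeu_ps(out_set,  _mm_set_ps(10, 20, 30, 40));   /* memory order: 40, 30, 20, 10 */
    _mm_storeu_ps(out_setr, _mm_setr_ps(10, 20, 30, 40));  /* memory order: 10, 20, 30, 40 */
}
```

Because the multiply is element-wise and b holds four identical values, the order does not change the products in the original example, but it matters as soon as the lanes are read back individually.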

edited Nov 23 '12 at 08:06

bluish

answered Nov 23 '12 at 07:38

AlekseyM

how much faster do you reckon it would run? could you benchmark? – toxicate20 Nov 23 '12 at 08:08
see the next post, I've tested it – AlekseyM Nov 23 '12 at 08:54
