
I currently have the following code:

float a[4] = { 10, 20, 30, 40 };
float b[4] = { 0.1, 0.1, 0.1, 0.1 };
asm volatile("movups (%0), %%xmm0\n\t"
             "mulps (%1), %%xmm0\n\t"             
             "movups %%xmm0, (%1)"             
             :: "r" (a), "r" (b));

I have first of all a few questions:

(1) If I were to align the arrays on 16-byte boundaries, would it even work? Since the arrays are allocated on the stack, is it true that aligning them is nearly impossible?

See the selected answer to this post: Are stack variables aligned by the GCC __attribute__((aligned(x)))?

(2) Could the code be refactored at all to make it more efficient? What if I put both float arrays in registers rather than just one?

Thanks

horseyguy

6 Answers


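Write it in C and compile with

gcc -S -mssse3

if you have a fairly recent version of gcc. Add an optimization flag such as -O3 as well, since gcc does not auto-vectorize at -O0.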

xcramps
  • what C code would compile to those sse instructions? do you have an example? – horseyguy Aug 04 '09 at 13:01
  • float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; int foo(void) { int i; for (i=0; i < 4; i++) a[i] *= b[i]; } Compile as shown and examine the .s file. – xcramps Aug 04 '09 at 13:10

Does GCC support the __m128 data type? If so, that's your best bet for guaranteeing a 16-byte-aligned data type. Failing that, there is __attribute__((aligned(16))) for aligning things. Define your arrays as follows:

float a[4] __attribute__((aligned(16))) = { 10, 20, 30, 40 };
float b[4] __attribute__((aligned(16))) = { 0.1, 0.1, 0.1, 0.1 };

and then use movaps instead of movups :)

Bastien Léonard
Goz
  • thanks; but as stated in this article http://stackoverflow.com/questions/841433/gcc-attributealignedx-explanation it seems impossible to align arrays that are allocated on the stack? (as opposed to global arrays allocated in .data) – horseyguy Aug 04 '09 at 12:44
  • thanks for the fix Bastien :) Banister ... can you give it a try and see what happens? If that linked to explanation is right then it would be impossible to align things like double correctly, yet they DO get aligned. – Goz Aug 04 '09 at 12:55
  • yes i will soon...I have a feeling the linked explanation is wrong, as everyone in this question seems to imply. thanks everyone! :) – horseyguy Aug 04 '09 at 12:58

If I were to align the arrays on 16-byte boundaries, would it even work? Since the arrays are allocated on the stack, is it true that aligning them is nearly impossible?

Alignment on the stack has to work; otherwise intrinsics would not work. I would guess the problem in the post you quoted had to do with the exorbitant alignment value chosen there.

Regarding (2):

No, there shouldn't be a difference in performance. See this site for the instruction timings of several processors.


How alignment of stack variables works:

push    ebp
mov     ebp, esp
and     esp, -16                ; fffffff0H
sub     esp, 200                ; 000000c8H

The and instruction rounds the stack pointer down to a 16-byte boundary before the space for the locals is reserved.

Christopher

(1) If I were to align the arrays on 16-byte boundaries, would it even work? Since the arrays are allocated on the stack, is it true that aligning them is nearly impossible?

No, it's quite simple to align the stack pointer using and:

and esp, 0xFFFFFFF0 ; aligned on a 16-byte boundary

But you should use what GCC provides, such as a 16-byte type like __m128, or __attribute__((aligned(16))) to customize alignment.

Bastien Léonard
  • thanks for your answer, would you be able to explain to me how you can use 'and' for alignment? i dont quite 'get' it :) – horseyguy Aug 05 '09 at 16:33
  • Recall that `some_bit and 0 = 0` and `a/16 = a>>4` if a is unsigned. Using `and` like this will set the four least significant bits to zero, and leave the others unchanged. What happens if you divide `esp` by 16, actually? It gets right-shifted by 4, and the four “lost” bits are the remainder. Thus those four bits should be 0, so that `esp` is divisible by 16. What really happens is that it subtracts *at most* 15, so that `esp` % 16 == 0. (Subtracting from `esp` means allocating more space on the stack). – Bastien Léonard Aug 05 '09 at 16:56

Using intrinsics is much faster, especially with optimization. I wrote a simple test to compare both versions (asm and intrinsics):

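#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, _mm_set_ps, _mm_mul_ps */

int main(void)
{
    unsigned long long time1;
    __m128 a1, b1;

    a1 = _mm_set_ps(10, 20, 30, 40);
    b1 = _mm_set_ps(0.1, 0.1, 0.1, 0.1);
    float a[4] = { 10, 20, 30, 40 };
    float b[4] = { 0.1, 0.1, 0.1, 0.1 };

    /* Time the intrinsic version. Note: a1 is never read afterwards,
       so an optimizing compiler may drop this multiply entirely. */
    time1 = __rdtsc();
    a1 = _mm_mul_ps(a1, b1);
    time1 = __rdtsc() - time1;
    printf("Time: %llu\n", time1);

    /* Time the inline-asm version. The clobber list tells gcc that
       xmm0 and memory (the b array) are modified. */
    time1 = __rdtsc();
    asm volatile("movups (%0), %%xmm0\n\t"
                 "mulps (%1), %%xmm0\n\t"
                 "movups %%xmm0, (%1)"
                 :: "r" (a), "r" (b)
                 : "xmm0", "memory");
    time1 = __rdtsc() - time1;
    printf("Time: %llu\n", time1);

    return 0;
}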

Results: the intrinsic version took 50-60 timestamp-counter ticks, the asm version ~1000 ticks.

You can test it on your machine.

AlekseyM

As for refactoring: you can use intrinsics. Example:

#include <emmintrin.h>

int main(void)
{
    __m128 a1,b1;

    a1=_mm_set_ps(10, 20,30,40);
    b1=_mm_set_ps(0.1, 0.1, 0.1, 0.1);

    a1=_mm_mul_ps(a1,b1);

    return 0;
}

With gcc optimization enabled (-O2 or -O3), it may run faster than the asm version.

bluish
AlekseyM