I did the integration task with the FPU before; now I'm struggling with SSE.

My main problem: on the FPU stack there was the fsin instruction, which operates on the number at the top of the stack (st0).

Now I want to calculate the sine of all four numbers in XMM0, or calculate it somewhere else and move the results into XMM0. I'm using AT&T syntax.

I think the second idea is actually possible, but I don't know how :)

Does anybody know how to do it?

Gabe
pawel
  • sinus? I don't think that means what you think it does (and it's not a verb). – Mahmoud Al-Qudsi May 13 '12 at 10:25
  • fsin doesn't sinus the value on top of the stack? – pawel May 13 '12 at 10:27
  • It's called 'sine' in English. – zch May 13 '12 at 11:03
  • This answer is relevant: http://stackoverflow.com/a/1845204/1256624 (Summary, SSE doesn't appear to provide a native `sin` instruction). Also, this page looks like it might help: http://gruntthepeon.free.fr/ssemath/ – huon May 13 '12 at 11:06
  • @dbaupp I know that SSE doesn't provide it, but maybe you know how to insert values from the FPU stack into xmm0? – pawel May 13 '12 at 11:20
  • Google turns up [this](http://www.asmcommunity.net/board/index.php?topic=30778.0). (The second link I provided up above appears to have implementations of `sin`/`cos`/etc in SSE; these may even be more performant, due to vectorization and SSE generally being better etc.) – huon May 13 '12 at 11:27
  • The fsin (etc) instructions were pretty bad anyway. Only useful when optimizing for size - and in that case you probably won't be using SSE. This may be useful: http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/ (add reduction if you're outside the range) – harold May 13 '12 at 12:46

1 Answer

Three options:

  1. Use an existing library that computes sin on SSE vectors.
  2. Write your own vector sin function using SSE.
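  3. Store the vector to memory, use fsin to compute the sine of each element, and load the results back. Assuming that your stack is 16-byte aligned and has 16 bytes of free space at (%rsp), something like this:

       movaps  %xmm0, (%rsp)          # spill the four floats to the stack
       mov     $3,     %rcx           # element index, counting 3..0
    0: flds   (%rsp,%rcx,4)           # push element rcx onto the x87 stack
       fsin                           # st0 = sin(st0)
       fstps  (%rsp,%rcx,4)           # pop the result back into its slot
       sub     $1,     %rcx
       jns     0b                     # loop until rcx goes negative
       movaps  (%rsp), %xmm0          # reload the four results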
    

(1) is almost certainly your best bet performance-wise, and is also the easiest. If you have significant experience writing vector code and know a priori that the arguments fall into some range, you may be able to get better performance with (2). Using fsin, as in (3), will work, but it's ugly, slow, and not particularly accurate, if that matters.
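To give a rough idea of what (2) can look like, here is a minimal sketch in C with SSE intrinsics rather than hand-written assembly. The function name `sin4_poly` is made up for illustration, and the sketch assumes every element already satisfies |x| <= pi/2; it does no range reduction. It evaluates the Taylor polynomial up to x^9 in Horner form, so the error should be a few times 1e-6 near the ends of that range and smaller toward zero. A tuned minimax polynomial from a real library would do better.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper (the name is illustrative, not from any library).
   Approximates sin(x) for four packed floats, ASSUMING |x| <= pi/2 in
   every lane. Taylor series x - x^3/3! + x^5/5! - x^7/7! + x^9/9!,
   evaluated in Horner form on x^2. No range reduction is performed. */
static __m128 sin4_poly(__m128 x)
{
    const __m128 c3  = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5  = _mm_set1_ps( 1.0f / 120.0f);
    const __m128 c7  = _mm_set1_ps(-1.0f / 5040.0f);
    const __m128 c9  = _mm_set1_ps( 1.0f / 362880.0f);
    const __m128 one = _mm_set1_ps(1.0f);

    __m128 x2 = _mm_mul_ps(x, x);            /* x^2 in all four lanes */
    __m128 p  = c9;
    p = _mm_add_ps(_mm_mul_ps(p, x2), c7);   /* c9*x^2 + c7 */
    p = _mm_add_ps(_mm_mul_ps(p, x2), c5);   /* ... + c5 */
    p = _mm_add_ps(_mm_mul_ps(p, x2), c3);   /* ... + c3 */
    p = _mm_add_ps(_mm_mul_ps(p, x2), one);  /* ... + 1 */
    return _mm_mul_ps(p, x);                 /* x * p(x^2) */
}
```

Each `_mm_mul_ps` / `_mm_add_ps` corresponds to a `mulps` / `addps` instruction, so translating this to AT&T assembly is mostly a matter of keeping the constants in 16-byte-aligned memory. For arguments outside the assumed range you would need to reduce them first, which is where most of the real work in a library implementation goes.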

Stephen Canon