I would like to convert this code from 32-bit NASM with SSE to 64-bit NASM with AVX. Is there an easy way to do it?

To get 64-bit code, I could rewrite the 32-bit code completely. However, I assume that is a lot of work, and I suppose there is an almost automatic way to do everything.

Are you aware of any such process? For example, renaming the registers?

Example:

  • Change eax to rax, ebx to rbx, and so on...
  • Change movaps to vmovaps, and so on...
  • ...

Here is my 32-bit NASM source code:

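section .text
global test

  a           equ     8           ; stack offsets of the arguments, relative to ebp
  b           equ     12
  num         equ     16
  spuri       equ     20
  result      equ     24

test:
  push        ebp
  mov         ebp, esp
  push        ebx                 ; save callee-saved registers
  push        esi
  push        edi

  mov         esi, [ebp+a]        ; esi = pointer to first array
  mov         edi, [ebp+b]        ; edi = pointer to second array
  mov         ebx, 0              ; ebx = byte offset into both arrays
  mov         ecx, [ebp+num]      ; ecx = loop counter (one iteration per 16 bytes)
  mov         edx, [ebp+spuri]    ; loaded but unused in this excerpt
  mov         eax, [ebp+result]   ; eax = pointer to the output
  xorps       xmm1, xmm1          ; xmm1 = accumulator
  xorps       xmm3, xmm3          ; xmm3 = 0, never modified afterwards

loop1:
  cmp         ecx, 0
  je          end
  movups      xmm0, [esi+ebx]     ; load 4 floats from each array
  movups      xmm6, [edi+ebx]
  subps       xmm0, xmm6          ; a - b
  mulps       xmm0, xmm0          ; (a - b)^2
  sqrtps      xmm0, xmm0          ; sqrt((a - b)^2) = |a - b|
  addps       xmm1, xmm0          ; accumulate per element
  add         ebx, 16             ; advance by one 16-byte vector
  dec         ecx
  jnz         loop1

end:
  haddps      xmm1, xmm1          ; two horizontal adds leave the total
  haddps      xmm1, xmm1          ; in all four elements
  addps       xmm1, xmm3
  movups      [eax], xmm1         ; store 4 floats to the output

  pop         edi                 ; restore callee-saved registers
  pop         esi
  pop         ebx
  mov         esp, ebp
  pop         ebp
  ret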

I have tried to convert this code, but without success.

zx485
A.Berg
  • Why not just switch to intrinsics ? It's a fairly simple operation and it would be much more easily maintained, debugged and also more portable if re-written using C and intrinsics. (You could even just try expressing it in straightforward scalar C code first, and see if the compiler will auto-vectorize it for you.) – Paul R May 20 '16 at 14:49
  • Also the `spuri` parameter seems to be unused ? (EDIT: OK - it looks like this is really an update to [your previous question](http://stackoverflow.com/questions/37257665/c-calling-conventions-32bit-to-nasm-with-float-movups-movupd-difference) where there was another loop after `loop1` ?) – Paul R May 20 '16 at 14:58
  • This is only part of the code. I can't use intrinsics, only SSE. How can I convert this code to 64-bit? – A.Berg May 20 '16 at 15:02
  • Yes Paul, this is only an example code – A.Berg May 20 '16 at 15:03
  • Um, intrinsics *are* SSE, but they let you write the SSE code in C (or C++) rather than in assembler - makes life a lot easier, especially if you're not experienced with writing assembler. More portable too - switching from 32 bit to 64 bit then only requires a re-compile - no code changes. – Paul R May 20 '16 at 15:03
  • Whoever wrote this code must have been insane - it looks like they are using `mulps` followed by `sqrtps` because they didn't know how to implement `fabsf()` ??? Unoptimised scalar C code would have been more efficient. – Paul R May 20 '16 at 15:28

1 Answer

Porting 32-bit to 64-bit is orthogonal to porting SSE to 256-bit AVX. However, doing both at once does mean you only have to go over every line of the code once, instead of once for each task.

SSE->AVX: if you have any shuffles or horizontal operations, it's tricky, because the 256-bit AVX versions of existing SSE instructions essentially do two separate 128-bit operations in the two "lanes".

32->64b: The ABI is different, in terms of arg passing and which regs you need to save/restore. Pointers generally need to go in 64-bit registers, but use 32-bit registers for everything else when possible. Writing a 32-bit reg zero-extends to the full 64-bit reg, so you still zero a register with `xor eax,eax`. `xor rax,rax` wastes an instruction byte on a REX prefix. If you aren't going to care about this level of optimization, just write it in C.
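To make the lane issue concrete for the two `haddps` instructions at the end of the question's code, here is a scalar C model of how the 256-bit `vhaddps` behaves. The function names `hadd256_model` and `hsum256_model` are made up for illustration; this is a sketch of the instruction's documented element ordering in plain C, not real SIMD code.

```c
/* Scalar model of the 256-bit "vhaddps dst, a, b" on 8 floats.
   Elements 0-3 form the low 128-bit lane, elements 4-7 the high lane.
   Each lane is processed independently, exactly like a 128-bit haddps. */
void hadd256_model(float dst[8], const float a[8], const float b[8])
{
    float tmp[8];  /* temporary, so dst may alias a or b */
    for (int lane = 0; lane < 8; lane += 4) {
        tmp[lane + 0] = a[lane + 0] + a[lane + 1];
        tmp[lane + 1] = a[lane + 2] + a[lane + 3];
        tmp[lane + 2] = b[lane + 0] + b[lane + 1];
        tmp[lane + 3] = b[lane + 2] + b[lane + 3];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* Horizontal sum of 8 floats: two in-lane hadds give one partial sum
   per lane, so an extra cross-lane add is needed at the end. */
float hsum256_model(const float v[8])
{
    float t[8];
    hadd256_model(t, v, v);
    hadd256_model(t, t, t);
    return t[0] + t[4];  /* low-lane sum + high-lane sum */
}
```

After the two in-lane steps, element 0 holds only the low lane's sum and element 4 only the high lane's sum. In real AVX code the final cross-lane add would typically be done by extracting the high half (for example with `vextractf128`) and adding it to the low half.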

Use RIP-relative addressing for static/global data.

See the tag wiki for links.


If you don't have anyone better at asm than the people that wrote `mulps xmm0, xmm0` / `sqrtps xmm0, xmm0` (a slow way to compute an absolute value, which a single `andps` with a mask that clears the sign bit does), then give up on ASM and rewrite your code in C. Your biggest speedups will come from fixing stuff like that in your existing codebase, even more than the potential factor-of-two speedup from doubling the vector width.

Any good optimizing compiler can auto-vectorize many simple scalar loops pretty well these days. Use an up-to-date version of gcc or clang.
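As a rough sketch of what such a scalar loop could look like for the question's code: the function below sums |a[i] - b[i]|, which is what the `subps` / `mulps` / `sqrtps` / `addps` loop computes per element. The name `abs_diff_sum` and the choice to count `n` in floats and return the sum (rather than store it four times through a pointer) are assumptions made for this example.

```c
#include <math.h>
#include <stddef.h>

/* Sum of |a[i] - b[i]| over n floats.
   fabsf replaces the multiply + square root pair from the asm. */
float abs_diff_sum(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += fabsf(a[i] - b[i]);
    return sum;
}
```

One caveat: vectorizing a floating-point sum changes the order of the additions, so gcc only auto-vectorizes this reduction when that is allowed, for example with `-O3 -ffast-math`. Building with `-m32` or `-m64`, with or without `-mavx`, then needs no source changes.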

You can also use C/C++ intrinsics, so you can build 32-bit or 64-bit executables with the same manual vectorization.
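A minimal sketch of the intrinsics route, using only SSE intrinsics from `<xmmintrin.h>`. The function name `abs_diff_sum_sse`, the element-count parameter `n`, and the scalar tail loop are assumptions for this example, not something taken from the question. It is x86-only, and a 32-bit build may need `-msse`.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum of |a[i] - b[i]| over n floats, 4 floats per iteration. */
float abs_diff_sum_sse(const float *a, const float *b, size_t n)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* only the sign bits set */
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        d = _mm_andnot_ps(sign_mask, d);          /* clear sign bits: |d| */
        acc = _mm_add_ps(acc, d);
    }

    /* Horizontal sum of the four partial sums. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        float d = a[i] - b[i];
        sum += d < 0.0f ? -d : d;
    }
    return sum;
}
```

The same source compiles for 32-bit and 64-bit targets. The compiler handles register allocation and the calling convention, which is the part of the port that would otherwise have to be redone by hand.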

Peter Cordes
  • Yes indeed - the correct solution to this [XY problem](http://meta.stackexchange.com/q/66377/142198) is to back away from the original inefficient and badly-written 32 bit assembly code and get a C compiler to do the heavy lifting where possible, and then failing that use intrinsics to do explicit SIMD code as a last resort. Solves all the portability and efficiency problems in one go, and makes the code a lot more robust, maintainable and portable in the long run. – Paul R May 21 '16 at 07:49