I'm trying to implement _mm_add_epi32 in Go assembly, optionally with the help of avo. I know little about assembly and don't know where to start. Could you give me some hints or example code? Thank you all.

Here's the equivalent, slower pure-Go version:

func add(x, y []uint32) []uint32 {
    if len(x) != len(y) {
        return nil
    }

    result := make([]uint32, len(x))
    for i := 0; i < len(x); i++ {
        result[i] = x[i] + y[i]
    }
    return result
}

I know that the instruction paddd xmm, xmm (vpaddd for YMM registers) is what we need, but I do not know how to load a []uint32 slice into a 256-bit YMM register.

Peter Cordes
shenwei356
  • The function you have there is a bit different from `_mm_add_epi32` in that it takes slices of arbitrary lengths. Is this intentional? – fuz Aug 04 '20 at 08:38
  • Thanks for your reply, using `[]uint32` or `[8]uint32` are both OK, the first is for keeping data structure consistency with function caller. – shenwei356 Aug 04 '20 at 08:49
  • Well, supporting slices of arbitrary length makes the code quite a bit more complex, so it's best if you would pick one of the two so I can write an answer. – fuz Aug 04 '20 at 08:56
  • `[8]uint32` is good enough, thank you in advance. – shenwei356 Aug 04 '20 at 08:59

1 Answer

Here's an example for such an addition function:

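    // func add(x, y [8]int32) (q [8]int32)
    // q = x + y
    // Frame: no locals, 96 bytes of arguments and result (3 * 32).
TEXT ·add(SB),0,$0-96
    VMOVDQU x+0(FP), Y0         // load the 8 lanes of x (unaligned load)
    VPADDD  y+32(FP), Y0, Y0    // Y0 += y, with the load of y folded in
    VMOVDQU Y0, q+64(FP)        // store the sum into the result slot
    VZEROUPPER                  // clear upper YMM halves before returning
    RET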

Before reading this code, familiarise yourself with this document. Unfortunately, Go-style assembly (aka Plan 9-style assembly) is poorly documented.

Arrays are passed on the stack by value. The return value occupies a stack slot directly after the last argument: the function writes it there and the caller reads it back. Use (FP) as documented in the document I linked to access function arguments.
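
To see where the offsets 0, 32 and 64 come from, here is a minimal Go sketch. It assumes amd64 and the stack-based convention that assembly functions use; `frameOffsets` is a hypothetical helper for illustration, not part of any library.

```go
package main

import (
	"fmt"
	"unsafe"
)

// frameOffsets returns the byte offsets, relative to FP, of the two
// arguments and the result of a function declared as
//
//	func add(x, y [8]int32) (q [8]int32)
//
// Assumption: arguments and result are laid out back to back on the
// stack, which holds here because each array is 32 bytes.
func frameOffsets() (x, y, q uintptr) {
	var arr [8]int32
	size := unsafe.Sizeof(arr) // 8 lanes * 4 bytes = 32 bytes
	return 0, size, 2 * size
}

func main() {
	x, y, q := frameOffsets()
	fmt.Printf("x+%d(FP) y+%d(FP) q+%d(FP)\n", x, y, q)
	// prints "x+0(FP) y+32(FP) q+64(FP)"
}
```

The three 32-byte slots add up to the 96 bytes of arguments and result that the function touches.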

Apart from that, it's pretty straightforward. The syntax is similar (but not identical) to AT&T syntax. Note that the register names are different and giving a size suffix is mandatory.

As you can see, writing an assembly function for a single operation is pretty pointless. It's probably going to work a lot better to take the algorithm you need and write it completely in assembly.
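
If you do keep a routine like this, a plain Go version with the same fixed-size signature is handy as a reference to test against, or as a fallback on CPUs without AVX2. The following is only a sketch; the name `addGeneric` is made up for illustration.

```go
package main

import "fmt"

// addGeneric is a hypothetical pure Go reference for an 8-lane
// 32-bit addition. Each lane is added independently and wraps
// around on overflow, which matches what VPADDD does.
func addGeneric(x, y [8]int32) (q [8]int32) {
	for i := range x {
		q[i] = x[i] + y[i]
	}
	return q
}

func main() {
	x := [8]int32{1, 2, 3, 4, 5, 6, 7, 8}
	y := [8]int32{10, 20, 30, 40, 50, 60, 70, 80}
	fmt.Println(addGeneric(x, y)) // prints "[11 22 33 44 55 66 77 88]"
}
```

A test can then feed the same inputs to both versions and compare the resulting arrays with `==`.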

fuz
  • Thank you so much, I'll give feedback ASAP. – shenwei356 Aug 04 '20 at 10:25
  • Thank you @fuz, it works!!! > As you can see, writing an assembly function for a single operation is pretty pointless. You're right, actually I've optimized some other parts of my application, and this looked like the last part that could be optimized, so I tried this. Thank you very much. – shenwei356 Aug 04 '20 at 11:50
  • @WeiShen If you post some more context, I might be able to provide better optimisation advice. Feel free to make a new question and link it here so I can find it. Also consider using `go tool pprof` to find what parts of the application are actually relevant to performance. – fuz Aug 04 '20 at 11:59
  • Yes, I use pprof a lot. Here's the [pprof result](https://gist.github.com/shenwei356/35d336dbb273c1e03e625b6034267c39#file-__mm_add_epi32-md) – shenwei356 Aug 04 '20 at 12:08
  • @WeiShen Line breaks are unfortunately not allowed in comments, so your code section comes out pretty garbled. If you could show me the context in which you want to use this function, I might be able to provide even better optimisation advice. As I said earlier, it's probably best to make a new question for that. – fuz Aug 04 '20 at 12:11
  • Hi @fuz, I published another post; I hope you can read it: https://stackoverflow.com/questions/63248047/golang-assembly-implement-for-inplace-updating-8uint32-with-another-one – shenwei356 Aug 04 '20 at 13:34
  • AVX memory operands don't have to be aligned; you could fold one of the loads into `vpaddd` – Peter Cordes Aug 04 '20 at 14:09
  • @PeterCordes Thanks! I totally missed that. – fuz Aug 04 '20 at 14:14