
Matrix transpose in pure Go is slow, and using the gonum package requires a structure conversion that costs extra time. So an assembly version may be a better solution.

Sizes of the matrix can vary ([][]byte) or be fixed ([64][512]byte), and the element type may be int32 or int64 in more general scenarios.

Below is a Go version:

m := 64
n := 512

// original matrix
M := make([][]byte, m)
for i := 0; i < m; i++ {
    M[i] = make([]byte, n)
}


func transpose(M [][]byte) [][]byte {
    m := len(M)
    n := len(M[0])
    
    // transposed matrix
    T := make([][]byte, n)
    for i := 0; i < n; i++ {
        T[i] = make([]byte, m)
    }

    var row []byte // a row in T
    for i := 0; i < n; i++ {
        row = T[i]
        for j := 0; j < m; j++ {
            row[j] = M[j][i]
        }
    }
    
    return T
}
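
Before dropping to assembly, a cache-blocked variant of the pure-Go version above may already help for large matrices, since it keeps the read and write working sets small. This is only a sketch of that idea applied to the question's code; the tile size is an illustrative guess, not a tuned value:

```go
package main

import "fmt"

// transposeBlocked transposes M (m×n, rows separately allocated, as in the
// question) into a new n×m matrix, walking the input in bsz×bsz tiles so
// that both the source rows and destination rows being touched stay within
// a small working set (cache blocking).
func transposeBlocked(M [][]byte) [][]byte {
	m := len(M)
	n := len(M[0])
	T := make([][]byte, n)
	for i := 0; i < n; i++ {
		T[i] = make([]byte, m)
	}
	const bsz = 16 // illustrative tile size, not tuned
	for ib := 0; ib < m; ib += bsz {
		iMax := ib + bsz
		if iMax > m {
			iMax = m
		}
		for jb := 0; jb < n; jb += bsz {
			jMax := jb + bsz
			if jMax > n {
				jMax = n
			}
			for i := ib; i < iMax; i++ {
				row := M[i]
				for j := jb; j < jMax; j++ {
					T[j][i] = row[j]
				}
			}
		}
	}
	return T
}

func main() {
	M := [][]byte{{1, 2, 3}, {4, 5, 6}}
	fmt.Println(transposeBlocked(M)) // [[1 4] [2 5] [3 6]]
}
```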
Peter Cordes
shenwei356
  • I assume this is SSE/AVX2 again, like your previous questions? Did you search SO for existing transpose Q&As with SIMD shuffles? Compiling + translating already-solved problems to Go asm doesn't sound interesting, so hopefully there's something about this question that makes this different. – Peter Cordes Aug 05 '20 at 03:07
  • Thank you Peter. My searches were limited to golang, so I missed solutions with SIMD; however, most posts transpose a 4x4, 8x8, or 8x32 matrix, which can be optimized with SIMD intrinsics. The case here has a big matrix, but we may split it into small ones. – shenwei356 Aug 05 '20 at 03:31
  • In Go, does `T[i] = make([]byte, m)` mean that each row is separately allocated? That's usually bad for performance; it means getting to the same column(s) of the next row means dereferencing a different pointer, instead of just adding a known row stride. It also means your byte matrix probably isn't contiguous. With 512 rows of 64 bytes (1 cache line) each, that means you have to touch maybe 1.33 or 1.5x as much data in L1d cache: a cache line for the row pointer (which presumably is an object with a couple pointers), and the row data itself. – Peter Cordes Aug 05 '20 at 03:44
  • SSE doesn't have strided loads, so you're not missing out on that, but it's not a good way to store a matrix. – Peter Cordes Aug 05 '20 at 03:45
  • [What is the fastest way to transpose a matrix in C++?](https://stackoverflow.com/q/16737298) has a manually-vectorized SSE (32-bit element size) implementation for arbitrary matrix sizes, using a 4x4 building block. It's cache-blocked for good performance with large matrices. – Peter Cordes Aug 05 '20 at 03:51
  • Of course you could just call a BLAS library function, or copy asm from an open-source optimized BLAS library since apparently Go makes it inefficient to call C functions. At least for the 32-bit and 64-bit transposes. Maybe not byte. – Peter Cordes Aug 05 '20 at 04:11
  • Yes, each row in `[][]byte` is separately allocated, and the source of each row is independent too; converting them to a single array, referring to https://stackoverflow.com/a/16743203/1591876, needs extra time and won't bring a performance benefit. Thank you. – shenwei356 Aug 05 '20 at 05:43
  • I meant you should improve your whole Go program to store your matrices efficiently, in contiguous memory like a 2D C array, of course not convert on the fly to feed a transpose routine. Regardless, [What is the fastest way to transpose a matrix in C++?](https://stackoverflow.com/a/16743203) still works; the actual transpose algorithm (the SIMD shuffles) would work the same regardless of how columns are allocated. AFAICT it doesn't depend on reading past the end of one row and getting data from the start of the next row. – Peter Cordes Aug 05 '20 at 06:00
  • The matrix is extracted from multiple locations in tens of memory-mapped files with a total size of tens of GB, so it's hard to change the layout. But I can control the size of the byte matrix, which acts as a kind of buffer for batch processing. The width and height of the matrix are multiples of 64 to be cache-friendly; this improves the performance a little. Thank you Peter. – shenwei356 Aug 05 '20 at 08:16
  • I suppose the end goal is to compute the positional popcount of every column of the matrix? In this case, it might be a lot faster to simply implement a wider positional popcount operation. – fuz Aug 05 '20 at 08:38
  • Yes, I'm applying the [COBS algorithm](https://github.com/bingmann/cobs) to my [application](https://github.com/shenwei356/unikmer) (unikmer db index/search) with different input and some data-structure differences. COBS uses [_mm_add_epi16](https://github.com/bingmann/cobs/blob/master/cobs/query/classic_search.cpp#L146) for pospopcnt, but there's no native support in golang. Pospopcnt was the bottleneck in searching the index; with your golang asm version of pospopcnt, my app performs 2X faster and is close to COBS now (11s vs 8s). Thank you so much. And now, byte matrix transpose is the bottleneck. – shenwei356 Aug 05 '20 at 09:11
  • @shenwei356 Sounds interesting. Bio-informatics is lots of fun. Try to port one of the transposition algorithms linked above. They'll probably do just fine for your use case. – fuz Aug 05 '20 at 09:22
  • @fuz: If you want a really wide pospopcnt (so you can use it before transpose, and then sum instead of transpose), can we keep sums in each byte element of multiple YMM registers? Like maybe 32-bit broadcast load and `vpsrlvd` by `0,1,2,3,4,5,6,7` to line up different bits with the bottom of a byte, then mask with `set1_epi8(1)`, and `vpaddb` into the right vector accumulator? I haven't fully thought through if this works, but AVX2 variable-shift is efficient on Skylake. (Not great on Haswell). Or maybe just non-bcast mask, add, normal `vpsrld` by 1, repeat with different accumulators. – Peter Cordes Aug 05 '20 at 14:27
  • @PeterCordes I suppose that approach would work. – fuz Aug 05 '20 at 15:10

0 Answers