
I am asking whether it is possible to considerably improve integer matrix multiplication with bitwise operations. The matrices are small, and the elements are small nonnegative integers (small means at most 20).

To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0<=x<15.

The following naive C++ implementation, executed a million times, runs in about 1 s, measured with Linux time.

#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::uniform_int_distribution<> distr(0, 15);

    int A[3][3];
    int B[3][3];
    int C[3][3];
    for (int trials = 0; trials <= 1000000; trials++) {
        // Set up A[] and B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}

Notes:

  1. The matrices are not necessarily sparse.
  2. Strassen-like comments do not help here.
  3. Let's try not to use the circumstantial observation that, in this specific problem, the matrices A[] and B[] can each be encoded in a single 64-bit integer. Think of what would happen for slightly larger matrices.
  4. Computation is single-threaded.

Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?

Matsmath
  • What proportion of the above is actually attributable to the multiply, as opposed to the randomisation? – Oliver Charlesworth May 08 '16 at 10:41
  • Fair comment. Changing the number of trials gives you a fair answer. – Matsmath May 08 '16 at 10:41
  • 1
    But back to the actual question. I think the first (and probably only major) improvement would be to structure this to ensure that the compiler can vectorise it (alternatively, do explicit vectorisation). – Oliver Charlesworth May 08 '16 at 10:44
  • Look at using the hardware to do the work for you: MMX instructions, and be careful about caching. This can make more of a difference than the algorithm choice. – T33C May 08 '16 at 10:48
  • Why do you necessarily need bit hacks? Transpose matrix B to get predictable linear memory access in the innermost loop (then the prefetcher will do all the work with memory) and, if you want, add SSE mul+add instructions – qwwdfsad May 08 '16 at 12:28
  • Assume that I already did all of that. Then I can ask the same question again: would bit-wizardry improve the performance further? – Matsmath May 08 '16 at 12:31
  • @Matsmath: "Bit-wizardry" is likely to be mutually exclusive with SSE, etc. And likely not to be faster - the fastest way to do a bunch of multiplies is to use hardware that's optimised for doing a bunch of multiplies. Even a lookup-table approach is unlikely to be useful - the tables would have to be large, and thus likely to blow from a cache perspective. – Oliver Charlesworth May 08 '16 at 12:43
  • @OliverCharlesworth in [this thread](http://stackoverflow.com/questions/18447321/binary-matrix-multiplication-bit-twiddling-hack) it is claimed that 23x speed improvement was obtained after bit-wizardry with binary matrices. Where does *that* improvement come from? – Matsmath May 08 '16 at 12:52
  • @Matsmath - Well, for a start, that's not about integer arithmetic. So SSE hardware is completely useless. And concepts like carry aren't important - it's by definition a bitwise calculation. And given that the domain is smaller, the encoding of lookup-tables is tractable (see the magic numbers in the accepted answer). – Oliver Charlesworth May 08 '16 at 12:54
  • @OliverCharlesworth: Why do you think SSE/AVX is useless for bit hacks? It has bit-shifts, packed-compares, and boolean AND/OR/XOR/NAND. It also has a byte-granularity shuffle with the shuffle-control being another vector, not a compile-time constant. (SSSE3 `pshufb`). SSE/AVX aren't as powerful as BMI1/BMI2 (esp `pext`/`pdep`) for manipulating elements narrower than a byte, of course. – Peter Cordes May 09 '16 at 22:43

2 Answers


The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.

For adding 2-bit elements, it might be plausible (and faster than unpacking) to add basically from scratch, with XOR (a carryless add), then generate the carry with AND, shift it into place, and mask off any carry that would cross an element boundary.
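
A minimal scalar sketch of that idea (my own illustration, not from the linked answer), for 2-bit lanes packed into a uint64_t, where each lane's sum is only wanted modulo 4:

#include <cstdint>

// Add packed 2-bit lanes of a and b; any carry out of a lane is discarded,
// so each lane is summed modulo 4.
std::uint64_t add_packed2(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t LOW = 0x5555555555555555ULL;  // the low bit of every 2-bit lane
    std::uint64_t sum   = a ^ b;                      // carryless add
    std::uint64_t carry = (a & b & LOW) << 1;         // carries out of the low bits stay inside their lane
    return sum ^ carry;                               // a carry out of the high bit would leave the lane, so drop it
}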

A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulate even a 3-bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For add, you might try using a normal add and then undo the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
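
A closely related SWAR trick (again my own sketch, assuming each lane is only wanted modulo its width) prevents the cross-lane carry up front rather than undoing it afterwards: clear the top bit of every lane, do one ordinary 64-bit add, then put the top bits back with XOR. For 4-bit lanes:

#include <cstdint>

// Add packed 4-bit lanes, each summed modulo 16, using one ordinary add.
std::uint64_t add_packed4(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t HI = 0x8888888888888888ULL;  // the top bit of every 4-bit lane
    std::uint64_t low = (a & ~HI) + (b & ~HI);       // with the top bits cleared, no carry can cross a lane boundary
    return low ^ ((a ^ b) & HI);                     // restore the top bits with a carryless add
}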


packed vs. unpacked-to-bytes storage formats

If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each element ends up in a separate byte of a vector.
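
For instance, a sketch of that unpack with SSE2 (the helper name is mine; it assumes 16 nibbles packed into one 64-bit integer, element i in bits 4*i..4*i+3, on a 64-bit target):

#include <emmintrin.h>
#include <cstdint>

// Spread 16 packed 4-bit elements out to the 16 bytes of an XMM register.
__m128i unpack_nibbles(std::uint64_t packed)
{
    __m128i v    = _mm_cvtsi64_si128((long long)packed);
    __m128i mask = _mm_set1_epi8(0x0F);
    __m128i even = _mm_and_si128(v, mask);                     // elements 0, 2, 4, ...
    __m128i odd  = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  // elements 1, 3, 5, ...
    return _mm_unpacklo_epi8(even, odd);                       // interleave back into original order
}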

Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16bit or 32bit per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4bits per element for storage in an array.
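
One way to do that byte-to-16-bit widening with SSE2 (a sketch; the helper name is mine) is to interleave with zeros:

#include <emmintrin.h>

// Zero-extend the low 8 byte elements of a vector to eight 16-bit elements.
__m128i widen_lo_u8_to_u16(__m128i bytes)
{
    return _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
}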


Compilers suck at this with uint8_t in scalar C code for x86. See the comments on @Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage it leaves outside the low 8 bits of the destination register.

The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.


You're probably going to get best results from some kind of SIMD.

Intel SSSE3 has a vector byte multiply, but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.

More likely to be useful is SSE2 PMADDWD, if you unpack to each matrix element in a separate 16bit vector element. So given a row in one vector, and a transposed-column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].
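
As a rough illustration (my own sketch, not part of the answer), assuming a row of A and a column of B have already been unpacked into the low four 16-bit lanes of two vectors, with the unused fourth lane zeroed:

#include <emmintrin.h>

// Dot product of a 3-element row and a 3-element column held in 16-bit lanes.
int dot3(__m128i row, __m128i col)
{
    __m128i prod = _mm_madd_epi16(row, col);  // pmaddwd: lane0 = a0*b0 + a1*b1, lane1 = a2*b2 + 0*0
    int lo = _mm_cvtsi128_si32(prod);
    int hi = _mm_cvtsi128_si32(_mm_shuffle_epi32(prod, _MM_SHUFFLE(1, 1, 1, 1)));
    return lo + hi;                           // the horizontal add that finishes one C[i][j]
}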

Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.

Peter Cordes

You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices:

#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    // C[][] is assumed to be zero-initialized by the caller.
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}

The Pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matrices per cache line.

UPDATE: curiosity piqued, I wrote a test:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <memory>
#include <random>
#include <typeinfo>
#include <utility>
#include <vector>

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
    : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
    : matrix(std::make_index_sequence<size()>(), li)
    {

    }

    matrix()
    : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:

    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
    : _data { (void(Is), dist(eng))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
    : _data { ((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}


    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;
template<class T, class Engine>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<T>& dist)
{
    v.clear();
    v.reserve(test_size);
    std::generate_n(std::back_inserter(v), test_size,
                    [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
    std::uniform_int_distribution<T> distr(0, 15);  // note: 8-bit T is formally unsupported here, but works on common implementations

    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto& m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;

}

int main() {
    //Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}

example output (recent macbook pro, 64-bit, compiled with -O3)

for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us

summary:

on this platform, int32 and int16 proved to be as fast as each other. int64 and int8 were equally slow (the 8-bit result surprised me).

conclusion:

As ever, express intent to the compiler and let the optimiser do its thing. If the program is running too slowly in production, take measurements and optimise the worst-offenders.

Richard Hodges
  • Given that these are small matrices (thus no individual matrix exceeds the cache, thus each element is loaded precisely once), this improvement is really just down to raw memory bandwidth, rather than matrices per cache line. – Oliver Charlesworth May 08 '16 at 11:34
  • @OliverCharlesworth fair point. However I think the wording either way works. Memory is burst-loaded in cache-line-sized blocks, and indeed is linearly prefetched (in most architectures). The performance increase is likely to be considerable. BTW I should have used an unsigned int8 in case of overflow. Will update. – Richard Hodges May 08 '16 at 11:41
  • Well, this is interesting. I removed the `random` stuff and set `A[]` and `B[]` to some deterministic value (the same value for all `trials`) and actually this new type *decreased* performance as execution time has *risen* from 0.150s to 0.175s. – Matsmath May 08 '16 at 12:30
  • 1
    @Matsmath it's going to depend on whether you're doing the operation a million times on the same matrix, or once for each of a million matrices. Because of optimisation and (moreso) memory caching, performing the same operation on one matrix over and over will not give a useful benchmark. – Richard Hodges May 08 '16 at 14:08
  • True. Perhaps I should precompute the data set in advance. – Matsmath May 08 '16 at 14:30
  • @Matsmath I wrote a test, see updated answer. Individual mileage may well vary. – Richard Hodges May 08 '16 at 14:50
  • Note that `uint_fast16_t` is a 64bit type in x86_64-linux-gcc. fast8 is an 8bit type, but the wider "fast" types are all 64bit, which makes no sense. x86 can load and sign or zero extend any narrow value to 32 or 64 bits as cheaply as a regular load instruction, and Intel CPUs even decode it to a single uop that only uses the load port (no ALU uop necessary, since the load ports have this built-in). (See [Agner Fog's insn tables](http://agner.org/optimize/).). Anyway, `uint_fast8_t` is ok, but don't use `uint_fast16_t`, because it will make a bad choice on x86-64. – Peter Cordes May 09 '16 at 23:20
  • For `uint8_t`, clang and gcc both do some of the multiplies with the `mul r/m8` instruction, which does `ax = al * src`. gcc's code sometimes reads `eax` or `rax` after that partial-register write, leading to a partial-register stall on Intel CPUs before Sandybridge. On Sandybridge, it's just an extra merging uop without a stall, and on Haswell there's no penalty. (Having to move operands into / out of `eax` takes extra instructions, though.) – Peter Cordes May 09 '16 at 23:51
  • On my Core2 (Conroe E6600), g++ 5.2 -O3 -march=native: `type m 60639us, type j 32018us, type t 25464us, type h 29776us`. clang++-3.8 -O3 -march=native: `type m 63246us, type j 30779us, type t 18430us, type h 23202us`. My system has 4MB of last-level cache, and slow DDR2 memory. **What microarchitecture and compiler / version were you testing with?** – Peter Cordes May 09 '16 at 23:54
  • @PeterCordes whatever a modern MacBook Pro is. Some kind of i7 I think. – Richard Hodges May 10 '16 at 08:53
  • @PeterCordes with the latest Xcode. (Apple clang 7?) – Richard Hodges May 10 '16 at 08:56
  • I know Apple uses Intel chips. Intel has been using the i7 name since Nehalem, years ago, so that adds no new information. The big change to CPU internals was the generation after Nehalem (Sandybridge), which made partial-register slowdowns much less of a problem by not stalling while inserting the merging uop. You probably don't have a Nehalem if you describe your macbook as "recent", but that's all I can conclude. Haswell added a 4th integer ALU execution port, which may be relevant here. Isn't there a `lscpu` or `x86info` or `cat /proc/cpuinfo` you can run on OS X? – Peter Cordes May 10 '16 at 18:52
  • Although probably the main answer to 8bit being slower is that there's no 2-operand `imul r8, r8`, only `imul r16, r16` and wider, and clang doesn't just do a wider imul and keep the low 8 bits of the result to get a correctly-wrapped `uint8_t`. So it has to use more instructions getting data into/out of `eax` for `mul`. – Peter Cordes May 10 '16 at 18:55
  • @PeterCordes does this help? $ sysctl -n machdep.cpu.brand_string Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz – Richard Hodges May 10 '16 at 19:10
  • @PeterCordes memory is 16 GB 1600 MHz DDR3 – Richard Hodges May 10 '16 at 19:15
  • Yup, 4xxx means Haswell. Partial-register issues are not a factor on Haswell, so I'm a bit surprised that it runs so slowly for `uint8_t`. clang 3.8 is very new, so probably older versions make worse code. I think the 8bit version isn't memory bound. Probably the 64bit version is, which is why it runs so much slower. (Haswell [has an `imul r64, r64` that's just as fast as `imul r32, r32`](http://agner.org/optimize)). – Peter Cordes May 10 '16 at 19:16
  • @PeterCordes looking at the assembler, clang is expressing the uint8_t version in terms of 8-bit single-argument mulb instructions, whereas the 16 and 32 bit versions are implemented in terms of the 2 or 3-argument imull instruction. The 8-bit version does a lot more housekeeping (moving to/from the ax register). – Richard Hodges May 10 '16 at 19:30
  • Yup, that's exactly what I saw in clang's asm. Note that it *could* have still done [`imul r32,r32` and used the low 8 bits of the result](http://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in/34377712#34377712) and then zero-extended that result. (Or not, since the sum of products is truncated back to 8 bits, so high garbage all the way to there is fine.) Neither gcc nor clang seems to want to do that, though. I did see some 2-operand `imul` instructions in the uint8_t code, though. – Peter Cordes May 10 '16 at 19:34