
This is an optimized implementation of matrix multiplication. The routine performs the operation C := C + A * B, where A, B, and C are n-by-n matrices stored in column-major format. On exit, A and B retain their input values.

void matmul_optimized(int n, int *A, int *B, int *C)
{
    // The entries of A, B, and C are single bits, so the arithmetic is
    // done bitwise: & plays the role of multiplication and ^ the role
    // of addition (i.e. arithmetic over GF(2)).
    int i, j, k;
    int cij;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            cij = C[i + j * n]; // accumulate in a scalar so C is read and written only once per element
            for (k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n]; // & is the per-bit product, ^ the per-bit sum
            }
            C[i + j * n] = cij; // write the accumulated result back into C
        }
    }
}

How can I speed up this matrix multiplication further, starting from the function above?

The function has been tested on matrices up to 2048 by 2048.

matmul_optimized is benchmarked against matmul_reference by the following driver:

#include <stdio.h>
#include <stdlib.h>

#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"

int main()
{
    int i, j;
    int n = 1024; // Number of rows or columns in the square matrices
    int *A, *B;   // Input matrices
    int *C1, *C2; // Output matrices from the reference and optimized implementations

    // Performance and correctness measurement declarations
    long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
    long int COUNTER, REPEAT = 5;
    int difference;
    float speedup;

    // Allocate memory for the matrices
    A = malloc(n * n * sizeof(int));
    B = malloc(n * n * sizeof(int));
    C1 = malloc(n * n * sizeof(int));
    C2 = malloc(n * n * sizeof(int));

    // Fill bits in A, B, C1
    fill(A, n * n);
    fill(B, n * n);
    fill(C1, n * n);

    // Initialize C2 = C1
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C2[i * n + j] = C1[i * n + j];

    // Measure performance of the reference implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_reference(n, A, B, C1);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_ref = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);

    // Measure performance of the optimized implementation
    CLOCK_total = 0;
    for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
    {
        CLOCK_start = cpucycles();
        matmul_optimized(n, A, B, C2);
        CLOCK_end = cpucycles();
        CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
    }
    CLOCK_opt = CLOCK_total / REPEAT;
    printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);

    speedup = (float)CLOCK_ref / (float)CLOCK_opt;

    // Check correctness by comparing C1 and C2
    difference = 0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            difference += (C1[i * n + j] != C2[i * n + j]); // count mismatches so opposite-signed differences cannot cancel

    if (difference == 0)
        printf("Speedup factor = %.2f\n", speedup);
    if (difference != 0)
        printf("Reference and optimized implementations do not match\n");

    //print(C2, n);

    free(A);
    free(B);
    free(C1);
    free(C2);
    return 0;
}
연승현
  • Seems like you need to use OpenMP or OpenCL, i.e. use parallel execution on the GPU (or vectorization on the CPU). See [viennacl](http://viennacl.sourceforge.net/viennacl-examples-dense-matrix.html) – Victor Gubin Mar 19 '19 at 15:52
  • 1
    That depends on various factors, e.g. the size of your matrix and whether you define "fast" in terms of computation (think of parallelism, SIMD, loop blocking etc. which might heavily depend on your target architecture) or in an asymptotical way (think of [Strassen's algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm), for example). Can you provide more information? – andreee Mar 19 '19 at 16:02
  • I have to write an efficient algorithm, something like Eigen or uBLAS, in the first function. – 연승현 Mar 19 '19 at 16:08
  • Should I pack the elements into bytes? – 연승현 Mar 19 '19 at 16:09
  • One more remark: Your benchmark might be tainted due to caching effects. Before performing time measurements on the same data set (in your case, the matrices), you should try to evict any data present in cache. For serious benchmarking, you should stick to libraries such as the [Google Benchmarking Library](https://github.com/google/benchmark), but that's a little off-topic here I admit. – andreee Mar 19 '19 at 16:19
  • I once fiddled a bit with matrix multiplication to get in touch with multi-threading. Finally, I found out that transposing one of the matrices to improve cache locality (in single threading) was as good as all my multi-threading approaches. [SO: Multi-threading benchmarking issues](https://stackoverflow.com/a/52835213/7478597) – Scheff's Cat Mar 19 '19 at 16:29
  • 1
    How are the `&` and `^` supposed to do multiplication and addition? What do you mean when you say your matrices contain bits? This looks to me like it would only ever work in Z₂ and *moreover* if A and B only contained ones and zeroes for entries. If this is the case, using 64 bits for one entry would be quite some waste. – The Vee May 02 '19 at 19:54
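Following up on the last two comments: if the entries really are single bits, packing 64 of them into one `uint64_t` removes the waste and lets one XOR process 64 columns at once. The sketch below is illustrative, not the poster's code; the name `matmul_bitpacked`, the row-major packed layout, and the assumption that n is a multiple of 64 are all mine. Row k of B is XORed into row i of C whenever bit (i,k) of A is set, which is exactly the GF(2) product C ^= A * B.

```c
#include <stdint.h>

/* Hypothetical bit-packed variant: each matrix row is n/64 uint64_t
   words; bit j of row i holds the (i,j) entry. Assumes n % 64 == 0. */
void matmul_bitpacked(int n, const uint64_t *A, const uint64_t *B, uint64_t *C)
{
    int words = n / 64;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            /* test bit k of row i of A */
            if ((A[i * words + k / 64] >> (k % 64)) & 1) {
                /* C[i][*] ^= B[k][*], 64 entries per XOR */
                for (int w = 0; w < words; w++)
                    C[i * words + w] ^= B[k * words + w];
            }
        }
    }
}
```

Compared with the original, each inner XOR now does the work of 64 scalar `^=`/`&` pairs, and the packed matrices are 64 times smaller, which also helps the cache behaviour discussed below.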

2 Answers


You can try an algorithm like Strassen's or Coppersmith–Winograd, and here is also a good example. Or maybe try parallel computing, e.g. with std::async/std::future or std::thread.
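Since the question's code is C rather than C++, the threading suggestion can be sketched with POSIX threads instead of std::thread. The rows of C are independent, so each thread can run the original XOR/AND kernel on its own slice of rows. The names `matmul_threaded`, `worker`, and the fixed `NTHREADS` partitioning are illustrative assumptions, not part of the question's code.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

struct slice { int n, row_begin, row_end; const int *A, *B; int *C; };

/* Each thread runs the same kernel as matmul_optimized over its rows. */
static void *worker(void *arg)
{
    struct slice *s = arg;
    for (int i = s->row_begin; i < s->row_end; i++)
        for (int j = 0; j < s->n; j++) {
            int cij = s->C[i + j * s->n];
            for (int k = 0; k < s->n; k++)
                cij ^= s->A[i + k * s->n] & s->B[k + j * s->n];
            s->C[i + j * s->n] = cij;
        }
    return NULL;
}

void matmul_threaded(int n, const int *A, const int *B, int *C)
{
    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        /* divide rows as evenly as possible across the threads */
        s[t] = (struct slice){ n, t * n / NTHREADS, (t + 1) * n / NTHREADS,
                               A, B, C };
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

No locking is needed because each thread writes a disjoint set of rows of C. Link with -lpthread.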

Sam Xia

Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:

  • First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).

  • Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.

  • Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
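The memory-hierarchy point above can be illustrated with simple loop blocking applied to the question's kernel: compute C in tiles so the slices of A, B, and C touched by one tile stay resident in cache. This is a hedged sketch of the idea, not the carefully tuned kernels the notes below develop; `matmul_blocked` and the tile size `BLK` are assumed names.

```c
/* Hypothetical cache-blocking sketch: the j and k loops are tiled in
   blocks of BLK so each tile's working set fits in L1/L2 cache. */
#define BLK 64

void matmul_blocked(int n, const int *A, const int *B, int *C)
{
    for (int jj = 0; jj < n; jj += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + BLK && j < n; j++) {
                    int cij = C[i + j * n];
                    /* partial accumulation over this k-block; later
                       kk iterations add the remaining terms */
                    for (int k = kk; k < kk + BLK && k < n; k++)
                        cij ^= A[i + k * n] & B[k + j * n];
                    C[i + j * n] = cij;
                }
}
```

The right BLK depends on the target's cache sizes, which is worth measuring rather than guessing.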

A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).

(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)

You may find the following thread relevant: How does BLAS get such extreme performance?