
I am implementing a mathematical library in C which heavily uses multiplications. Initially, all my multiplications were done using uint16_t. Recently I changed many of them to uint32_t and saw my code's runtime almost double. I was confused, as I thought that on Intel x64 processors 32-bit and 16-bit multiplications take the same number of clock cycles.

I wrote a diagnostic program; please find it below.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>
#include "cpucycles.c"

#define REPEAT 10000
#define OUT_REPEAT 100000

int main(void){

    uint16_t a_16[REPEAT], b_16[REPEAT], c_16[REPEAT];
    uint32_t a_32[REPEAT], b_32[REPEAT], c_32[REPEAT];
    int32_t i,j;
    uint64_t clock1, clock2, CLOCK16, CLOCK32;
    uint64_t acc=0;
    time_t t;

    srand((unsigned) time(&t));

    clock1=clock2=CLOCK16=CLOCK32=0;

    for(j=0;j<OUT_REPEAT;j++){

        
        for(i=0;i<REPEAT;i++){

            a_16[i]=rand()& ( (1<<13) -1); //need 13-bit integers only
            b_16[i]=rand()& ( (1<<13) -1);

            a_32[i]=rand()&( (1<<19) -1);
            b_32[i]=rand()&( (1<<19) -1); //need 19-bit integers only
        }
        
        clock1=cpucycles();
        for(i=0;i<REPEAT;i++){
            c_16[i]=a_16[i]*b_16[i];
        }
        clock2=cpucycles();
        CLOCK16=CLOCK16+(clock2-clock1);    

        clock1=cpucycles();
        for(i=0;i<REPEAT;i++){
            c_32[i]=a_32[i]*b_32[i];
        }
        clock2=cpucycles();
        CLOCK32=CLOCK32+(clock2-clock1);    

        for(i=0;i<REPEAT;i++){
            acc=(acc+(c_32[i]-(uint32_t)c_16[i])); //this is just to prevent compiler optimization
        }
        printf("Iteration: %d,  acc:%llu\n", j, acc);
        acc=0;
    }

    printf("\n--------------------------------------------\n");
    printf("Time for 16 bit multiplication : %llu\n", CLOCK16/OUT_REPEAT);
    printf("Time for 32 bit multiplication : %llu\n", CLOCK32/OUT_REPEAT);
    printf("\n--------------------------------------------\n");
}

The cpucycles code is from ECRYPT and is given below:

#include "cpucycles.h"

long long cpucycles(void)
{
  unsigned long long result;
  asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
    : "=a" (result) ::  "%rdx");
  return result;
}
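
(A note on this counter: the two .byte directives, 15 and 49, are just the 0F 31 encoding of the rdtsc instruction. On GCC/Clang the same counter can also be read through the __rdtsc() intrinsic; a minimal equivalent sketch, with the name cpucycles_intrin made up here, would be:)

#include <stdint.h>
#include <x86intrin.h>   /* provides __rdtsc() on GCC/Clang */

/* Same time-stamp counter as the ECRYPT routine above, but read via the
   compiler intrinsic instead of hand-encoded opcode bytes. */
static inline uint64_t cpucycles_intrin(void)
{
    return __rdtsc();
}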

Result from one sample run, using a single core with hyperthreading and Turbo Boost disabled:

--------------------------------------------
Time for 16 bit multiplication : 2795
Time for 32 bit multiplication : 4190

--------------------------------------------

and finally an excerpt of my cpuinfo, as given by lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Model name:            Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

Now my questions are:

  1. Is it correct that on x64 platforms 16-bit multiplications take almost half the time of 32-bit multiplications? Or am I doing something wrong?

  2. If yes, can you please show me a reference that justifies this behavior?

I appreciate your help.

  • You should check whether your compiler has done some kind of vectorization. A pure, free-standing mul needs 1 cycle independent of the data size, but maybe your CPU has reordered your multiple muls into something different. – user6556709 Apr 26 '19 at 10:27
  • Potentially related https://stackoverflow.com/questions/692718/how-many-cpu-cycles-are-needed-for-each-assembly-instruction – Jovasa Apr 26 '19 at 10:27
  • @user6556709 Thanks, I will check it. If yes then how should I prevent it? – Rick Apr 26 '19 at 10:30
  • Maybe the load and/or store of 32-bit values takes more time or causes more cache misses. Did you see a difference when doing a 16-bit mul with the 32-bit variables: c_32[i]=((uint16_t)a_32[i]) * (uint16_t)b_32[i]; – SKi Apr 26 '19 at 10:31
  • As always, Agner Fog's instruction tables are the thing to check whenever you really _really_ care about instruction latency and throughput. https://www.agner.org/optimize/instruction_tables.pdf Picking Skylake at random, this shows that latency/throughput for mul is the same for 16- and 32-bit registers. Your differences are coming from somewhere else than the instruction cycle count itself. – Mike Vine Apr 26 '19 at 10:31
  • @Rick With GCC: __attribute__((optimize("no-tree-vectorize"))) at your function or for timing the CPU maybe -O0 would be the better solution. – user6556709 Apr 26 '19 at 10:33
  • @user6556709: Seriously, no. Don't ever suggest `-O0` when it comes to timing C++ code. – MSalters Apr 26 '19 at 11:47
  • @MSalters Seriously, read the code. It is not about testing C++... – user6556709 Apr 26 '19 at 11:52
  • Typo, that, and irrelevant. The same holds for C. – MSalters Apr 26 '19 at 11:55
  • @MSalters It is also not about testing C. And again read the code and what he wants to do. – user6556709 Apr 26 '19 at 11:58
  • Hello everyone, I checked the assembly and yes, it is due to the vectorization. The compiler was using 256-bit ymm registers. Hence for uint16_t it can perform 16 multiplications at a time, while for uint32_t it can perform 8 multiplications at a time. That explains the difference. – Rick Apr 26 '19 at 11:59
  • After using -fno-tree-vectorize the times are almost the same; the uint32_t one is still slightly larger, but that is most probably due to loading and storing. Thank you all again :-) – Rick Apr 26 '19 at 12:02
  • @user6556709: don't be ridiculous, `-O0` will completely destroy performance and give *different* bottlenecks than optimized scalar code. The gcc/clang option you're looking for is `-O3 -fno-tree-vectorize`. – Peter Cordes Apr 26 '19 at 19:13
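
For reference, a minimal sketch of the per-function GCC attribute mentioned in the comments above, which keeps full optimization for the rest of the translation unit but disables auto-vectorization for one loop (the function name scalar_mul16 is purely illustrative):

#include <stdint.h>
#include <stddef.h>

/* Ask GCC not to auto-vectorize this particular function; the
   equivalent whole-file option is -O3 -fno-tree-vectorize. */
__attribute__((optimize("no-tree-vectorize")))
void scalar_mul16(uint16_t *c, const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}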

1 Answer


Is it correct that on x64 platforms 16-bit multiplications take almost half the time of 32-bit multiplications? Or am I doing something wrong?

No, that isn't correct by itself. More precisely, it's not what you have actually tested.

Your short loops are trivially vectorizable, so that's just what the compiler did. Depending on the target CPU generation, there is a 128-, 256- or 512-bit vector type available which can be partitioned into different word sizes (8-bit, 16-bit, 32-bit, 64-bit, 128-bit) and which can then perform vectorized multiplications on multiple elements at once. Not just the multiplication: loading and storing the numbers from and to memory is also fully vectorized and doesn't operate on single elements.

To put it simply, you can fit twice as many 16-bit integers into the same vector register as 32-bit integers. And your code actually isn't limited by the multiplications either; it is limited purely by loads and stores. So what you have really measured is that 16-bit integers are half as big as 32-bit integers: when the code is vectorized and bound by loads/stores, you can process twice as many elements in the same time.
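
To illustrate what the auto-vectorizer roughly turns your loops into, here is a hand-written sketch using AVX2 intrinsics (not the compiler's actual output; it needs -mavx2 and assumes the element count is a multiple of 16 respectively 8):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>   /* AVX2 intrinsics */

/* One 256-bit ymm register holds 16 x uint16_t, so each iteration
   loads, multiplies and stores 16 elements. */
void mul16_avx2(uint16_t *c, const uint16_t *a, const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_mullo_epi16(va, vb));
    }
}

/* The same register holds only 8 x uint32_t, so each iteration moves
   half as many elements through the same loads, multiply and store. */
void mul32_avx2(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_mullo_epi32(va, vb));
    }
}

Both functions do the same work per iteration; the 32-bit version simply needs twice as many iterations for the same number of elements, which matches the roughly 2x difference you measured.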

If you wish to benchmark a specific instruction (in this case a single-element multiplication), you would need to use that specific instruction explicitly via inline assembly. You would also need to be aware of all the side effects and preconditions which affect performance; pipelined, superscalar architectures are not trivial to benchmark in general.
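
As an example, a minimal sketch of timing a dependent chain of scalar 32-bit multiplies with inline assembly, reusing the cpucycles() counter from the question (the helper name time_imul32_chain is made up here; replacing uint32_t with uint16_t and imull with imulw gives the 16-bit variant):

#include <stdint.h>

extern long long cpucycles(void);   /* the rdtsc wrapper from the question */

/* Time n dependent 32-bit imul instructions. Each result feeds the next
   multiply, so this measures latency rather than throughput; loop and
   timer overhead are still included, so treat the numbers as relative. */
uint64_t time_imul32_chain(uint32_t x, uint32_t y, int n)
{
    uint64_t t0 = (uint64_t)cpucycles();
    for (int i = 0; i < n; i++)
        asm volatile("imull %1, %0" : "+r"(x) : "r"(y));
    uint64_t t1 = (uint64_t)cpucycles();
    asm volatile("" :: "r"(x));      /* keep x alive so nothing is dropped */
    return t1 - t0;
}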

Otherwise the compiler is expected to optimize (vectorize, fold, inline etc.) your code as much as possible.

Ext3h
  • Considering the goal ("a mathematical library which heavily uses multiplications"), vectorization is exactly what you want. It makes sense to look at the generated assembly for critical code segments to see where the compiler failed to vectorize, _and treat that as a problem you need to solve_. – MSalters Apr 26 '19 at 11:50
  • @MSalters Yes it is true. But the specification specifies no use of vectors. – Rick Apr 26 '19 at 11:56
  • @Rick: Which specification, and more importantly, **why**? It's 2019. Even embedded CPUs now have vector units, and NVidia will even sell you a SoC with multiple matrix units. – MSalters Apr 26 '19 at 12:03
  • @MSalters yes it is true. I don't understand either :-|. It is an internal specification. – Rick Apr 26 '19 at 12:07
  • I would not feel particularly bound by that part of the specification, given that you can't realistically buy processors without vector units anymore. And x64 has had vector math since the start; every x64 compiler can safely assume that it can use SSE2. – MSalters Apr 26 '19 at 12:13
  • @Rick if there is actually such a requirement, make sure you understood it correctly. Make sure you didn't just confuse "don't vectorize by hand" with "prevent vector instructions". The first one is reasonable; you should never attempt that unless it is absolutely unavoidable. The second one is complete nonsense, unless your target architecture is Pentium III or older. – Ext3h Apr 26 '19 at 13:35
  • Hi All, Yes I will ask. Thanks for the help. :-) – Rick Apr 26 '19 at 15:52