
My environment:

  • Xilinx Zynq (based on ARM Cortex-A9)
  • PetaLinux V2014.2

I am developing a Linux application on Zynq using PetaLinux.

My current question concerns the processing time of the four arithmetic operations (+, -, *, /).

I measured the processing time with clock_gettime() using the following code.

For addition(+):

static void funcToBeTimed_floatAdd(void)
{
    int idx;
    float fval = 0.0;
    for(idx=0; idx<100; idx++) {
        fval = fval + 3.14;
    }
}

For division(/):

static void funcToBeTimed_floatDiv(void)
{
    int idx;
    float fval = 314159000.00;
    for(idx=0; idx<100; idx++) {
        fval = fval / 1.001;
    }
}
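
The multiplication function funcToBeTimed_floatMulti() called by the timing code below is not shown in the post; presumably it follows the same pattern, along the lines of this hypothetical sketch:

/* Hypothetical sketch of the multiplication variant (not shown in the post) */
static void funcToBeTimed_floatMulti(void)
{
    int idx;
    float fval = 1.0;
    for(idx=0; idx<100; idx++) {
        fval = fval * 1.001;
    }
}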

For time measurement, the following code is used. procNo is set from the command-line arguments in main(int argc, char *argv[]).

static void disp_elapsed(int procNo)
{
    struct timespec tp1, tp2;
    long dsec, dnsec;

    /***/
    switch(procNo) {
    case 0:
        printf("add\n");
        clock_gettime(CLOCK_REALTIME, &tp1);
        funcToBeTimed_floatAdd();
        clock_gettime(CLOCK_REALTIME, &tp2);
        break;
    case 1:
        printf("multi\n");
        clock_gettime(CLOCK_REALTIME, &tp1);
        funcToBeTimed_floatMulti();
        clock_gettime(CLOCK_REALTIME, &tp2);
        break;
    default:
        printf("div\n");
        clock_gettime(CLOCK_REALTIME, &tp1);
        funcToBeTimed_floatDiv();
        clock_gettime(CLOCK_REALTIME, &tp2);
        break;
    }

    dsec = tp2.tv_sec - tp1.tv_sec;
    dnsec = tp2.tv_nsec - tp1.tv_nsec;
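    /* normalize: borrow one second if the nanosecond difference is negative */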
    if (dnsec < 0) {
        dsec--;
        dnsec += 1000000000L;
    }

    printf("Epalsed (nsec) = %ld\n", dnsec);
}
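
As a side note, CLOCK_REALTIME can jump when the system clock is adjusted (e.g. by NTP), so CLOCK_MONOTONIC is usually preferable for interval timing; it also helps to fold the seconds and nanoseconds into a single 64-bit count so the tv_sec part is not lost. A minimal sketch of such a helper (hypothetical, not part of the measurement code above):

#include <time.h>

/* Hypothetical helper: time a function with CLOCK_MONOTONIC and return
   the elapsed time as a single 64-bit nanosecond count. */
static long long elapsed_ns(void (*fn)(void))
{
    struct timespec tp1, tp2;
    clock_gettime(CLOCK_MONOTONIC, &tp1);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &tp2);
    return (tp2.tv_sec - tp1.tv_sec) * 1000000000LL
         + (tp2.tv_nsec - tp1.tv_nsec);
}

Usage would then be, for example, printf("add: %lld ns\n", elapsed_ns(funcToBeTimed_floatAdd));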

As a result, the processing times for addition (+) and division (/) were both around 2500 nsec.

Generally, division is more costly than addition, I think, but there was not much difference in this case.

I would like to know

  • What kind of optimization is applied on ARM
  • Keywords to search for further information on this kind of optimization
  • (If any) mistakes in the code used to measure the processing time (e.g. how to prevent the compiler from optimizing the loop away)
  • Try unrolling a little, addition should get faster and division won't. – Ben Voigt Oct 06 '14 at 02:21
  • To get better help, also post the code you are using to do the timing. For example, [this](http://stackoverflow.com/questions/26190364/is-it-legal-for-a-c-optimizer-to-reorder-calls-to-clock) is a potential issue. – M.M Oct 06 '14 at 03:01
  • 32-bit float is only accurate to 6-7 digits. That means using 32-bit float you cannot even store the value 314159265.35 to its radix point – phuclv Oct 06 '14 at 03:35
  • @BenVoigt Thank you for the keyword "unroll". – sevenOfNine Oct 06 '14 at 04:05
  • @MattMcNabb Thank you for your comment and the link. I added the function for time measurement. I will read the link. – sevenOfNine Oct 06 '14 at 04:10
  • @LưuVĩnhPhúc Thank you for pointing out the 7-digit limit. I corrected the code. – sevenOfNine Oct 06 '14 at 04:11
  • Any reasonable compiler will optimize your functions to a single return of a constant, doing all the operations at compile time. If you want to time the instruction, you need to be sure you're actually using them... – Chris Dodd Oct 06 '14 at 05:11
  • In fact it won't even turn them into a single return of a constant, it will just return, because that function is not doing anything. http://goo.gl/AeOq0U - Return fval from the functions, then they should be ok. – auselen Oct 06 '14 at 06:28
  • @ChrisDodd Thank you for your comment. @auselen Thank you for the comment and the link. It helps. By the way, I do not use the -O2 or -O options, so that I can see the slowest case. – sevenOfNine Oct 06 '14 at 07:45
  • 3
    There will be a *lot* of useless load and store cycles without optimization turned on, these will dominate the timing IMO. – Turbo J Oct 06 '14 at 12:19
  • @sevenOfNine: Measuring without compiler optimization enabled is just useless. – Ben Voigt Oct 06 '14 at 13:07
  • Does it matter that this is ARM? Did you try it on an x86 machine? I would prefer removing the arm tag. – auselen Oct 06 '14 at 13:44
  • @BenVoigt O.K. I will use optimization. – sevenOfNine Oct 07 '14 at 00:16
  • @auselen I tried it on CentOS 6.5 on an x86 machine (Core i7-3770). The result was that division took about twice as long (2100 nsec) as addition (1120 nsec). On the other hand, the same code on ARM (Zynq) shows similar processing times for division and addition, as posted. I wonder whether there is some special optimization for ARM. That's why I added the arm tag. – sevenOfNine Oct 07 '14 at 00:25
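
To illustrate the unrolling suggestion from the comments above, here is a rough hypothetical sketch (not code from the thread): using several independent accumulators breaks the loop-carried dependency, so the additions can overlap in the FPU pipeline, while a chain of dependent divisions cannot.

/* Hypothetical unrolled variant of the addition loop */
static float floatAdd_unrolled(int n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int idx;
    /* four independent accumulators; adds in different chains can overlap */
    for (idx = 0; idx < n; idx += 4) {
        a0 += 3.14f;
        a1 += 3.14f;
        a2 += 3.14f;
        a3 += 3.14f;
    }
    return a0 + a1 + a2 + a3;  /* returning the result keeps the loop alive */
}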

1 Answer


There may be several problems with your code:

  • You are not passing any argument to your function, so optimization will probably precalculate its result (or remove the loop entirely, since the result is never used).
  • There is a large overhead from calling the timing functions and from calling your functions, so the slowdown is not visible.
  • The granularity of the timer you use (try a granularity test).
  • You are using floats for the results, but you are performing all the operations in doubles: 3.14 is a double, 3.14f is a float.
  • 100 iterations is too few to see anything meaningful; increase the iteration count until the execution time reaches at least 1 second.
  • You can disassemble these functions to see what is really executed.
  • Are you compiling it with hardware floating point support?
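
Putting several of these points together, a corrected benchmark might look roughly like the sketch below (this is only an illustration of the suggestions above, not code from the thread): the iteration count is passed in from the command line, the result is returned and printed so the loop cannot be optimized away, the literals are floats, and the run is long enough to be well above the timer granularity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sketch only: the loop count comes from the command line and the result
   is returned and printed, so the compiler cannot precompute or delete
   the loop; 3.14f keeps the arithmetic in single precision. */
static float floatAdd(int n)
{
    int idx;
    float fval = 0.0f;
    for(idx=0; idx<n; idx++) {
        fval = fval + 3.14f;
    }
    return fval;
}

static float floatDiv(int n)
{
    int idx;
    float fval = 314159000.0f;
    for(idx=0; idx<n; idx++) {
        fval = fval / 1.001f;
    }
    return fval;
}

int main(int argc, char *argv[])
{
    /* default iteration count chosen so one run takes on the order of a second */
    int n = (argc > 1) ? atoi(argv[1]) : 100000000;
    struct timespec t1, t2;
    float r;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    r = floatAdd(n);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    printf("add: %lld ns (result %f)\n",
           (t2.tv_sec - t1.tv_sec) * 1000000000LL + (t2.tv_nsec - t1.tv_nsec), r);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    r = floatDiv(n);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    printf("div: %lld ns (result %f)\n",
           (t2.tv_sec - t1.tv_sec) * 1000000000LL + (t2.tv_nsec - t1.tv_nsec), r);

    return 0;
}

Whether the arithmetic actually runs on the VFP unit depends on the toolchain settings; flags along the lines of -mfpu=vfpv3 and -mfloat-abi=hard (or softfp) are worth checking, although the exact options depend on the PetaLinux toolchain configuration.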
  • Thank you for various indications. – sevenOfNine Oct 06 '14 at 07:39
  • I may try passing an argument to avoid precalculation. As for calling the functions, what do you recommend? Regarding the timer resolution, I would like to know the order of magnitude, not the detailed processing time, so I used clock_gettime(). – sevenOfNine Oct 06 '14 at 07:42
  • I will consider making the calculation float, not double. As for compilation, I see no -O2, -O, or other hardware floating-point option set in the Makefile. – sevenOfNine Oct 06 '14 at 07:44
  • OK, I edited the answer. Precision is not important, because it is in nanoseconds for you. What matters is granularity, the "length of one tick". Try to increase the execution time to at least 1 s by increasing the number of iterations. You are probably using software floating-point calculations, and they are very slow. You can try some sort of optimization to minimize the overhead of everything other than the floating-point calculations. – j123b567 Oct 06 '14 at 08:37
  • Thank you very much for the further information. I tried the granularity test. On ARM (Zynq) it was "clock_gettime: <= 2706ns gettimeofday: <= 3000ns". On the x86 machine (Core i7-3770) it was "clock_gettime: <= 180 ns. gettimeofday: 1000ns". – sevenOfNine Oct 07 '14 at 00:45
  • I modified the code, increasing the execution time to more than 1 sec, and put the code at http://ideone.com/vYWdjG. As a result, on ARM (Zynq), addition takes 1600sec while division takes 1400sec. There is still something I do not correctly understand (the reason division is faster than addition). Anyway, I appreciate all of the information everyone has given on my question. – sevenOfNine Oct 07 '14 at 01:32
  • In the modified code, I put "volatile". Without "volatile", the code at ideone.com/vYWdjG shows 2500 nsec for both addition and division. So the same execution time for both calculations may be caused by the optimizer skipping the loop or something, I think. – sevenOfNine Oct 07 '14 at 01:42
  • I turned off the optimization with "CFLAGS += -O0" in the Makefile. Then the result was 1.71 sec for addition and 3.32 sec for division on Zynq (ARM). This is what I expected. – sevenOfNine Oct 07 '14 at 04:02
  • @sevenOfNine add assembly listing of -O2 case. – auselen Oct 07 '14 at 05:53
  • @auselen Thank you for the comment. The assembly listing with -O0 is http://ideone.com/VUx64n, and the one with -O2 is http://ideone.com/YQ1q1p. Both were compiled from this C source: http://ideone.com/vYWdjG. I am not good at assembly language. – sevenOfNine Oct 07 '14 at 06:26
  • @sevenOfNine See my first comment on the question. Those functions are meaningless with -O2. You're trying to benchmark a machine but don't know how it works. That's what you need to fix. – auselen Oct 07 '14 at 07:23
  • @auselen In your first comment, you showed me the link to goo.gl/AeOq0U. On that page, I see that with -O2 there is only "bx lr". So this means it just returns from the function, as you commented. – sevenOfNine Oct 07 '14 at 07:36
  • 1
    This is what you probably want to measure http://goo.gl/o4gcmh adding `volatile` keyword adds so big overhead. – j123b567 Oct 07 '14 at 13:25
  • The granularity test shows you that you can't measure a time shorter than 2500 ns. Everything shorter will be measured as that time. – j123b567 Oct 07 '14 at 13:34
  • Thank you very much for your comments, j123b567. The code you showed me is exactly what I was looking for. Now I understand that without "return fval", the -O2 option makes the code just do "bx lr". And thank you for the notice that volatile has a big overhead, which I did know. Thank you also for the comment on the granularity. – sevenOfNine Oct 07 '14 at 23:43
  • I updated the code: http://ideone.com/MSUTIB. Now the results with -O0 and -O2 are the same. Thanks to all of you. – sevenOfNine Oct 08 '14 at 00:12
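
For reference, the granularity test mentioned in the comments can be approximated with a probe like the following sketch (hypothetical, not the code used in the thread): it reads the clock back to back and reports the smallest non-zero step observed, which puts a floor on how short an interval can be measured meaningfully.

#include <stdio.h>
#include <time.h>

/* Rough granularity probe: read the clock twice in a row many times and
   report the smallest non-zero difference seen between the two reads. */
int main(void)
{
    struct timespec a, b;
    long long d, min_step = -1;
    int i;

    for (i = 0; i < 1000000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        d = (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
        if (d > 0 && (min_step < 0 || d < min_step))
            min_step = d;
    }
    printf("smallest observed clock_gettime step: %lld ns\n", min_step);
    return 0;
}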