uint16_t fn1 (uint8_t a, uint8_t b) { return (a+b); }
uint32_t fn2 (uint32_t a, uint32_t b) { return (a+b); }

Which is faster on an 8-bit and on a 32-bit microcontroller?
Could you show equivalent assembly code to demonstrate the difference?

For example, take the 8-bit Renesas 78K family of microcontrollers with the CA78K0 compiler and the 32-bit Renesas RX600 family with the CC-RX compiler.

Chandan Kumar
  • SO is not a place to dump homework. It would be nice to tell us what C compiler you use, and/or which microcontroller. What do you think is the correct answer? Why do you think so? – Tommylee2k Jan 03 '18 at 07:37
  • If you do the test properly you will find a range of results, not a concrete single answer for each implementation. Is this question related to the efficiency of matching the operands to the variable sizes (although you didn't)? – old_timer Jan 03 '18 at 11:39
  • You need to show the experiments you ran and your results, your expectations, and then ask why the results didn't match what you expected for this to be a Stack Overflow question. – old_timer Jan 03 '18 at 11:40

2 Answers


In general, using a type that matches the register width leads to the most efficient code. (Or at least one of the native operand sizes, e.g. 32 bits is usually fine on most 64-bit architectures.)

Narrower types are almost never more efficient (unless you have an array of them, in which case cache footprint / memory bandwidth favour narrow types). But depending what you're doing, they might not be any less efficient.
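For instance (a rough sketch; the array and function names here are mine, not from the question): bulk data is where narrow types pay off, while scalars such as the accumulator and loop counter are usually best kept at the native width.

#include <stdint.h>

uint8_t samples[256];            /* 256 bytes instead of 1 KiB as uint32_t */

uint32_t sum_samples(void)
{
    uint32_t sum = 0;            /* native-width accumulator on a 32-bit MCU */
    for (uint32_t i = 0; i < 256; i++)
        sum += samples[i];       /* zero-extending byte load, then a 32-bit add */
    return sum;
}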


In your updated question with specific trivial functions, the answer is very similar to the first version of the question (with empty functions): inlining one add is cheaper than the overhead of a call, so it's profitable for the compiler to always inline this function, and hopefully it will unless you need to pass its address as a callback or something.

A non-inline version of this uint16_t = uint8_t + uint8_t function depends mostly on the calling convention, on a 32-bit CPU.

If the args can have high garbage (in the top 24 bits), you have to zero-extend each input separately, because you need a correct 9-bit result, not just 8, and garbage in bit 8 or above of either input would corrupt it.

If the 8-bit args are already zero-extended into 32-bit registers (as in ARM32's standard calling convention), or you have to load them from memory anyway (so you can use zero-extending byte loads), then no extra work is needed for either function on a 32-bit CPU: a 32-bit add will correctly produce the 16-bit result. ARM is another good example of the effect of the calling convention on the cost of narrow integer operations, and of the need to zero out high garbage when inlining into something more complex than an add.

If the return type had also been uint8_t, ARM would need an extra instruction to zero-extend the add result (to clear carry into the 9th bit).
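To illustrate the last two paragraphs, here is a minimal sketch (the function names are mine) you could feed to an ARM compiler at -O2 and disassemble: with the standard AAPCS calling convention the uint16_t-returning version is expected to compile to a single add, while the uint8_t-returning version should need an extra masking/extension instruction.

#include <stdint.h>

/* The 9-bit result fits in the uint16_t return type: on ARM32 (AAPCS) the
   byte args arrive already zero-extended, so this is expected to be just
       add r0, r0, r1
       bx  lr                                                              */
uint16_t add_u8_to_u16(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Truncating back to 8 bits forces the compiler to clear the carry-out
   (the 9th bit), e.g. an extra  and r0, r0, #255  or  uxtb r0, r0        */
uint8_t add_u8_to_u8(uint8_t a, uint8_t b)
{
    return a + b;
}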

There is no "general" answer to the question for other functions. It might or might not cost extra to use narrow function args / return values on 32-bit CPUs. It depends on the function, and the calling convention, and the instruction set. (You mentioned something in comments about a calling convention that packs narrow args into separate bytes of a single register on the 32-bit CC-RX compiler? That sounds inefficient, and would penalize narrow args a lot. That's unusual, though. x86-64 System V only does that for narrow struct members when passing a struct by value, not for separate args.)

Purely 8-bit CPUs like AVR need an ADD plus three ADCs to add uint32_t args, so it's obviously much more expensive there, as the sketch below illustrates. (And note that int is 16 bits on AVR, so only use uint32_t if you really need 32.)
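As a conceptual sketch (hand-written C showing the lowering, not actual compiler output), a single uint32_t addition on an 8-bit target amounts to four byte-wide adds with the carry propagated between them:

#include <stdint.h>

uint32_t add32_as_bytes(uint32_t a, uint32_t b)
{
    uint8_t  result[4];
    unsigned carry = 0;

    for (int i = 0; i < 4; i++) {                  /* low byte first */
        unsigned abyte = (uint8_t)(a >> (8 * i));
        unsigned bbyte = (uint8_t)(b >> (8 * i));
        unsigned sum   = abyte + bbyte + carry;    /* ADD, then ADC, ADC, ADC */
        result[i] = (uint8_t)sum;
        carry     = sum >> 8;                      /* carry into the next byte */
    }
    return  (uint32_t)result[0]        |
           ((uint32_t)result[1] << 8)  |
           ((uint32_t)result[2] << 16) |
           ((uint32_t)result[3] << 24);
}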


The following was an answer to the first version of the question, when the functions were empty. A later edit changed the question to ask what the OP apparently intended to ask in the first place.


They're equally fast, because both empty functions inline and optimize away to no instructions. The call site doesn't even have to calculate the arg values, so the multi-register args of fn2 don't actually cost 4 instructions per arg.

You are putting trivial functions in header files, or using link-time optimization, right?

And yes, even for a microcontroller where code-size is critical, it is profitable for a compiler to inline functions when their body is smaller than instruction setup + the call. Thus, it is reasonable to expect a compiler to make this optimization even with -Os or whatever.
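A minimal sketch of what that looks like in practice (hypothetical names; `static inline` in a header, or LTO, gives the compiler visibility into the body):

#include <stdint.h>

static inline uint16_t fn1(uint8_t a, uint8_t b)
{
    return a + b;
}

uint16_t caller(uint8_t x)
{
    /* Expected to compile, even at -Os, to the surrounding code plus one
       add: no call, no argument setup. */
    return fn1(x, 3);
}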

Peter Cordes
  • I'm not positive, but I think what he is getting at is the default register size and whether there is any cost to access a `uint8_t` where the default register size is `32` and vice versa. Essentially saying that `{...}` really have stuff in them. I could be reading it wrong though. Anyway, that is a great answer for the question -- as written. – David C. Rankin Jan 03 '18 at 08:11
  • @DavidC.Rankin you got it right, I have edited with a few more details. – Chandan Kumar Jan 03 '18 at 08:46
  • In that case (and note, I'm no micro-controller or compiler expert), the issue is more one of code generation by the compiler than of a general speedup or penalty from using a full or partial register operation. (It's clear that for 32-bit operations on an 8-bit box, more registers or instructions are required than for comparable 8-bit operations, e.g. a native 8-bit `add`, etc.). I'll have to let the experts fill in more than that. – David C. Rankin Jan 03 '18 at 09:07
  • @DavidC.Rankin I understand that if 32-bit data has to be processed by an 8-bit processor, the compiler will generate more code than it would for a 32-bit processor. But what would happen in the opposite case, when a 32-bit processor has to process 8-bit data? – Chandan Kumar Jan 03 '18 at 09:23
  • @DavidC.Rankin you are probably reading it well, problem is that it's not answerable that way. To reason about performance you have to have valid code first, some tasks/algorithms will be faster with 8 bit arguments, and some with 32 bit arguments, both on 8 and on 32 bit CPU. So Peter's answer is technically correct sort-of-trolling... or more like trying to get the OP on the right path of bothering with real problems... OP: anyway, the fastest code is the one which does not exist. That's actually not a joke, but real advice. Also such code tends to be bug-free. – Ped7g Jan 03 '18 at 09:35
  • @ChandanKumar when a 32-bit processor has to access 8-bit data, it **usually** does convert it to 32-bit values. As you are dealing with C and not assembly, it's an internal thing of the compiler; a particular source with a particular compiler and options will generate particular code. Any combination is possible. Better compilers may even notice situations where packed usage is possible and use a 32b register for four 8b values, etc... Your question without real code is pretty much "anything is possible"/"too broad". – Ped7g Jan 03 '18 at 09:40
  • @ChandanKumar: Usually `uint8_t` is very cheap on a 32-bit CPU. [Many common operations work with garbage in the high bits](https://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in), so the compiler doesn't have to zero-extend between every operation, only before stuff like right-shift or division. On a load/store machine (like ARM), you don't even lose out on the possibility of using a memory source operand (x86 can use memory source operands even with 8-bit or 16-bit operands, but probably some would need an 8->32 load.) – Peter Cordes Jan 03 '18 at 09:57
  • @Ped7g: Yes, the original question was too broad if you interpret it the way I think it was intended, so instead I answered the literal question as written to point out the problem :P It's still not a great question; as is so often the case with optimization "it depends", perhaps on the surrounding code, but at least on the specific code in the function, as you say. I don't think you can say that one way is *always* faster than the other, although native register size is often good; bigger operands often only hurt when they take more memory bandwidth / cache footprint, or from a SIMD / SWAR standpoint. – Peter Cordes Jan 03 '18 at 10:00
  • For example, we can consider this function as an adder which will add 'a' and 'b' and return 'sum'. As per the 32-bit CC-RX compiler document, in general, when I call this function on a 32-bit processor, 'a' will be pushed onto the 1st byte of a 32-bit register and 'b' onto the 2nd byte. Now to add 'a' and 'b', the compiler must generate instructions with bit masks to process them as 32-bit data. So, I think it will add extra instructions and take more time to execute, but will save some space on the stack for further function calls. – Chandan Kumar Jan 03 '18 at 10:13
  • @ChandanKumar the adder will get inlined. And with the surrounding code being aware of it, the values may already be in registers in a format favourable for a simple `add` instruction, with the result used without further processing, without bit-masking it down to 8 bits. It really depends on everything around it. Generally using the native width is better, but there's little point in forcefully using a 32b-wide char on a 32b CPU for string processing, for example; that would hurt performance. And if the algorithm does need 32b precision, it's worth it, even on an 8b CPU. – Ped7g Jan 03 '18 at 10:28
  • @ChandanKumar to get good (close to optimal) performance you first need a good algorithm and data structure fitting the target platform architecture. For things where 8b is enough, a well designed structure will make 8-bit code faster even on a 32b CPU. When the structure is wrongly designed, the 8b data processing on a 32b CPU will be slower than a version simply replacing 8b types with 32b types. The 8b CPU will always need extra instructions for 32b types, but if the task needs 32b, it may be better to leave 32b types in the code and let the C compiler deal with it, instead of doing 32b math with 8b types. – Ped7g Jan 03 '18 at 10:31
  • @Ped7g Could you give an example to justify my question, as you are saying it will depend upon the surrounding code? If we pass a string of 20 chars and we want to add '1' to each char and store them in a global string, how will the data processing take place? – Chandan Kumar Jan 03 '18 at 10:40
  • @ChandanKumar with well designed code... with a fixed 20-byte size and an aligned string start address you can do `((32b *)string)[0..4] += 0x01010101;` on a 32b CPU, i.e. in this case (adding 1 to each) processing it as a 32b type enclosing "packed" 8b values is better (see the sketch after this comment thread). This actually assumes the strings are valid 7b ASCII, i.e. +1 will not overflow (didn't realize that assumption until I re-read my comment after posting; it's too automatic: as I heard "chars", I assume an ASCII string, unless you specify it more). – Ped7g Jan 03 '18 at 10:49
  • @ChandanKumar to make it clear, it annoys me you are trying somehow to get simple answer and changing the question, thinking you can get simple answer, while there is no simple answer at all. Pick some real task, write C code, then post it for code review. If the task is complex enough, most likely somebody will point out better algorithm / data structure completely jumping above your low-level problems with 8b vs 32b. That's interesting only on very critical code paths with thousands/millions of iterations per second, and must be treated case by case. **Generally** use native data size. – Ped7g Jan 03 '18 at 10:55
  • @Ped7g I got it, my question was general and not specific to any particular scenario; I was trying to make it clearer by editing it. Actually, I was asked this question in an interview recently. I will do some experiments with multiple scenarios to understand it in more detail. Thanks for the inputs. – Chandan Kumar Jan 03 '18 at 11:05
  • @ChandanKumar I was once asked at interview about undefined behaviour, IIRC `y = f(x++, ++x);` and the other side did expect particular answer.... it happens. And generally the native type is usually better, but at the moment you have particular code, you can forget about general. ;) :) good luck, doing some experiments will surely help too. – Ped7g Jan 03 '18 at 11:19
  • @PeterCordes don't know why we keep going round and round on this; it is easy to demonstrate that there can be (or is) a performance hit with a 32-bit architecture doing 8-bit operations, sign extensions and masking are required eventually. Same goes for the heavy overhead of doing 32 bit on an 8-bit machine. x86 is the exception not the rule. – old_timer Jan 03 '18 at 15:52
  • @ChandanKumar - the problem is that the question can't be answered "in general" in any simple way. The CPU doesn't execute C source code: it goes through a compiler first, and so how the functions above translate to machine code depends entirely on the surrounding code. In almost any interesting, properly designed application the function will be inlined, so even looking at the assembly generated for the "plain" function isn't too informative. – BeeOnRope Jan 03 '18 at 18:53
  • @old_timer: note that the OP's current `fn1` returns a 16-bit integer, so no masking of carry-out is required. ARM can compile it to one instruction with the standard calling convention. – Peter Cordes Jan 03 '18 at 21:53
  • @ChandanKumar: updated my answer to answer the new question. – Peter Cordes Jan 03 '18 at 21:53
  • @PeterCordes yep, I commented on differences like that in my answer. I suspect the homework assignment though is the investigation and understanding that there is (or can be) a cost to variable sizes, big or small; experience and experiments help. – old_timer Jan 03 '18 at 23:46
  • nothing about this question has anything to do with inlining. If the function happens to be in the same optimization domain as a user of it, then sure, the compiler may choose to do that, but if linked separately that generally doesn't happen (it can with llvm if you are crafty with your build and how and when you optimize). Likely the exercise is to see that for 8 bit you need two registers or the stack to return and it takes one adc if present in the architecture; for 32 bit it takes more. Going the other way, is there a cost (big regs, small math)? – old_timer Jan 03 '18 at 23:49
  • @old_timer *"the heavy overhead of doing 32b on an 8b"* - but if your task requires 32-bit precision, then the overhead has to be paid somewhere, even if you were to write it with 8b types. Out of context the 8b surely wins easily on an 8b CPU, but once you consider the full task, and tax the 8b-type usage by building 32b precision with additional code, it's quite likely (50+% IMO) a cleanly written 32b-type C source will give the compiler a better chance to produce a more optimal result. Unless you are at expert asm level for the particular CPU, building the 32b math yourself isn't better. – Ped7g Jan 04 '18 at 09:38
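Below is a minimal sketch of the "packed" trick from Ped7g's comment above (the buffer and function name are made up; it assumes 7-bit ASCII so no per-byte sum overflows into its neighbour, and uses memcpy to sidestep the alignment/aliasing problems of a raw pointer cast):

#include <stdint.h>
#include <string.h>

char buf[20];                           /* fixed-size "string", 7-bit ASCII */

void add_one_to_each_byte(void)
{
    for (int i = 0; i < 20; i += 4) {   /* five 32-bit adds instead of 20 byte adds */
        uint32_t word;
        memcpy(&word, &buf[i], 4);
        word += 0x01010101u;            /* +1 in every byte lane */
        memcpy(&buf[i], &word, 4);
    }
}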

Just try it for yourself. Isn't that the whole point of the homework assignment?

unsigned char fun1 ( unsigned char a, unsigned char b)
{
    return (a+b);
}  
unsigned int fun2 ( unsigned int a, unsigned int b)
{
    return (a+b);
}
unsigned long fun3 ( unsigned long a, unsigned long b)
{
    return (a+b);
}

Don't have/want your compiler, but it should be obvious before you start this exercise what is going to happen: if you mismatch the high-level language variable/operand size to the architecture, either the architecture has a solution directly, or it has to mask or cascade.

I have shown an array of different architectures here:

00000000 <fun1>:
   0:   e0800001    add r0, r0, r1
   4:   e20000ff    and r0, r0, #255    ; 0xff
   8:   e12fff1e    bx  lr

0000000c <fun2>:
   c:   e0800001    add r0, r0, r1
  10:   e12fff1e    bx  lr

00000000 <_fun1>:
   0:   1166            mov r5, -(sp)
   2:   1185            mov sp, r5
   4:   9d40 0004       movb    4(r5), r0
   8:   9d41 0006       movb    6(r5), r1
   c:   6040            add r1, r0
   e:   1585            mov (sp)+, r5
  10:   0087            rts pc

00000012 <_fun2>:
  12:   1166            mov r5, -(sp)
  14:   1185            mov sp, r5
  16:   1d40 0004       mov 4(r5), r0
  1a:   6d40 0006       add 6(r5), r0
  1e:   1585            mov (sp)+, r5
  20:   0087            rts pc

00000022 <_fun3>:
  22:   1166            mov r5, -(sp)
  24:   1185            mov sp, r5
  26:   1d40 0004       mov 4(r5), r0
  2a:   1d41 0006       mov 6(r5), r1
  2e:   6d40 0008       add 10(r5), r0
  32:   6d41 000a       add 12(r5), r1
  36:   0b40            adc r0
  38:   1585            mov (sp)+, r5
  3a:   0087            rts pc

00000000 <fun1>:
   0:   86 0f           add r24, r22
   2:   08 95           ret

00000004 <fun2>:
   4:   86 0f           add r24, r22
   6:   97 1f           adc r25, r23
   8:   08 95           ret

0000000a <fun3>:
   a:   62 0f           add r22, r18
   c:   73 1f           adc r23, r19
   e:   84 1f           adc r24, r20
  10:   95 1f           adc r25, r21
  12:   08 95           ret

00000000 <fun1>:
   0:   4f 5e           add.b   r14,    r15 
   2:   30 41           ret         

00000004 <fun2>:
   4:   0f 5e           add r14,    r15 
   6:   30 41           ret         

00000008 <fun3>:
   8:   0e 5c           add r12,    r14 
   a:   0f 6d           addc    r13,    r15 
   c:   30 41           ret         

0000000000000000 <fun1>:
   0:   12001c21    and w1, w1, #0xff
   4:   0b200020    add w0, w1, w0, uxtb
   8:   d65f03c0    ret
   c:   d503201f    nop

0000000000000010 <fun2>:
  10:   0b010000    add w0, w0, w1
  14:   d65f03c0    ret

00000000 <fun1>:
   0:   00851021    addu    $2,$4,$5
   4:   03e00008    jr  $31
   8:   304200ff    andi    $2,$2,0xff

0000000c <fun2>:
   c:   03e00008    jr  $31
  10:   00851021    addu    $2,$4,$5

See what the compilers in the homework assignment produce for your specific code. I modified it a little bit (and fixed the bugs); my input and output sizes match, and having them mismatch just adds more fun to this assignment.

EDIT

Yes, 8 bits plus 8 bits is 9 bits, and that fits just fine in 16 or 18 or 24, 32, 36, 64, etc., although a number of those architectures don't have a lot of registers and work through memory, where you get different advantages.
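The C source for the ufun/sfun listings below isn't shown in the answer; presumably it was something like this (a guess based on the symbol names and the generated code: the same byte-wide add, once unsigned and once signed, returned in a wider type so the 9-bit sum survives):

unsigned int ufun(unsigned char a, unsigned char b)
{
    return (a+b);
}
int sfun(signed char a, signed char b)
{
    return (a+b);
}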

00000000 <ufun>:
   0:   e0800001    add r0, r0, r1
   4:   e12fff1e    bx  lr

00000008 <sfun>:
   8:   e0800001    add r0, r0, r1
   c:   e12fff1e    bx  lr


0000000000000000 <ufun>:
   0:   12001c21    and w1, w1, #0xff
   4:   0b200020    add w0, w1, w0, uxtb
   8:   d65f03c0    ret
   c:   d503201f    nop

0000000000000010 <sfun>:
  10:   13001c21    sxtb    w1, w1
  14:   0b208020    add w0, w1, w0, sxtb
  18:   d65f03c0    ret
  1c:   d503201f    nop

00000000 <ufun>:
   0:   03e00008    jr  $31
   4:   00851021    addu    $2,$4,$5

00000008 <sfun>:
   8:   03e00008    jr  $31
   c:   00851021    addu    $2,$4,$5

00000000 <ufun>:
   0:   4f 4f           mov.b   r15,    r15 
   2:   4e 4e           mov.b   r14,    r14 
   4:   0f 5e           add r14,    r15 
   6:   30 41           ret         

00000008 <sfun>:
   8:   8f 11           sxt r15     
   a:   8e 11           sxt r14     
   c:   0f 5e           add r14,    r15 
   e:   30 41           ret         

00000000 <ufun>:
   0:   70 e0           ldi r23, 0x00   ; 0
   2:   26 2f           mov r18, r22
   4:   37 2f           mov r19, r23
   6:   28 0f           add r18, r24
   8:   31 1d           adc r19, r1
   a:   82 2f           mov r24, r18
   c:   93 2f           mov r25, r19
   e:   08 95           ret

00000010 <sfun>:
  10:   06 2e           mov r0, r22
  12:   00 0c           add r0, r0
  14:   77 0b           sbc r23, r23
  16:   26 2f           mov r18, r22
  18:   37 2f           mov r19, r23
  1a:   28 0f           add r18, r24
  1c:   31 1d           adc r19, r1
  1e:   87 fd           sbrc    r24, 7
  20:   3a 95           dec r19
  22:   82 2f           mov r24, r18
  24:   93 2f           mov r25, r19
  26:   08 95           ret

00000000 <_ufun>:
   0:   1166            mov r5, -(sp)
   2:   1185            mov sp, r5
   4:   9d41 0004       movb    4(r5), r1
   8:   45c1 ff00       bic $-400, r1
   c:   9d40 0006       movb    6(r5), r0
  10:   45c0 ff00       bic $-400, r0
  14:   6040            add r1, r0
  16:   1585            mov (sp)+, r5
  18:   0087            rts pc

0000001a <_sfun>:
  1a:   1166            mov r5, -(sp)
  1c:   1185            mov sp, r5
  1e:   9d41 0004       movb    4(r5), r1
  22:   9d40 0006       movb    6(r5), r0
  26:   6040            add r1, r0
  28:   1585            mov (sp)+, r5
  2a:   0087            rts pc

00000000 <ufun>:
   0:   952e                    add x10,x10,x11
   2:   8082                    ret

00000004 <sfun>:
   4:   952e                    add x10,x10,x11
   6:   8082                    ret
old_timer
  • That's ARM and what? Not AVR? – Peter Cordes Jan 03 '18 at 21:29
  • A variety of different instruction sets; which ones don't really matter, exercise for the reader if interested. The point being, how the problem is solved tends to follow a couple-three patterns: mask before, mask after, or have special instructions that essentially mask (do a byte add against 16-bit registers). Likewise on the other side, generally add then adc as needed, not really much of a choice otherwise (well, if you have no C flag and no ADC then you have to solve it some other way). – old_timer Jan 03 '18 at 21:39