
I need to profile different machine instructions for a project, so I'm running each instruction in a loop of ~200 repetitions (using .rept in an __asm__ directive). The processor I'm using is an ARM Cortex-M4. I now need to test ARM's conditional instructions. If I enter something like

        ".rept 200\n\t"
        "addeq r1, r1, r1\n\t"
        ".endr\n\t"

I get

Error: thumb conditional instruction should be in IT block -- `addeq r1,r1,r1'

Now, IT blocks can have up to 4 instructions, so the best I could do with them is something like

        ".rept 200\n\t"
        "ITTTT EQ\n\t"
        ".rept 4\n\t"
        "addeq r1, r1, r1\n\t"
        ".endr\n\t"
        ".endr\n\t"

yielding a binary like

 80003ae:   bf01        itttt   eq
 80003b0:   1849        addeq   r1, r1, r1
 80003b2:   1849        addeq   r1, r1, r1
 80003b4:   1849        addeq   r1, r1, r1
 80003b6:   1849        addeq   r1, r1, r1

This way, however, 1 in 5 instructions will not be the one I want to profile (adding some noise to the measurements I take). Since I heard that IT blocks are required by the Thumb-2 ISA, and that full ARM can use conditional instructions even without them, my question is: can I instruct the assembler to do without them? Moreover, if I heard correctly and Thumb-2 requires them, is there a way to further reduce the "noise"? (better than 1 in 5 instructions?)

Thanks!


EDIT: I got a lot of useful comments (thanks!), but I realized I left out some important information about my goal; I apologize for that. I'm trying to profile the power consumption of the CPU, so it does make a difference whether the IT block is "executed" or not, what the resulting binary encoding is, etc., while the number of clock cycles needed is not the focus here.

I think this means (but correct me if I'm wrong) that even if Thumb-2 cleverly hides the IT block complexity, I should see a power difference, multimeter at hand.

  • Write the code for the benchmark in assembly and call it from C. – Erik Eidt Feb 16 '23 at 16:08
  • The execution time for all instructions is documented in the Cortex-M4 Technical Reference Manual. – Elliot Alderson Feb 16 '23 at 17:15
  • As per [ARM IT conditional](https://stackoverflow.com/questions/25991476/arm-it-conditional-instruction-assembler-armcc). There I explain with examples the aspect of ARM versus thumb2 conditionals. The information is duplicated in "fuz's" post here. `addeq` is a notational convention in .unified assembler. The instruction encoding does not exist. I suggest that you time 'opcodes'; ie, straight binary values. Your project instructions are faulty at finding faulty instructions and timings. – artless noise Feb 16 '23 at 17:23
  • If you're trying to test throughput, use different registers for the different add instructions; each of your `addeq` reads the result of the previous one, so no CPU could run it at more than 1 instruction per clock cycle. (Unless the `eq` condition was false, or with unusual CPU internals like Pentium 4's double-pumped ALUs with half-cycle latency.) I guess on a scalar CPU, this benchmark could test whether `it` takes a separate cycle to decode vs. working like a "prefix" that's decoded as part of an `addeq`. – Peter Cordes Feb 16 '23 at 21:04
  • Sounds like you're thinking of the fact that in the "classic ARM" A32 architecture, every instruction can be made conditional via bits within its own encoding, not needing any prefix or "blocks". This does you no good because the M4 doesn't support the A32 instruction set, only Thumb. For Thumb they removed the "every instruction conditional" feature because the goal was smaller instructions, and so you get the more restrictive IT/ITT mechanism instead. – Nate Eldredge Feb 17 '23 at 05:33
  • You can ask the *assembler* to use the A32 instruction set with the `.arm` directive, if I recall correctly, but it's pointless if your *CPU* can't decode and execute those instructions. It would try to decode them as Thumb instead, and most likely just crash. – Nate Eldredge Feb 17 '23 at 05:35
  • @ErikEidt you mean compiling a separate file and linking it with the C code? @ElliotAlderson I added an edit to the question. @artlessnoise I think I may still see a difference, though your suggestion of looking at the opcodes is good. @PeterCordes that's something to consider too, thanks. @NateEldredge are you sure the M4 doesn't read A32? Because some instructions actually get assembled to 32-bit words (e.g., `add r0, r0, #256` becomes `add.w r0, r0, #256`, with binary encoding `f500 7080`) – Alessandro Bertulli Feb 17 '23 at 11:15
  • @AlessandroBertulli: I haven't personally tested one, but I'm pretty sure. https://developer.arm.com/Processors/Cortex-M4 mentions only the Thumb/Thumb2 ISA, and indeed A32 is completely written out of the [ARMv7-M spec](https://developer.arm.com/documentation/ddi0403/latest). Current Thumb (aka T32) is a mixed 16/32-bit instruction set so yes, some instructions are 32 bits. But the conditional feature is still gone. – Nate Eldredge Feb 17 '23 at 14:40
  • @AlessandroBertulli: And `f500 7080` is the Thumb encoding of `add.w r0, r0, #256` (encoding T3 in the manual, to be specific). The A32 encoding would be `e2800c01`. – Nate Eldredge Feb 17 '23 at 14:44
  • @AlessandroBertulli, yes, multiple object files, one from assembly, one from C/C++. Inline assembly has its own unique headaches, which can be avoided this way. – Erik Eidt Feb 17 '23 at 15:39

1 Answer

The IT instruction is what makes the subsequent instructions conditional. You'll find that if you remove it, the instructions become unconditional. This is how the instruction encoding works. You can think of the IT block as a prefix to one or more instructions, modifying their behaviour to possibly no longer set flags and to execute conditionally. If you remove the prefix, execution is no longer conditional.

For a benchmark, I'd use IT blocks with one instruction each as that is the most common use case. Some ARM processors have a decoder with special support for this case, parsing the IT instruction and subsequent conditional instruction as one in the usual cases.

The Cortex-M4 on the other hand does something else: if the preceding instruction (!) is a 16 bit instruction, the IT instruction is folded into it and effectively executed in zero cycles. This may solve your measurement problem for the case of measuring 16 bit conditional execution at least.
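As a minimal sketch of the single-instruction IT block layout (not tested on hardware): the helper name `bench_single_it` and the macro `BENCH_BLOCKS` are made up for illustration, and the `__thumb2__` guard is only there so the file still compiles on a non-ARM host. Every `it eq` here directly follows a 16-bit instruction, which is the arrangement where the folding described above could apply, assuming your part behaves that way.

```c
#include <assert.h>

/* Hypothetical knob: number of IT blocks emitted by .rept. */
#define BENCH_BLOCKS 200
#define STR_(x) #x
#define STR(x) STR_(x)

static_assert(BENCH_BLOCKS > 0, "need at least one IT block");

/* Hypothetical helper: BENCH_BLOCKS one-instruction IT blocks in a row. */
static inline void bench_single_it(void)
{
#if defined(__thumb2__)
    __asm__ volatile(
        "cmp r1, r1\n\t"                    /* sets Z, so EQ is true */
        ".rept " STR(BENCH_BLOCKS) "\n\t"
        "it eq\n\t"                         /* IT block of size one */
        "addeq r1, r1, r1\n\t"              /* instruction under test */
        ".endr\n\t"
        :
        :
        : "r1", "cc");                      /* r1 and the flags are modified */
#endif
    /* On non-Thumb-2 targets this function intentionally does nothing. */
}
```

To measure the not-taken case, swapping `cmp r1, r1` for something that clears Z (for example `cmp r1, #0` with a nonzero r1) should be enough. Note that this layout has one IT per conditional instruction, so if the IT is not folded on your core it makes the "noise" ratio worse, not better.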

Another thing you could do is to run the benchmark with IT blocks of different sizes and to then use arithmetic to compute how long the IT instructions took. Then you can remove that time from the total runtime. Generally speaking, conditionally executed instructions take the same time as unconditional instructions, though there may be exceptions (e.g. for control transfer instructions or those that write PC by some other means).
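The arithmetic could look roughly like this. It is a sketch under a simple additive model that I am assuming, not something from the reference manual: total = blocks × (per_it + k × per_op), where k is the number of conditional instructions per IT block. The names `it_costs` and `split_it_cost` are hypothetical, and the totals can be in whatever unit you measure (cycles, or energy for the power use case).

```c
#include <assert.h>

/* Hypothetical result type: costs recovered from two benchmark runs. */
struct it_costs {
    double per_it;  /* cost attributed to one IT instruction */
    double per_op;  /* cost attributed to one conditional instruction */
};

/*
 * Assumed model: total = blocks * (per_it + k * per_op).
 * total_a was measured with IT blocks of size k_a, total_b with size k_b,
 * both with the same number of IT blocks.
 */
static struct it_costs split_it_cost(double total_a, int k_a,
                                     double total_b, int k_b, int blocks)
{
    assert(k_a != k_b && blocks > 0);

    double block_a = total_a / blocks;  /* cost of one block of size k_a */
    double block_b = total_b / blocks;  /* cost of one block of size k_b */

    struct it_costs c;
    /* The IT term cancels in the difference, leaving only the ops. */
    c.per_op = (block_b - block_a) / (k_b - k_a);
    /* Whatever is left of one block after removing its ops is the IT. */
    c.per_it = block_a - k_a * c.per_op;
    return c;
}
```

For example, `split_it_cost(total_1, 1, total_4, 4, 200)` would combine a run of 200 `it` blocks with a run of 200 `itttt` blocks. If the IT instruction is folded on your core, `per_it` should come out close to zero for cycle counts; the model breaks down if the IT cost depends on the block size.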

fuz
  • I think your advice is good (use IT with four conditionals), but it assumes a goal that the OP never stated. Another option is to use the straight binary encoding. This can be achieved with assembler macros that give a human-readable name to the opcode. I think the "aha" is that `addeq` does **NOT** exist as a thumb2 opcode. It is a pseudo-op for just `adds` in thumb2. Which raises the question of which layer is actually being tested: the machine or the assembler? It is not easy (or rational) to test both. – artless noise Feb 16 '23 at 17:27
  • @artlessnoise It also depends on the CPU model. Some implement IT as an instruction, others as a prefix. Some execute the following operations differently, others just skip them. – fuz Feb 16 '23 at 17:30
  • @artlessnoise I am talking about Thumb2 instructions, correct. I am not sure if any Cortex-M cores do so, but some processors parse an IT instruction followed by a single conditional instruction as one big instruction, effectively treating the IT instruction as a prefix. – fuz Feb 16 '23 at 17:59
  • I see. A 16-bit prefix versus ARM's 4 bits. It is consumed by the CPU at the same time. Whether it is multi-issue or a prefix might be questionable, but I get your meaning. The encoding is the same, but the point will be important if the goal is timing. – artless noise Feb 17 '23 at 00:06