2

I have an embedded project in Rust on the STM32F446 MCU. Consider the next line:

leds::set_g(self.next_update_time % 2000 == 0)

The modulo is used and reading online, it appears that the Cortex M4 doesn't have a modulo instruction. Instead, a function gets added to the binary that does this in software. Using cargo bloat (based on Google's Bloaty), it can be found.

File  .text    Size                 Crate Name
...
0.1%   6.9%    990B     compiler_builtins __udivmoddi4
...

Much to my surprise, it takes just under a kilobyte of memory. I think that's a lot. The code behind it is quite long as well, see this link. I assume this implementation is made to be fast. Luckily I have the memory to spare.

Using opt-level = 'z' doesn't change this.

But what if I couldn't afford this, how could I let it take up less memory?

Of course resorting to a solution like this would work, but then I'd lose the ability to use the % operator.

Geoxion
  • 448
  • 1
  • 4
  • 14
  • As the other answer says: If you can round up the modulus to a power of 2 (e.g. 2048) the test becomes a simple bit-test which should be one instruction even on an embedded system. E.g. if `next_update_time % 2048 == 0` is true, then the lower 15 bits in `next_update_time` are all zero. – user2722968 Nov 26 '19 at 15:53
  • This is expensive in terms of size because you're using a 64-bit data type on a 32-bit machine and you have to divide across both halves. The 32-bit mod operator is much smaller. – bk2204 Nov 27 '19 at 05:20

1 Answers1

3

Not sure how clever the Rust linker is, but in many embedded linker implementations you would be able to swap in your own implementation of __udivmodi4 which used a smaller (but slower) method in preference to the version provided by the compiler.

In general generic division and modulo are expensive on embedded platforms, but division by a constant can often be specialized with a "fixed" implementation by a smart compiler (often with special cases for common divisors - 3, 5, 7, 10, etc).

If you can control the application then changing the code to divide or modulo by 2^N is obviously preferable (it collapses to either a "right shift" instruction for divide, or an "and" instruction for modulo). E.g. in this case 2048 might be acceptably close to 2000, and turns 1 KB of code into 4 bytes of code.

FWIW the Rust version of this does seem a little on the fat side - the GCC implementation for example is much smaller.

solidpixel
  • 10,688
  • 1
  • 20
  • 33