
Consider the following code:

    int bn_div(bn_t *bn1, bn_t *bn2, bn_t *bnr)
  {
    uint32 q, m;        /* Division Result */
    uint32 i;           /* Loop Counter */
    uint32 j;           /* Loop Counter */

    /* Check Input */
    if (bn1 == NULL) return(EFAULT);
    if (bn1->dat == NULL) return(EFAULT);
    if (bn2 == NULL) return(EFAULT);
    if (bn2->dat == NULL) return(EFAULT);
    if (bnr == NULL) return(EFAULT);
    if (bnr->dat == NULL) return(EFAULT);


    #if defined(__i386__) || defined(__amd64__)
    __asm__ (".intel_syntax noprefix");
    __asm__ ("pushl %eax");
    __asm__ ("pushl %edx");
    __asm__ ("pushf");
    __asm__ ("movl %eax, (bn1->dat[i])");
    __asm__ ("xorl %edx, %edx");
    __asm__ ("divl (bn2->dat[j])");
    __asm__ ("movl (q), %eax");
    __asm__ ("movl (m), %edx");
    __asm__ ("popf");
    __asm__ ("popl %edx");
    __asm__ ("popl %eax");
    #else
    q = bn->dat[i] / bn->dat[j];
    m = bn->dat[i] % bn->dat[j];
    #endif
    /* Return */
    return(0);
  }

The data types uint32 is basically an unsigned long int or a uint32_t unsigned 32-bit integer. The type bnint is either an unsigned short int (uint16_t) or a uint32_t, depending on whether 64-bit data types are available. If they are, bnint is a uint32; otherwise it's a uint16. This was done in order to capture carry/overflow in other parts of the code. The structure bn_t is defined as follows:

typedef struct bn_data_t bn_t;
struct bn_data_t
  {
    uint32 sz1;         /* Bit Size */
    uint32 sz8;         /* Byte Size */
    uint32 szw;         /* Word Count */
    bnint *dat;         /* Data Array */
    uint32 flags;       /* Operational Flags */
  };

The function starts on line 300 in my source code. So when I try to compile/make it, I get the following errors:

system:/home/user/c/m3/bn 1036 $$$ ->make
clang -I. -I/home/user/c/m3/bn/.. -I/home/user/c/m3/bn/../include  -std=c99 -pedantic -Wall -Wextra -Wshadow -Wpointer-arith -Wcast-align -Wstrict-prototypes  -Wmissing-prototypes -Wnested-externs -Wwrite-strings -Wfloat-equal  -Winline -Wunknown-pragmas -Wundef -Wendif-labels  -c /home/user/c/m3/bn/bn.c
/home/user/c/m3/bn/bn.c:302:12: warning: unused variable 'q' [-Wunused-variable]
    uint32 q, m;        /* Division Result */
           ^
/home/user/c/m3/bn/bn.c:302:15: warning: unused variable 'm' [-Wunused-variable]
    uint32 q, m;        /* Division Result */
              ^
/home/user/c/m3/bn/bn.c:303:12: warning: unused variable 'i' [-Wunused-variable]
    uint32 i;           /* Loop Counter */
           ^
/home/user/c/m3/bn/bn.c:304:12: warning: unused variable 'j' [-Wunused-variable]
    uint32 j;           /* Loop Counter */
           ^
/home/user/c/m3/bn/bn.c:320:14: error: unknown token in expression
    __asm__ ("movl %eax, (bn1->dat[i])");
             ^
<inline asm>:1:18: note: instantiated into assembly here
        movl %eax, (bn1->dat[i])
                        ^
/home/user/c/m3/bn/bn.c:322:14: error: unknown token in expression
    __asm__ ("divl (bn2->dat[j])");
             ^
<inline asm>:1:12: note: instantiated into assembly here
        divl (bn2->dat[j])
                  ^
4 warnings and 2 errors generated.
*** [bn.o] Error code 1

Stop in /home/user/c/m3/bn.
system:/home/user/c/m3/bn 1037 $$$ ->

What I know:

I consider myself to be fairly well versed in x86 assembler (as evidenced from the code that I wrote above). However, the last time that I mixed a high level language and assembler was using Borland Pascal about 15-20 years ago when writing graphics drivers for games (pre-Windows 95 era). My familiarity is with Intel syntax.

What I don't know:

How do I access members of bn_t (especially *dat) from asm? Since *dat is a pointer to uint32, I am accessing the elements as an array (eg. bn1->dat[i]).

How do I access local variables that are declared on the stack?

I am using push/pop to restore clobbered registers to their previous values so as to not upset the compiler. However, do I also need to include the volatile keyword on the local variables as well?

Or, is there a better way that I am not aware of? I don't want to put this in a separate function call because of the calling overhead as this function is performance critical.

Additional:

Right now, I'm just starting to write this function, so it is nowhere near complete. There are missing loops and other such support/glue code. But the main gist is accessing local variables/structure elements.

EDIT 1:

The syntax that I am using seems to be the only one that clang supports. I tried the following code and clang gave me all sorts of errors:

__asm__ ("pushl %%eax",
    "pushl %%edx",
    "pushf",
    "movl (bn1->dat[i]), %%eax",
    "xorl %%edx, %%edx",
    "divl ($0x0c + bn2 + j)",
    "movl %%eax, (q)",
    "movl %%edx, (m)",
    "popf",
    "popl %%edx",
    "popl %%eax"
    );

It wants me to put a closing parenthesis on the first line, replacing the comma. I switched to using %% instead of % because I read somewhere that inline assembly requires %% to denote CPU registers, and clang was telling me that I was using an invalid escape sequence.

Daniel Rudy
  • Are you aware that the compiler might reorder the `__asm__` statements with respect to other statements? I'm very confident this is not wanted, so use a **single** `__asm__` statement. – too honest for this site Sep 23 '15 at 13:33
  • "The data types uint32 is basically an unsigned long int" No, it is not. It is basically an unsigned integer type guaranteed to be 32 bits wide. – too honest for this site Sep 23 '15 at 13:34
  • I tried to use a single __asm__ statement and the compiler threw it back at me. I'll try again. – Daniel Rudy Sep 23 '15 at 13:59
  • Did you add newlines? – too honest for this site Sep 23 '15 at 14:17
  • Yes, I did. clang gave me all sorts of errors. I'll update the question to reflect this. – Daniel Rudy Sep 23 '15 at 14:27
  • Please read the documentation. I do not know clang, but for gcc, you have to specify the C arguments with additional parameters (and afaik clang is similar). Basically, the strings are passed to the assembler with some textual replacement (if you specify the C parameters) and the assembler obviously has no idea about C constructs. – too honest for this site Sep 23 '15 at 14:36
  • [gcc inline assembly](https://gcc.gnu.org/onlinedocs/gcc-5.2.0/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C) (also used by clang) doesn't check the assembly statement(s). There's a good tutorial [here](http://locklessinc.com/articles/gcc_asm/). – Brett Hale Sep 23 '15 at 17:34
  • I think nobody ever addressed the attempt at a "multi-line" asm statement. A single `asm` statement still only takes one string, but the string can contain newlines. So `asm(` `"foo\n\t"` `"bar\n\t"` ... `: constraints);`. What you've done is write a comma-separated list of string-literals. This is either a massive error itself, or the C comma operator will evaluate to only the last string. What you want is for string literals separated only by whitespace (e.g. on separate lines) to be pasted together at compile time into one. – Peter Cordes Oct 27 '16 at 10:04

1 Answer


If you only need 32b / 32b => 32bit division, let the compiler use both outputs of div, which gcc, clang and icc all do just fine, as you can see on the Godbolt compiler explorer:

uint32_t q = bn1->dat[i] / bn2->dat[j];
uint32_t m = bn1->dat[i] % bn2->dat[j];

Compilers are quite good at CSEing that into one div. Just make sure you don't store the division result somewhere that gcc can't prove won't affect the input of the remainder.

e.g. *m = dat[i] / dat[j] might overlap (alias) dat[i] or dat[j], so gcc would have to reload the operands and redo the div for the % operation. See the godbolt link for bad/good examples.
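A minimal sketch of the "good" pattern described above, assuming a hypothetical helper named `divmod_at` (not part of the original code): both operands are copied into locals before anything is stored, so a store through `quot` cannot change the inputs of the `%`, and the compiler remains free to use one `div` for both results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Loads both operands into locals first, then stores the quotient and
 * remainder. Even if quot or rem points into dat[], the values of a and b
 * are already fixed, so the compiler does not have to reload them between
 * the / and the %. The caller must ensure dat[j] != 0. */
void divmod_at(const uint32_t *dat, size_t i, size_t j,
               uint32_t *quot, uint32_t *rem)
{
    uint32_t a = dat[i];
    uint32_t b = dat[j];

    *quot = a / b;
    *rem  = a % b;
}
```

Whether a given compiler version actually emits a single `div` here is worth confirming by looking at the generated asm.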


Using inline asm for 32bit / 32bit = 32bit div doesn't gain you anything, and actually makes worse code with clang (see the godbolt link).

If you need 64bit / 32bit = 32bit, you probably need asm, though, if there isn't a compiler built-in for it. (GNU C doesn't have one, AFAICT). The obvious way in C (casting operands to uint64_t) generates, in 32-bit code, a call to a 64bit/64bit = 64bit libgcc function, which has branches and multiple div instructions. gcc isn't good at proving that the result will fit in 32 bits, which is what it would need to know to use a single div instruction without risking a #DE (divide error) exception.
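As a rough sketch of the 64bit / 32bit = 32bit case, assuming a hypothetical helper named `udiv64_32` and default AT&T syntax (not `-masm=intel`): the dividend is passed as two 32-bit halves in EDX:EAX, and the `assert` enforces the precondition that makes a single `div` safe. On non-x86 targets it falls back to plain C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Divides the 64-bit value hi:lo by d, returning the 32-bit quotient and
 * storing the remainder in *rem. Requires hi < d, which implies d != 0 and
 * that the quotient fits in 32 bits (otherwise div raises #DE). */
uint32_t udiv64_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    assert(hi < d);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t q, r;
    __asm__ ("divl %[d]"
             : "=a" (q), "=d" (r)          /* quotient in EAX, remainder in EDX */
             : "a" (lo), "d" (hi),         /* dividend is EDX:EAX */
               [d] "rm" (d)                /* divisor in a register or memory */
             : "cc");                      /* div leaves the flags undefined */
    *rem = r;
    return q;
#else
    /* Portable fallback: same result, but the compiler picks the code. */
    uint64_t n = ((uint64_t)hi << 32) | lo;
    *rem = (uint32_t)(n % d);
    return (uint32_t)(n / d);
#endif
}
```

With `NDEBUG` defined the `assert` disappears, so the caller then has to guarantee `hi < d` itself.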

For a lot of other instructions, you can avoid writing inline asm a lot of the time with builtin functions for things like popcount. With -mpopcnt, it compiles to the popcnt instruction (and accounts for the false-dependency on the output operand that Intel CPUs have.) Without, it compiles to a libgcc function call.
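For illustration, a small sketch of the builtin approach using `__builtin_popcount`, which gcc and clang both provide. The wrapper name `popcount32` and the fallback loop for other compilers are assumptions added here, not something from the original code.

```c
#include <stdint.h>

/* Hypothetical wrapper, for illustration only.
 * Counts the set bits in x. With gcc or clang this uses the builtin, which
 * the compiler can turn into a popcnt instruction when the target allows it
 * (e.g. with -mpopcnt). Other compilers get a simple portable loop. */
unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__)   /* clang defines __GNUC__ too */
    return (unsigned)__builtin_popcount(x);
#else
    unsigned n = 0;
    while (x) {
        x &= x - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
#endif
}
```

Because the compiler understands the builtin, a call with a constant argument can be folded at compile time, which an inline asm `popcnt` would prevent.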

Always prefer builtins, or pure C that compiles to good asm, so the compiler knows what the code does. When inlining makes some of the arguments known at compile-time, pure C can be optimized away or simplified, but code using inline asm will just load constants into registers and do a div at run-time. Inline asm also defeats CSE between similar computations on the same data, and of course can't auto-vectorize.


Using GNU C syntax the right way

https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html explains how to tell the compiler which variables you want in registers, and what the outputs are.

You can use Intel/MASM-like syntax and mnemonics, and non-% register names if you like, preferably by compiling with -masm=intel. The AT&T syntax bug (fsub and fsubr mnemonics are reversed) might still be present in intel-syntax mode; I forget.

Most software projects that use GNU C inline asm use AT&T syntax only.

See also the bottom of this answer for more GNU C inline asm info, and the tag wiki.


An asm statement takes one string arg, followed by up to three colon-separated lists: output operands, input operands, and clobbers. The easiest way to make it multi-line is by making each asm line a separate string literal ending with \n, and letting the compiler implicitly concatenate them.

Also, you tell the compiler which registers you want stuff in. Then if variables are already in registers, the compiler doesn't have to spill them and have you load and store them. Doing that would really shoot yourself in the foot. The tutorial Brett Hale linked in comments hopefully covers all this.
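To make the multi-line form concrete, here is a deliberately trivial sketch (a hypothetical `add_two_lines` function, AT&T syntax, with a plain C fallback on non-x86 targets). Adjacent string literals are pasted into one string, and the operands come after it, separated by colons, not commas between the strings.

```c
#include <stdint.h>

/* Hypothetical example, for illustration only: returns a + b (mod 2^32)
 * using two asm lines inside a single asm statement. */
uint32_t add_two_lines(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t sum;
    __asm__ ("movl %[a], %[sum]\n\t"      /* sum = a  */
             "addl %[b], %[sum]"          /* sum += b */
             : [sum] "=&r" (sum)          /* early-clobber: written before b is read */
             : [a] "r" (a), [b] "r" (b)   /* compiler picks the input registers */
             : "cc");                     /* add modifies the flags */
    return sum;
#else
    return a + b;
#endif
}
```

The compiler chooses every register here; the asm never names one, so nothing has to be pushed or popped by hand.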


Correct example of div with GNU C inline asm

You can see the compiler asm output for this on godbolt.

uint32_t q, m;  // this is unsigned int on every compiler that supports x86 inline asm with this syntax, but not when writing portable code.

asm ("divl %[bn2dat_j]\n"
      : "=a" (q), "=d" (m) // results are in eax, edx registers
      : "d" (0),           // zero edx for us, please
        "a" (bn1->dat[i]), // "a" means EAX / RAX
        [bn2dat_j] "mr" (bn2->dat[j]) // register or memory, compiler chooses which is more efficient
      : // no register clobbers, and we don't read/write "memory" other than operands
    );

"divl %4" would have worked too, but named inputs/outputs don't change name when you add more input/output constraints.

Peter Cordes
  • I have never messed with this before. Never had the need to do so, until now. This AT&T syntax for asm is atrocious at best since all the asm work that I've done was using things like MASM and TASM. There are other issues besides this that need to be addressed, but that's a separate question. – Daniel Rudy Sep 26 '15 at 21:09
  • @DanielRudy: well if the other cases are like this one, just let the compiler do the right thing: See the last paragraph of my answer. gcc inline asm is really messy and hard to learn, with the input/output constraints, but at least it lets you write code that isn't stupid like spilling a variable to memory just so MSVC inline asm can load it with `mov`. – Peter Cordes Sep 26 '15 at 21:26
  • @DanielRudy: In many cases, there are compiler intrinsics for bit-scan / rotate / other stuff which saves the trouble of writing them with inline asm. The compiler already knows whether they can take memory operands, and all that stuff. https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins lists the non-vector builtins for things like popcount, which will use the `popcnt` instruction if available. – Peter Cordes Sep 26 '15 at 21:31
  • I did not know that the compiler was smart enough to realize that using / and % in consecutive statements caused the compiler to use both results in one div operation. That pretty much blows the need to use asm right out of the water in the first place. – Daniel Rudy Sep 26 '15 at 22:09
  • @DanielRudy: yeah, I should have emphasized that more in my first version of my answer. I intended to say that in my first paragraph, but ended up only being obvious if you looked at the asm on godbolt, or at the last paragraph. – Peter Cordes Sep 26 '15 at 22:14
  • I think GCC does generate divl. This works for me: https://stackoverflow.com/a/5608636/124486 – Evan Carroll Mar 05 '18 at 04:44
  • @EvanCarroll: That's what the first section of this answer says, too, with a link to Godbolt to prove it. Inline asm is only interesting for `uint64_t / uint32_t` (or other double-width situations) when you know that the quotient won't ever overflow. Trying to tell the compiler that the quotient won't overflow doesn't seem to get it to use a single `div` with the upper half non-zero. (Or `idiv` with the upper half anything other than sign-extended). – Peter Cordes Mar 05 '18 at 04:50
  • @EvanCarroll: That's the second section; are you looking at an old revision of this answer or something? The first section says "*If you only need 32b / 32b => 32bit division, let the compiler use both outputs of div*" – Peter Cordes Mar 05 '18 at 04:56
  • I fully see what you're talking about now, and I overlooked the gcc thing when I read it. that's a great example of when inline assembly is needed. I wonder if there is *any* way to do this without inline assembly.. – Evan Carroll Mar 05 '18 at 05:09
  • @EvanCarroll: With current gcc, not that I know of. In general, an intrinsic that exposes the possibly-faulting nature of div / idiv is one good way. (MSVC has this.) Another way is to teach the compiler how to optimize when it can prove that division can't overflow, e.g. `5 * (uint64_t)a / 111` or something, on a 32-bit machine. i.e. teach the compiler to look for this optimization. (gcc on x86-64 would use a multiplicative inverse, but gcc chooses not to when it would require extended precision.) – Peter Cordes Mar 05 '18 at 05:22