divide and store quotient and reminder in different arrays

Question

The standard div() function returns a div_t struct as parameter, for example:

/* div example */
#include <stdio.h>      /* printf */
#include <stdlib.h>     /* div, div_t */

int main ()
{
  div_t divresult;
  divresult = div (38,5);
  printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
  return 0;
}

My case is a bit different; I have this

#define NUM_ELTS 21433
int main ()
{
  unsigned int quotients[NUM_ELTS];
  unsigned int remainders[NUM_ELTS];
  int i;

  for(i=0;i<NUM_ELTS;i++) {
      divide_single_instruction(&quotient[i],&reminder[i]);
  }
}

I know that the assembly language for division does everything in single instruction, so I need to do the same here to save on cpu cycles, which is bassicaly move the quotient from EAX and reminder from EDX into a memory locations where my arrays are stored. How can this be done without including the asm {} or SSE intrinsics in my C code ? It has to be portable.

You might try just doing: `quotient[i] = x / y; remainder[i] = x % y;`; with optimization turned up, many compilers will already do what you want. — ShadowRanger, Jan 24 '16 at 16:11
Well `div` code isn't very complicated: https://github.com/Xilinx/eglibc/blob/master/stdlib/div.c — nsilent22, Jan 24 '16 at 16:12
What degree of portability is required? Power PC? SPARC? M680x0? 80386? 8086? M6502? Z80? — Jonathan Leffler, Jan 24 '16 at 16:13
@ShadowRanger can not turst the compiler, because I am unrolling loops, so I need to do like 64 moves from EAX to memory , then about 64 moves from EDX into memory. I don't want the compiler mess my code with loop unrolling, if you would see my code, you would say I am crazy. Just need the plain IDIV instruction converted into simple division function, that's it. — Nulik, Jan 24 '16 at 16:18
@Nurik What is your compiler doing now? Please post the disassembly of the interesting section. — jforberg, Jan 24 '16 at 17:10
@jforberg the compiler is emulating division with bit shifting. What need it to do is to do the division. — Nulik, Jan 24 '16 at 18:47
It may be interesting to know that if the denominators going to be used many times then you can pre-compute some factors (and make these factors into arrays) so that your operations becomes a multiplication and a bit shift which will be much faster than the division instruction. — Z boson, Jan 24 '16 at 19:07
What do you mean by portable? Do you mean to other x86 instructions sets and different OS and different compilers or do you mean portable for example to ARM as well? Many intrinsics are portable to other x86 OS and compilers. If you mean x86 only then I suggest adding the x86 tag. — Z boson, Jan 24 '16 at 19:10
@Z boson , what will happen when I change my code to manage floats? Is the compiler also going to use shift operations ? Almost 30 years gcc is in development and still it can't do such a simple thing like offer an optimized arithmetical instruction! The preprocessor of GCC also sucks. Xxfcdfsdfkjldsf! — Nulik, Jan 24 '16 at 20:38
@nsilent22, isn't that code using the K&R C function conventions? I'm surprised to see that. I guess the code has not changed since 1990 and was written before people used C89 much. — Z boson, Jan 25 '16 at 08:32
@Zboson: Looks like you're right! What a perfect piece of code it has to be then ;) — nsilent22, Jan 25 '16 at 20:23

score 1 · Accepted Answer · edited May 23 '17 at 12:31

Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.

void foo (unsigned *num, unsigned *den, int n) {
    int i;
    for(i=0;i<n;i++) {
        unsigned q = num[i]/den[i], r = num[i]%den[i];   
        num[i] = q, den[i] = r;
    }
}

produces this main loop assembly

.L5:
        movl    (%rdi,%rcx,4), %eax
        xorl    %edx, %edx
        divl    (%rsi,%rcx,4)
        movl    %eax, (%rdi,%rcx,4)
        movl    %edx, (%rsi,%rcx,4)
        addq    $1, %rcx
        cmpl    %ecx, %r8d
        jg      .L5

There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this

for (p = 3; p <= n/p; p += 2)
    if (!(n % p)) return 0;

It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:

for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
    if (!r) return 0;

This speeds up the result by a factor of two.

So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.

score 0 · Answer 2 · answered Jan 24 '16 at 16:45

The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.

Also let's assume you have a good reason for needing to "count cycles".

For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS visual C 2013) will do if I just write out the most naive code I can

struct divresult {
    int quot;
    int rem;
};

struct divresult divrem(int num, int den)
{
    return (struct divresult) { num / den, num % den };
}

int main()
{
    struct divresult res = divrem(5, 2);
    printf("%d, %d", res.quot, res.rem);
}

And the compiler gives us:

    struct divresult res = divrem(5, 2);
    printf("%d, %d", res.quot, res.rem);
01121000  push        1 
01121002  push        2
01121004  push        1123018h  
01121009  call        dword ptr ds:[1122090h] ;;; this is printf()

Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:

int main()
{
    int num, den;
    scanf("%d, %d", &num, &den);
    struct divresult res = divrem(num, den);
    printf("%d, %d", res.quot, res.rem);
}

Now we get:

    struct divresult res = divrem(num, den);
01071023  mov         eax,dword ptr [num]  
01071026  cdq  
01071027  idiv        eax,dword ptr [den]  
    printf("%d, %d", res.quot, res.rem);
0107102A  push        edx  
0107102B  push        eax  
0107102C  push        1073020h  
01071031  call        dword ptr ds:[1072090h] ;;; printf()

So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.

From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.

divide and store quotient and reminder in different arrays

2 Answers2