32-bit signed integer multiplication without using 64-bit data type

Question

I want to do 32-bit signed integer multiplication without using a 64-bit data type. My inputs are in Q1.31 (both) format.

input1 = A32 (Ah Al) - higher, lower half's of A32
input2 = B32 (Bh Bl) - higher, lower half's of B32

Result should be in Q1.31 format, leave the overflow case.

I need C code. Please provide the explanation with formats also.

Anyway, multiply Ah and Bh together, and shift left the right amount to get decimal point to right place. Then multiply Al and Bl, and shift right the right amount to get decimal point to right place. Then add these two together. Basic multiplication like doing it on paper, basically. — hyde, Apr 03 '14 at 18:49

njuffa · Accepted Answer · 2015-02-27T09:20:16.620

11

Signed Q1.31 format is a fully fractional format capable of representing operands between -1 and almost +1. The scale factor is 2³¹. This means that when each Q1.31 operand is stored in a 32-bit signed integer, we can generate the Q1.31 product by computing the full double-width product of the signed integers, then right shifting the result by 31 bits. The right shift is necessary because the product includes the scale factor twice, and the shift acts as a division that removes one instance of the scale factor.

We can compute the double-width product of two 32-bit integers by separately computing the upper and lower 32 bits of the full product. The lower 32 bits are computed trivially as the ordinary product of the two inputs. To compute the upper 32 bits, we need to write a function mul32hi(). In order to avoid using a wider type (i.e. one using more than 32 bits) in intermediate computations, we need to split the original operands into halves, compute their partial products, and then sum these partial products appropriately.

Note that various processors provide a hardware instruction that implements the functionality of mul32hi(). In this case one would want to use an appropriate intrinsic, or a bit of inline assembly code if no intrinsic exists, rather than use the emulation code presented here.

It helps to reduce the problem first to the corresponding unsigned multiplication, umul32hi(), then derive the signed result from that via the definition of 2's complement representation (which is assumed in the following C code):

#include <stdint.h>

/* compute the upper 32 bits of the product of two unsigned 32-bit integers */
uint32_t umul32hi (uint32_t a, uint32_t b)
{
    /* split operands into halves */
    uint32_t al = (uint16_t)a;
    uint32_t ah = a >> 16;
    uint32_t bl = (uint16_t)b;
    uint32_t bh = b >> 16;
    /* compute partial products */
    uint32_t p0 = al * bl;
    uint32_t p1 = al * bh;
    uint32_t p2 = ah * bl;
    uint32_t p3 = ah * bh;
    /* sum partial products */
    uint32_t cy = ((p0 >> 16) + (uint16_t)p1 + (uint16_t)p2) >> 16;
    return p3 + (p2 >> 16) + (p1 >> 16) + cy;
}

/* compute the upper 32 bits of the product of two signed 32-bit integers */
int32_t mul32hi (int32_t a, int32_t b)
{
    return umul32hi (a, b) - ((a < 0) ? b : 0) - ((b < 0) ? a : 0);
}

/* compute the full 64-bit product of two signed 32-bit integers */
void mul32wide (int32_t a, int32_t b, int32_t *rhi, int32_t *rlo)
{
    *rlo = a * b;           /* bits <31:0> of the product a * b */
    *rhi = mul32hi (a, b);  /* bits <63:32> of the product a * b */
}

/* compute the product of two signed Q1.31 fixed-point numbers */    
int32_t mul_q_1_31 (int32_t a, int32_t b)
{
    int32_t hi, lo;
    mul32wide (a, b, &hi, &lo);
    /* Q1.31 is scaled by 2**31, trim out scale factor */
    return (int32_t)(((uint32_t)hi << 1) | ((uint32_t)lo >> 31));
}

I interpreted the request to "leave the overflow case" to mean to ignore overflow. As a consequence, multiplying -1 (0x80000000) by -1 (0x80000000) with mul_q_1_31() will return -1 (0x80000000).

edited Feb 27 '15 at 09:20

answered Apr 03 '14 at 19:39

njuffa

23,970
4
78
130

Nice solution but I think you could have put some emphasis into efficiency. I think you do one more multiplication than necessary. When you calculate `umul32hi` you can construct `lo` as well without doing `a*b`. I think it would also be interesting to create a function `umul32hi(uint32_t a, uint32_t b, uint32_t lo)` using only three multiplications. – Z boson Feb 27 '15 at 08:53
@Zboson: Some of the computation for `rlo` and `rhi` could be shared, I agree. I am not sure what you mean by "three multiplications" Are you thinking about Karatsuba? I wrote the code the way I did because on many platforms `mul32hi()` may simply be an intrinsic that maps directly to a hardware instruction (or one could use inline assembly if such an instruction exists but no intrinsic that wraps it). I should have mentioned that in my answer, but forgot. – njuffa Feb 27 '15 at 09:03
No, I'm not referring to Karatsuba? I mean that if you already have `lo` that you don't need to construct `p0 = al * bl`. You have to calculate the carry in a different way then. – Z boson Feb 27 '15 at 09:07
The reason I care is that I'm thinking of a SIMD algorithm for 128x128 and Q32.96 and there is no hardware for SIMD mul64hi (and no SIMD FLAGS register) so I'm trying to minimize the number of mults and adds. That's why this question is interesting. – Z boson Feb 27 '15 at 09:10
I am sure it is doable, but it is not quite so straightforward, as `rlo` consists of `p0` as well as the lower eight bits of `p1` and `p2`. Will it be faster overall? I am not sure, as 16-bit and 32-bit integer multiplies are fast on most platforms these days. You may want to post a new question regarding your specific use case with its specific restrictions. – njuffa Feb 27 '15 at 09:17
You can also do `lo = ((p1 & 0xffff) << 16) + ((p2 & 0xfff) <<16) + p0` then you don't have to do `*rlo = a * b;` inside `mul32wide`. You can either calculate `lo` inside `umul32hi` or you pass `lo` to the function. Either way only four multiplications are necessary. – Z boson Feb 27 '15 at 09:23
See http://www.hackersdelight.org/hdcodetxt/mulhs.c.txt and http://www.hackersdelight.org/hdcodetxt/muldws.c.txt – Z boson Feb 27 '15 at 09:27
Re "only four multiplications are necessary": I am aware of that as I have written such wide-multiply emulations at assembly code level for multiple platforms in the course of my professional work. I already explained why I modularized this code the way I did. On most modern processors `mul32wide()` should only be two instructions MULLO and MULHI, except that C/C++ do not offer a standard interface to the latter. Consider posting your own answer with maximum optimizations. I'd be happy to upvote it. – njuffa Feb 27 '15 at 09:36
I made a question instead https://stackoverflow.com/questions/28766755/compute-the-doubleword-product-signed-of-two-words-given-the-lower-word-produc – Z boson Feb 27 '15 at 13:56
2

Your answer saved me! https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit/28827013#28827013 I wish I could upvote you again. How did you come up with the formula `hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0)`? – Z boson Mar 03 '15 at 08:28
3

@Zboson: It follows directly from two's complement complement representation. E.g. 32-bit integers -n and -m are represented as unsigned numbers `x=2**32-n`, `y=2**32-m`. If you multiply those you have `x*y = 2**64- 2**32*n - 2**32*m + n*m`. The middle terms indicate the necessary corrections to the upper half of the product. Working through a simple example using -1*-1 should prove very instructive. – njuffa Mar 03 '15 at 15:29
I found this answer while trying to extend the similar answer https://stackoverflow.com/a/1815371/2430597 for signed integers. Great solution, highly appreciated! – plasmacel Jun 25 '18 at 16:14

score -1 · Answer 2 · answered Apr 04 '14 at 14:44

The following site purports to offer a 32-bit implementation of an ANSI-C library for operating with Q-numbers in general: http://www.ti.com/tool/sprc542

I have not tried this and cannot vouch for what it will or will not do for you, but there is a link at that site where you can request the source code.

32-bit signed integer multiplication without using 64-bit data type

2 Answers2

Linked