Could anyone explain the difference in the partial-product reduction method or mechanism between Wallace and Dadda multipliers?

I have been reading *A comparison of Dadda and Wallace multiplier delays*.

Wallace VS Dadda

kevin998x

1 Answer


Both are very similar. Instead of the traditional row-based algorithm, both aim to implement a multiplication A*B by 1/ ANDing A with each bit b_i of B, 2/ counting the bits in every column until there are only two rows left, and 3/ performing the final addition with a fast adder.
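These three steps can be sketched as a behavioural model in Python (this just mimics the dataflow, not the gates; all function names are mine, for illustration only):

```python
def partial_product_columns(a, b, n):
    """Step 1: AND a with every bit b_i of b; collect the 'dots' by column weight."""
    cols = [[] for _ in range(2 * n)]
    for i in range(n):           # bit of b
        for j in range(n):       # bit of a
            cols[i + j].append(((a >> j) & 1) & ((b >> i) & 1))
    return cols

def reduce_columns(cols):
    """Step 2: apply (3,2) counters (full adders) until every column has <= 2 dots."""
    while max(len(c) for c in cols) > 2:
        new = [[] for _ in range(len(cols) + 1)]
        for w, col in enumerate(cols):
            col = list(col)
            while len(col) >= 3:                      # one full adder
                x, y, z = col.pop(), col.pop(), col.pop()
                new[w].append(x ^ y ^ z)              # sum stays at weight w
                new[w + 1].append((x & y) | (x & z) | (y & z))  # carry goes to w+1
            new[w].extend(col)                        # pass leftovers through
        cols = new
    return cols

def final_add(cols):
    """Step 3: sum the two remaining rows (Python's '+' stands in for a fast adder)."""
    total = 0
    for w, col in enumerate(cols):
        for bit in col:
            total += bit << w
    return total
```

Since every full adder preserves the weighted sum of its inputs, `final_add(reduce_columns(partial_product_columns(a, b, n)))` returns `a * b` for any n-bit operands.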

I worked on a Dadda multiplier, but that was many, many years ago, and I may not remember all the details. To the best of my knowledge, the main difference is in the counting process.

Wallace introduced the "Wallace tree" structure (which is still useful in some designs). Given n bits, it counts the number of bits set to 1 in the set. An (n,m) Wallace tree (where m = ceil(log_2 n)) takes n input bits and outputs the number of 1s among them on m bits. It is essentially a combinational counter. For instance, below is the schematic of a (7,3) Wallace tree made of full adders (which are themselves (3,2) Wallace trees).

[schematic of a (7,3) Wallace tree built from full adders]

As you can see, this tree generates result bits of logical weights 2^0, 2^1 and 2^2, if the input bits are of weight 2^0.


This allows a fast reduction of the column heights, but can be somewhat inefficient in terms of gate count.
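As a sanity check, the (7,3) tree described above can be modelled in Python (a behavioural sketch with names of my own choosing, wired as in the schematic):

```python
def full_adder(a, b, c):
    """A full adder is a (3,2) counter: returns (sum, carry)
    with a + b + c == sum + 2 * carry."""
    s = a ^ b ^ c
    cy = (a & b) | (a & c) | (b & c)
    return s, cy

def wallace_7_3(x):
    """(7,3) counter built from four full adders: counts the ones
    among 7 input bits and outputs the result on 3 bits (weights 1, 2, 4)."""
    s0, c0 = full_adder(x[0], x[1], x[2])
    s1, c1 = full_adder(x[3], x[4], x[5])
    b0, c2 = full_adder(s0, s1, x[6])   # weight-1 result bit
    b1, b2 = full_adder(c0, c1, c2)     # weight-2 and weight-4 result bits
    return b0, b1, b2
```

Exhaustively checking all 128 input combinations confirms that `b0 + 2*b1 + 4*b2` always equals the number of input bits at 1.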

Luigi Dadda did not use such an aggressive reduction strategy and instead tried to keep the column heights more balanced. Only full adders (or half adders) are used, and every counting/reduction step only generates bits of weight 2^0 and 2^1. The reduction process is less efficient (as can be seen from the larger number of rows in your figure), but the gate count is better. Dadda's strategy was also supposed to be slightly less time efficient but, according to the enclosed paper (which I did not know), this is not true.
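Dadda's balanced strategy comes from a sequence of target heights d_1 = 2, d_{j+1} = floor(3/2 * d_j), i.e. 2, 3, 4, 6, 9, 13, ...; each stage only reduces columns down to the next target, using as few full/half adders as possible. A small sketch of that sequence (hypothetical helper name, Python):

```python
def dadda_heights(n):
    """Dadda's target column heights for an n-high partial-product matrix:
    d1 = 2, d_{j+1} = floor(3/2 * d_j), listed up to the first value >= n.
    The number of entries below n is the number of reduction stages needed."""
    d, seq = 2, [2]
    while d < n:
        d = (3 * d) // 2
        seq.append(d)
    return seq
```

For an 8x8 multiplier the tallest column has 8 dots, so the targets are 6, 4, 3, 2: four reduction stages, each one generating only weight-2^0 and weight-2^1 bits as described above.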

The main interest of Wallace/Dadda multipliers is that they can achieve a multiplication in ~log n time, which is much better than the traditional O(n) array multiplier with carry-save adders. But, despite this theoretical advantage, they are not really used any longer. Present architectures are more concerned with throughput than latency and prefer simpler array structures that can be efficiently pipelined. Implementing a Wallace/Dadda structure is a real nightmare beyond a few bits, and pipelining them is very complex due to their irregular structure.

Note that other multiplier designs also achieve log n time complexity, with a more regular, easier-to-implement divide-and-conquer strategy, for instance the Luk-Vuillemin multiplier.

Alain Merigot
  • What do you think about [figure 39.5 (4-2 adder)](https://i.imgur.com/XZn5EPX.png), which seems to give the Wallace tree a regular structure? – kevin998x Jan 27 '19 at 05:46
  • If we pipeline these 4-2 adders, would that make this structure more attractive in terms of circuit regularity as well as latency (time complexity) compared to carry-save array multiplier ? – kevin998x Jan 27 '19 at 05:58
  • From [Area-Time optimal VLSI integer multiplier with minimum computation time](https://www.sciencedirect.com/science/article/pii/S0019995883800618) , why does Luk-Vuillemin multiplier use fourier transform ? Won't this add more complexity in hardware implementation ? – kevin998x Jan 27 '19 at 07:17
  • WTs are regular; the problem is the global interconnection, which leads to irregular structures. The problem with pipelining them is the varying width. The traditional 'dot' representation (as in your question) masks the fact that every dot represents information that must be forwarded to the next stage. And your figure is just for an 8x8-bit mult. Second problem: how do you balance the traversal times of the different stages? An integer 64x64 mult in present processors has a 3 cy latency. Can one reduce it to 2? Maybe, but the implementation cost is very high. – Alain Merigot Jan 27 '19 at 07:22
  • Wait, did you see the use of 4-2 adder to REPLACE wallace tree in my previous comment ? – kevin998x Jan 27 '19 at 07:29
  • Luk-Vuillemin use a divide-and-conquer strategy. Given A = 2^(n/2)·Ah + Al, they consider that A×B = 2^n·Ah×Bh + 2^(n/2)·(Ah×Bl + Al×Bh) + Al×Bl. After applying that recursively, they perform the summations with CS adders. The generation strategy is much simpler to understand and implement, and the complexity is similar to Wallace-Dadda mults. – Alain Merigot Jan 27 '19 at 07:30
  • I missed the replacement of the WT by 4-2 adders. Indeed, this is more regular: you can halve the number of bits to sum in 2 layers. You still have some irregularity, though, because the bits to add are arranged in a trapezoid, not a rectangle. – Alain Merigot Jan 27 '19 at 07:39
  • How do the additions end up in a trapezoid shape? – kevin998x Jan 27 '19 at 07:46
  • The initial columns are trapezoid. If columns to add were all 8 bits, the 4-2 reduction would be regular. But some are 7, 6, 5, etc. – Alain Merigot Jan 27 '19 at 07:59
  • I am a bit confused: "some are 7, 6, 5, etc"? – kevin998x Jan 27 '19 at 08:43
  • Sorry, I was not clear enough. This represents the number of bits in the columns that must be counted. These columns are shaped like a trapezoid (as in the upper left part of your figure) and this introduces some irregularity, even with a regular reduction strategy. – Alain Merigot Jan 27 '19 at 09:10
  • " Given A=2^n/2xAh+Al, they consider that AxB=2^nxAhxBh+2^n/2(AhxBl+AlxBh)+AlxBl. After applying that recursively, they perform the summations with CS adders. " <== this is a bit confusing, an actual numerical example ? By the way, where did you quote this from ? In the pdf link in my previous comment, I only saw polynomial multiplication. – kevin998x Jan 27 '19 at 11:50
  • Let T(n) be the time of an A×B multiply where A and B are n bits. Assume you compute A×B = 2^n·Ah×Bh + 2^(n/2)·(Ah×Bl + Al×Bh) + Al×Bl. All the mults on half words are performed in parallel and require a time T(n/2). If the additions are done with a CSA, they take time O(1). Hence T(n) = T(n/2) + O(1), and T(n) = O(log n). I do not presently have the Luk-Vuillemin paper. Maybe in a dusty paper stack in my office. It is an old paper, but it may be found online. I do not know Luk, but you can drop a mail to Jean Vuillemin https://www.di.ens.fr/~jv/ – Alain Merigot Jan 27 '19 at 12:06
  • I think this is the [paper : Vuillemin, J. (1983). A very fast multiplication algorithm for VLSI implementation. Integration](https://sci-hub.tw/10.1016/0167-9260(83)90005-6) that you were referring to ? – kevin998x Jan 27 '19 at 12:26
  • I also remember a paper with Luk that was maybe more detailed. But the idea is absolutely the same. – Alain Merigot Jan 27 '19 at 12:35
  • Were you referring to section 2 of [Recursive implementation of optimal time VLSI integer multipliers](https://pdfs.semanticscholar.org/415c/d98dafb5c9cb358c94189927e1f3216b7494.pdf#page=8) , but there are some VLSI algorithms such as 4M, 3M and 2M versions ? Have you ever used any of these ? – kevin998x Jan 28 '19 at 02:25
  • Besides, is DADDA irregular as well ? and if yes, why so ? What do you think about using [multiplier bit-pair encoding](http://vulms.vu.edu.pk/Courses/CS501/Downloads/Bit-Pair%20Recoding.pdf#page=3) on the 4-2 adders or DADDA ? – kevin998x Jan 28 '19 at 03:33
  • Besides, regarding the [4-2 adders](https://i.imgur.com/XZn5EPX.png) , all partial bits are arranged in ONLY two levels. Why did you claim that "some are 7, 6, 5, etc" ? – kevin998x Jan 28 '19 at 04:23
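The divide-and-conquer recursion discussed in the comments above can be sketched as a behavioural Python model (assuming n is a power of two; the function name is mine, and the recursive calls model what would be parallel hardware):

```python
def dc_multiply(a, b, n):
    """Divide-and-conquer multiply: split each n-bit operand into high/low
    halves and recombine as A*B = 2^n*Ah*Bh + 2^(n/2)*(Ah*Bl + Al*Bh) + Al*Bl.
    In hardware, the four half-size products run in parallel and the
    additions use carry-save adders, giving T(n) = T(n/2) + O(1) = O(log n)."""
    if n <= 1:
        return a * b                       # base case: a single AND gate
    h = n // 2
    ah, al = a >> h, a & ((1 << h) - 1)    # high/low halves of a
    bh, bl = b >> h, b & ((1 << h) - 1)    # high/low halves of b
    return ((dc_multiply(ah, bh, h) << (2 * h))
            + ((dc_multiply(ah, bl, h) + dc_multiply(al, bh, h)) << h)
            + dc_multiply(al, bl, h))
```

Of course, in software this recursion buys nothing over `a * b`; the point is that the recurrence depth is log n, which is where the hardware latency bound comes from.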