5

I have the following data:

A = [a0 a1 a2 a3 a4 a5 .... a24]
B = [b0 b1 b2 b3 b4 b5 .... b24]

which I then want to multiply as follows:

C = A * B' = [a0b0 a1b1 a2b2 ... a24b24]

This clearly involves 25 multiplies.

However, in my scenario, only 5 new values are shifted into A per "loop iteration" (and 5 old values are shifted out of A). Is there any fast way to exploit the fact that data is shifting through A rather than being completely new? Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations). I initially thought a systolic array might help, but it doesn't (I think!?)

Update 1: Note B is fixed for long periods, but can be reprogrammed.

Update 2: the shifting of A is like the following: a[24] <= a[19], a[23] <= a[18]... a[1] <= new01, a[0] <= new00. And so on so forth each clock cycle

Many thanks!

trican
  • 1,157
  • 4
  • 15
  • 24
  • Does `B` stay the same through the whole process of shifting? – Sergey Kalinichenko Apr 11 '13 at 15:50
  • 1
    How about building 5 multipliers to work in parallel or 1 multiplier clocked at 5x rate? – Alexey Frunze Apr 11 '13 at 15:53
  • What is B like? e.g. if this was a DSP algorithm with fixed filter length, B could consist of small coefficients with an average of less than 2 bits set. – Aki Suihkonen Apr 11 '13 at 15:58
  • 2
    What's with the tags? Is this C code or vhdl or assembly or something entirely else? What are you trying to do? Design a CPU or write a program to execute on one? – jalf Apr 11 '13 at 15:58
  • @jalf :-) its dedicated hardware, I'm just trying to increase the number of eyeballs that see this question - there are not that many subscribers to the vhdl or verilog tags – trican Apr 11 '13 at 16:03
  • @dasblinkenlight B stays the same .... most of the time, but can be reprogrammed. – trican Apr 11 '13 at 16:04
  • @AkiSuihkonen The values of B are currently about 16 bits - these possibly could be optimized – trican Apr 11 '13 at 16:05
  • I'd consider in that case distributed arithmetic with CSA tree. – Aki Suihkonen Apr 11 '13 at 16:07
  • @AlexeyFrunze I only want a single clock here. 5 multipliers in parallel? I'm not sure how that would work? I need 25 results per clock cycle and 5 new values are shifting in per clock cycle – trican Apr 11 '13 at 16:15
  • by **shifting** do you mean that the element at that index is being replaced by a new one? for example `a[0]` with new element or `a[0]` with `a[1]`and `a[1]` with `a[2]` an so on?are you playing with DMA for direct storage? – Koushik Shetty Apr 11 '13 at 16:25
  • @Koushik the new values a[25] <= a[20], a[24] <= a[19]... a[1] <= new01, a[0] <= new00. I hope this clarifies matters – trican Apr 11 '13 at 16:31
  • it should be 26 multiplications rite as per your code? – Koushik Shetty Apr 11 '13 at 16:46
  • Sorry, zero index description bug there! Its 25 multiplications, entries are from 0 to 24 - I'll update the question – trican Apr 11 '13 at 16:54
  • @trican: "trying to incrase the number of eyeballs"... by inserting misleading tags? Here, have a downvote. Next time, try using *relevant* tags. – jalf Apr 11 '13 at 17:52
  • @jalf - I disagree strongly, I believe those are suitable tags - this is an algorithmic question regarding efficient implementation (albeit in hardware) - if there is an optimization possible people knowledgeable on low c or assembly are as likely to be able to contribute useful suggestions (see andand's answer below) as hardware orientated folk. Alternatively what tags would you suggest? – trican Apr 11 '13 at 18:05

3 Answers3

2

Is there any fast way to exploit the fact that data is shifting through A rather than being completely new?

Even though all you're doing is the shifting and adding new elements to A, the products in C will, in general, all be different since one of the operands will generally change after each iteration. If you have additional information about the way the elements of A or B are structured, you could potentially use that structure to reduce the number of multiplications. Barring any such structural considerations, you will have to compute all 25 products each loop.

Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations).

In theory, you can reduce the number of multiplications to 0 by shifting and adding the array elements to simulate multiplication. In practice, this will be slower than a hardware multiplication so you're better off just using any available hardware-based multiplication unless there's some additional, relevant constraint you haven't mentioned.

andand
  • 17,134
  • 11
  • 53
  • 79
1

on the very first 5 data set you could be saving upto 50 multiplications. but after that its a flat road of multiplications. since for every set after the first 5 set you need to multiply with the new set of data.

i'l assume all the arrays are initialized to zero. i dont think those 50 saved are of any use considering the amount of multiplication on the whole.

But still i will give you a hint on how to save those 50 maybe you could find an extension to it?

1st data set arrived : multiply the first data set in a with each of the data set in b. save all in a, copy only a[0] to a[4] to c. 25 multiplications here.

2nd data set arrived : multiply only a[0] to a[4](having new data) with b[0] to b[4] resp. save in a[0] to a[4],copy to a[0->9] to c. 5 multiplications here

3rd data set arrived : multiply a[0] to a[9] with b[0] to b[9] this time and copy to corresponding a[0->14] to c.10 multiplications here

4th data set : multiply a[0] to a[14] with corresponding b copy corresponding a[0->19] to c. 15 multiplications here.

5th data set : mutiply a[0] to a[19] with corresponding b copy corresponding a[0->24] to c. 20 multiplications here.

total saved mutiplications : 50 multiplications.

6th data set : usual data multiplications. 25 each. this is because for each set in the array a there a new data set avaiable so multiplication is unavoidable.

Koushik Shetty
  • 2,146
  • 4
  • 20
  • 31
0

Can you add another array D to flag the changed/unchanged value in A. Each time you check this array to decide whether to do new multiplications or not.

Tianyun Ling
  • 1,045
  • 1
  • 9
  • 24