How can I speed up my math operations in VHDL?

Question

I currently have some calculations running on the rising edge of a 75 MHz pixel clock to output 720p video on screen. Some of the math (such as a few modulo operations) takes too long (20+ ns, whereas one period at 75 MHz is 13.3 ns), so my timing constraints are not met. I'm new to FPGAs, but I'm wondering whether there is a way, for example, to run the calculations at a faster rate than the pixel clock so that they are completed by the next tick of the 75 MHz clock. I'm using VHDL, by the way.

There are a multitude of ways to optimise for timing - if you can show us a specific problem, we can give a specific answer (or a few at least) — Martin Thompson, Feb 08 '13 at 13:33

score 14 · Answer 1 · 2013-02-08T20:43:06.907

75 MHz is already quite slow by today's FPGA standards.

The problem is the modulo operation, which effectively involves division; and division is slow.

Think carefully about the operations you need, and if there is any way to reorganise the computation. If you are clocking pixels it's not as if you have 32-bit integers to deal with; restricted values are easier to deal with.

Martin hinted at one option: strength reduction. If you have 1280 pixels/line and need to operate on every third one, you don't need to compute 1280 mod 3! Count 0,1,2,0,... instead.
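The counting idea could look something like the sketch below. The entity name, port names and the line_start strobe are hypothetical (they are not from the question); the only point is that a 2-bit counter tracks the column modulo 3 with no divider anywhere.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: all names here are illustrative assumptions.
entity pixel_mod3 is
    port (
        clk        : in  std_logic;               -- pixel clock
        line_start : in  std_logic;               -- assumed: high for one clock before pixel 0
        col        : out unsigned(10 downto 0);   -- pixel column
        col_mod3   : out unsigned(1 downto 0)     -- col mod 3, tracked by counting
    );
end entity;

architecture rtl of pixel_mod3 is
    signal col_i  : unsigned(10 downto 0) := (others => '0');
    signal mod3_i : unsigned(1 downto 0)  := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if line_start = '1' then
                -- restart both counters together at the start of each line
                col_i  <= (others => '0');
                mod3_i <= (others => '0');
            else
                col_i <= col_i + 1;
                -- count 0, 1, 2, 0, ... in step with the column counter
                if mod3_i = 2 then
                    mod3_i <= (others => '0');
                else
                    mod3_i <= mod3_i + 1;
                end if;
            end if;
        end if;
    end process;

    col      <= col_i;
    col_mod3 <= mod3_i;
end architecture;
```

The two counters only stay in agreement while col_i does not wrap, so this assumes line_start arrives at least once every 2048 clocks.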

Another, if you need modulo-3 of an 8-bit (or 12-bit) number, is to store all possible values in a lookup table, which will be fast enough.
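A minimal sketch of the lookup-table option for an 8-bit input follows. The entity and port names are made up for illustration, and registering the output is my assumption; the table contents are computed once at elaboration, so no modulo hardware is generated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: entity and port names are illustrative assumptions.
entity mod3_lut is
    port (
        clk : in  std_logic;
        din : in  unsigned(7 downto 0);
        r   : out unsigned(1 downto 0)   -- din mod 3, one clock later
    );
end entity;

architecture rtl of mod3_lut is
    type rom_t is array (0 to 255) of unsigned(1 downto 0);

    -- Fill the table at elaboration time; "mod" here works on integer
    -- constants, so it costs no logic in the device.
    function init_rom return rom_t is
        variable rom : rom_t;
    begin
        for i in rom_t'range loop
            rom(i) := to_unsigned(i mod 3, 2);
        end loop;
        return rom;
    end function;

    constant MOD3_ROM : rom_t := init_rom;
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- registered read: one table lookup per clock
            r <= MOD3_ROM(to_integer(din));
        end if;
    end process;
end architecture;
```

For a 12-bit input the same pattern applies with a 4096-entry table.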

Or sometimes you can multiply by 1/3 (X"5555") instead of dividing by 3, then multiply the quotient by 3 (a shift and a single addition) and subtract it from the original value to get the modulo. This pipelines really well, but since X"5555" is only an approximation to 1/3 (3 × X"5555" = X"FFFF", so the raw quotient can come out one too low), you need to verify in simulation that it delivers the correct output for every input. (For 16-bit inputs, this isn't a big simulation!) The extension to modulo 9 is easy.
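Here is one way the reciprocal trick might be pipelined. Note the assumptions: the entity and signal names are hypothetical, and the constant X"AAAB" with a 17-bit shift is my substitution rather than the X"5555" mentioned above. Since 3 × X"AAAB" = 2^17 + 1, the quotient should be exact for every 16-bit input, but as the answer says, check all 65536 inputs in simulation before trusting it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: names and the choice of constant are assumptions.
entity mod3_recip is
    port (
        clk : in  std_logic;
        din : in  unsigned(15 downto 0);
        r   : out unsigned(1 downto 0)   -- din mod 3, two clocks later
    );
end entity;

architecture rtl of mod3_recip is
    -- (2**17 + 1) / 3 = 43691 = X"AAAB"
    constant RECIP : unsigned(15 downto 0) := x"AAAB";
    signal prod    : unsigned(31 downto 0);
    signal din_d1  : unsigned(15 downto 0);
begin
    process (clk)
        variable q  : unsigned(15 downto 0);
        variable q3 : unsigned(15 downto 0);
    begin
        if rising_edge(clk) then
            -- cycle 1: multiply by the scaled reciprocal, keep a copy of the input
            prod   <= din * RECIP;
            din_d1 <= din;
            -- cycle 2: quotient is prod >> 17; then remainder = input - 3 * quotient
            q  := resize(prod(31 downto 17), 16);
            q3 := q + shift_left(q, 1);          -- 3*q as one shift and one add
            r  <= resize(din_d1 - q3, 2);
        end if;
    end process;
end architecture;
```

The input copy din_d1 keeps the subtraction aligned with the quotient computed from the same sample.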

EDIT:

Two points from your comments: Another option you have is to create a 2x clock (150 MHz) using the Spartan's clock generators, which gives you 2 cycles per pixel. Well-pipelined code should meet 150 MHz without much trouble.

How not to pipeline!

PROCESS(Clk)
BEGIN
    if(rising_edge(Clk)) then
        for i in 0 to 2 loop
            case i is
                when 0 => temp1 <= a*data;
                when 1 => temp2 <= temp1*b;
                when 2 => result <= temp2*c;
                when others => null;
            end case;
        end loop;
    end if;
END PROCESS;

The first thing to realise is that the loop and case statement cancel each other out, so this simplifies to

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        temp1 <= a*data;
        temp2 <= temp1*b;
        result <= temp2*c;
    end if;
END PROCESS;

which is buggy! The testbench, being buggy as well, hides the problem.

In cycle 1, data, a, b and c are presented, and temp1 = data*a is computed.
In cycle 2, temp1 is multiplied by whatever value b holds NOW, instead of the value that was presented alongside data in cycle 1!
Same again with c in cycle 3!

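Since the testbench sets the inputs and leaves them constant, it won't catch the problem! The fix is to delay b and c through copy registers so that each one reaches its multiplier in the same cycle as the partial product it belongs to: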

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a*data;
        b_copy  <= b;
        c_copy1 <= c;
        -- cycle 2
        temp2   <= temp1*b_copy;
        c_copy2 <= c_copy1;
        -- cycle 3
        result  <= temp2*c_copy2;
    end if;
END PROCESS;

I like to comment each cycle; every term I use in a cycle must come from the immediately preceding cycle, either by calculation or from a copy.

At least this works, but it could be reduced to a depth of 2 cycles with no copy registers at all, because in this example the four inputs are independent (and I am assuming no measures are required to avoid overflow). So:

PROCESS(Clk)
BEGIN
    if rising_edge(Clk) then
        -- cycle 1
        temp1   <= a * data;
        temp2   <= b * c;
        -- cycle 2
        result  <= temp1 * temp2;
    end if;
END PROCESS;
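To avoid the constant-input trap described above, a testbench can present a new set of inputs on every clock and check each result against the inputs it belongs to. The sketch below wraps the 2-cycle process in a hypothetical entity; the entity name, the 8-bit operand widths and the stimulus values are all my assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper around the 2-cycle process; widths are assumed.
entity mult_pipe is
    port (
        clk           : in  std_logic;
        a, b, c, data : in  unsigned(7 downto 0);
        result        : out unsigned(31 downto 0)
    );
end entity;

architecture rtl of mult_pipe is
    signal temp1, temp2 : unsigned(15 downto 0);
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- cycle 1
            temp1  <= a * data;
            temp2  <= b * c;
            -- cycle 2
            result <= temp1 * temp2;
        end if;
    end process;
end architecture;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_pipe_tb is
end entity;

architecture sim of mult_pipe_tb is
    signal clk           : std_logic := '0';
    signal a, b, c, data : unsigned(7 downto 0) := (others => '0');
    signal result        : unsigned(31 downto 0);
    signal done          : boolean := false;

    -- Product expected for input set i (must match the stimulus below).
    function expected (i : natural) return natural is
    begin
        return (i + 1) * (3 * i + 7) * (2 * i + 3) * (i + 5);
    end function;
begin
    clk <= not clk after 5 ns when not done else '0';

    dut : entity work.mult_pipe
        port map (clk => clk, a => a, b => b, c => c, data => data, result => result);

    stimulus : process
    begin
        for i in 0 to 20 loop
            -- a different input set every cycle, so stale operands show up
            a    <= to_unsigned(i + 1, 8);
            data <= to_unsigned(3 * i + 7, 8);
            b    <= to_unsigned(2 * i + 3, 8);
            c    <= to_unsigned(i + 5, 8);
            wait until rising_edge(clk);    -- this edge samples set i
            wait until falling_edge(clk);
            -- the same edge moved set i-1 from temp1/temp2 into result
            if i >= 1 then
                assert result = to_unsigned(expected(i - 1), 32)
                    report "mismatch for input set " & integer'image(i - 1)
                    severity error;
            end if;
        end loop;
        done <= true;
        wait;
    end process;
end architecture;
```

Swapping the buggy three-assignment process into mult_pipe (and checking two edges later instead of one) should make the assertion fire, which is exactly what a constant-input testbench cannot do.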

The chip can run much faster (it's a spartan6), but it's just that the pixel clock is at 75MHz, and I don't know how to 'run other things faster'. I do understand now that I can just count though. — bparker, Feb 08 '13 at 19:31
Thanks for explaining 'How not to pipeline!'. I remembered it since it helped me understand pipelining when I first started. Now I see it's a very silly code. — HeyYO, Feb 08 '13 at 21:03

score 12 · Accepted Answer · answered Feb 08 '13 at 13:43

Here are some techniques:

Pipelining - split the logic up to operate over multiple clock cycles.
Multi-cycle path - if you don't need the answer every cycle, you can tell the tools that it's OK for the path to take longer. Care is required not to tell the tools the wrong thing, though!
Think again - for example, do you really need to do x mod 3 on a very wide x, or could you use a continuously updated modulo-3 counter?
Use better tools - I've had instances where an expensive synthesizer met timing on a deep logic path while the vendor's synthesizer failed to meet timing on the same code.

More extreme solutions involve changing the silicon, for a faster device, or a newer device, or a newer, faster device.

score 2 · Answer 3 · HeyYO · 2013-02-08T21:07:27.980

Complex math operations in FPGAs are usually pipelined. Pipelining means you divide your operation into stages. Let's say you have a multiplier which takes too long for your clock speed. You divide the multiplier into 3 stages, so that it consists of three different parts (each with its own clock input) chained one after another. Each of these parts is smaller than the original multiplier, so it has a smaller delay, and you can therefore use a faster clock.

The drawback is latency: a pipelined system delivers each output some cycles after its input. In the multiplier example above, you have to wait until your input has passed through all 3 stages before the correct output appears. But this delay is usually very small (depending on your design, of course) and can be ignored.

Here is a good (!) post about this: http://vhdlguru.blogspot.com/2011/01/what-is-pipelining-explanation-with.html EDIT: See Brian's post instead.

Also, vendors usually ship optimized and pipelined versions of math operations as IP cores in their design software. Look for them.

edited Feb 08 '13 at 21:07

answered Feb 08 '13 at 05:28

HeyYO


If I have an operation for example, "1280 mod 3" or "720 mod 9", how would that be broken down into smaller pieces? I do understand about pipelining, just not sure how a modulo itself would be pipelined. – bparker Feb 08 '13 at 05:42
Well you should try implementing your own modulo function. Then you work on pipelining that. But that's another topic. – HeyYO Feb 08 '13 at 08:00
Good grief, that "vhdlguru" pipeline example has to be one of the silliest pieces of VHDL I have seen in a long time! – Feb 08 '13 at 17:03
@BrianDrummond Maybe my opinion is invalid being a beginner but, I found the article very informative on teaching how pipelining works and how it can be implemented. If you have a better article, please do share it with us. – bparker Feb 08 '13 at 19:19
look at the example code ... the loop and the case statement do precisely nothing! leaving only the three assignments. And it's buggy because ... OK I'd better edit it into my answer. – Feb 08 '13 at 20:21
