More resource efficient way to get the maximum of the last 512 values

Question

I have written some VHDL code that stores the last 512 values of an input signal and calculates the largest of the stored values. This code works but uses a lot of the LUT resources of my FPGA. The purpose of the code is to calculate the largest value of the last 512 samples, is there a more resource efficient way of achieving this? (it's important that it calculates the largest of the last 512 values and not the largest value observed from that input, the latter could easily be achieved storing a single number).

Alternatively is there someway I could code the VHDL such that the synthesiser would implement the array as block RAM (BRAM) instead of LUTs?

The synthesiser I am using is LabVIEW FPGA (which I believe uses XilinX ISE internally to compile/synthesise the VHDL).

My current code for this is shown below:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity RecentMax is
  port (
    clk : in std_logic;
    reset : in std_logic;
    InputSignal : in std_logic_vector(15 downto 0);
    Max : out std_logic_vector(15 downto 0)
    );
end RecentMax;

architecture RTL of RecentMax is
-- declarations
  type Array512 is array(0 to 511) of signed(15 downto 0);
  signal PastVals : Array512;
  type Array256 is array(0 to 255) of signed(15 downto 0);
  signal result : Array256;

  signal CalculationState : unsigned(1 downto 0);
  signal NLeftToCompute : unsigned(8 downto 0);
begin
-- behaviour
  process(clk)
  begin
    if(rising_edge(clk)) then
      if(reset = '1') then
        -- reset values
        for i in PastVals'low to PastVals'high loop
          PastVals(i) <= (others => '0');
        end loop;
        for i in result'low to result'high loop
          result(i) <= (others => '0');
        end loop;
        CalculationState <= to_unsigned(0, 2);
        Max <= std_logic_vector(to_signed(0, 16));
        NLeftToCompute <= to_unsigned(256, 9);
      else
        -- do stuff
        case to_integer(CalculationState) is
          when 0 =>
            for i in PastVals'low to PastVals'high-1 loop
              PastVals(i+1) <= PastVals(i);
            end loop;
            PastVals(0) <= signed(InputSignal);
            Max <= std_logic_vector(result(0));
            NLeftToCompute <= to_unsigned(256, 9);
            CalculationState <= to_unsigned(1, 2);
          when 1 =>
            for i in 0 to 255 loop
              if (i <= to_integer(NLeftToCompute)-1) then
                if PastVals(i*2) > PastVals(i*2+1) then
                  result(i) <= PastVals(i*2);
                else
                  result(i) <= PastVals(i*2+1);
                end if;
              end if;
            end loop;
            NLeftToCompute <= shift_right(NLeftToCompute, 1);
            CalculationState <= to_unsigned(2, 2);
          when 2 =>;
            for i in 0 to 127 loop
              if (i <= to_integer(NLeftToCompute)-1) then
                if result(i*2) > result(i*2+1) then
                  result(i) <= result(i*2);
                else
                  result(i) <= result(i*2+1);
                end if;
              end if;
            end loop;
            if NLeftToCompute > 2 then
              NLeftToCompute <= shift_right(NLeftToCompute, 1);
            else
              CalculationState <= to_unsigned(0, 2);
            end if;
          when others =>
            --- do nothing - shouldn't get here
        end case;
    end if;
  end if;
end process;
end RTL;

You can't use a BlockRAM, because you have a 512-port memory. You are reading all values in the same clock cycle, so there is no storage optimization possible. You'll also get a big comparator, because you are comparing 512 values times 16 bits at once. You should think about creating a tree of comparators to cut the critical path to increase the frequency of the resulting circuit. — Paebbels, Sep 12 '17 at 12:39
You example code is implementing a FSM with different processing stages. I think your entity should offer handshake ports, so signal when it's ready to accept new data and when to hold data for processing. But this behavior is a contradiction to your question, which implies continuous data. So just to be sure, it that data continuous or can a calculation take for example 512 cycles to complete? — Paebbels, Sep 12 '17 at 12:44
The data is continuous, a new sample is taken every clock cycle. — SomeRandomPhysicist, Sep 12 '17 at 13:18
It would probably be fine for my application if it could only calculate the max every 512 cycles. I guess in this case it'd have to read 512 values, take 512 cycles to calculate the max and then read in the next 512 cycles. However in this case I guess I could just use a single number to store the max, spend 512 cycles reading values and only storing the max it has seen. Then use this as the new max for 512 cycles as it calculates the new max. — SomeRandomPhysicist, Sep 12 '17 at 13:34
I'm not 100% percent sure, but I think this would be fine for the application I have in mind. — SomeRandomPhysicist, Sep 12 '17 at 13:35
There is effective algorithm for keeping maximum in sliding window of given size: https://stackoverflow.com/questions/12190184/ Don't know - is it easy to implement such logic in your system? Here deque requires static buffer with size 512. Arbitrary example: http://www.geeksforgeeks.org/implementation-deque-using-circular-array/ — MBo, Sep 12 '17 at 14:31
If calculating a value every 512 cycles is ok, then you can do this: use a dual port BlockRAM. Port A continuously writes new values to 512 addresses as a ring buffer. This keeps your set of the latest 512 values. Port B is used to scan all 512 values and compare it to a single maximum register. After 512 compares, the register and the read-pointer is reset. You can improve the cycle time by using a 32 or 64-bit wide BlockRAM with 256 or 128 entires, respectively. The write port updates 2 or 4 times the same address before going to the next address. — Paebbels, Sep 12 '17 at 16:50
The read port reads and compares 2 or 4 values per cycles against your maximum register leading to one maximum value every 256 or 128 cycles. I think the question is, how big is the variation of values. Or in other words do you expect drastic maximum changes with a 256 or 128 cycle calculation window? — Paebbels, Sep 12 '17 at 16:53
@MBo We are talking hear about synthesizable VHDL code that results in hardware: So we have no dynamic memory allocation in FPGAs. More over, the article is about an `O(n)` solution, but the question is for a `O(1)` solution, which is possible in hardware, but consumes a lot of resources. — Paebbels, Sep 12 '17 at 17:01
I could expect drastic maximum changes depending on the frequency of the input signal relative to the sample frequency. I have implemented a simpler solution which takes N clock cycles but doesn't require storing the last N values at all. This is shown in my answer below. Thanks for your help. — SomeRandomPhysicist, Sep 12 '17 at 17:05
@Paebbels Yes, I realize that VHDL code is drastically different, that is why I doubted about implementation. Buffer is static. O(n) solution is for all array length, so amortized O(1) per new entering element. — MBo, Sep 12 '17 at 17:12

score 1 · Answer 1 · answered Sep 12 '17 at 11:55

1

There are two possibilities for what you want.

First you can choose to instantiate a BRAM by creating a dedicated process and array, so that the synthesizer choose to use the block ram instead of 512 luts.

CONSTANT DATA_WIDTH     : integer := 16;
CONSTANT ADD_WIDTH      : integer := 9; -- 512 addresses
CONSTANT DPRAM_DEPTH    : integer := 2**ADD_WIDTH; -- depth of the memory

TYPE dpram IS ARRAY (0 TO DPRAM_DEPTH - 1) OF std_logic_vector(DATA_WIDTH - 1 DOWNTO 0);
SIGNAL mem              : dpram;

and then the process :

dpram_gen_p : PROCESS (clk_i)
BEGIN
    IF (rising_edge(clk_i)) THEN
        IF (wr_req_i = '1') THEN
            mem(wr_add_s) <= wr_data_i;
        END IF;
        rd_data_s <= mem(rd_add_s);
    END IF;
END PROCESS;

For the main synthesizer, this syntax will be implemented as a block RAM. Then, instead of your PastVals signal you need to use the data port of the RAM. Be careful about the reading cycles since you need one clock cycle to change the address of the read interface (rd_add_s) and one to actually read the data (rd_data_s).

The second option (according to me it is the easiest and fastest) is just to implement a fifo memory (you can use the Xilinx IP core generator) of size 512 words.

https://www.xilinx.com/support/documentation/ip_documentation/fifo_generator/v13_1/pg057-fifo-generator.pdf

Then you just need to write in the fifo until it's full and finally read the data word by word until it's empty and register the highest value as you already did in your design.

answered Sep 12 '17 at 11:55

A. Kieffer

372
2
13

2

This won't work, because he is processing all compares in one clock cycle. This requires a 512-port RAM. – Paebbels Sep 12 '17 at 12:40
Could I store the array in a BRAM and have the BRAM be ordered such that I only have to read out the end value to get the largest? – SomeRandomPhysicist Sep 12 '17 at 13:20
Can I add a sorting algorithm to the process for example, or would this mean that it could not be implemented as BRAM? – SomeRandomPhysicist Sep 12 '17 at 13:30
@Paebbels that's what he is doing currently, but not what he's asking for, is it? – A. Kieffer Sep 12 '17 at 15:03
@A.Kieffer I also was a bit unsure, but he clarified, that the input is provided on every clock cycle. Otherwise I would fully support your solution :). I'm currently brooding about a carrychain based solution to find the number with the most leftmost ones in a set of 512 16-bit numbers.... – Paebbels Sep 12 '17 at 16:56
1

@SomeRandomPhysicist no, you cannot add logic to the RAM storage, for the reason you already suggest. Currently the RAMs are simple store and fetch components. So for ordering you would need to perform a large amount of store and fetch operations. This requires an additional controller plus a large amount of clock cycles to process. In future FPGAs we will hopefully have the possibility for in-memory operations, although this will probably only be available fit high-end FPGAs. – JHBonarius Sep 13 '17 at 08:27

SomeRandomPhysicist · Accepted Answer · 2017-09-13T13:27:44.483

For this particular application it was sufficient to just update the maximum every 512 clock cycles. My updated code solution is shown below. I'd still be interested in an answer to this question, as to if there's a more resource efficient method that works in a low number of clock cycles.

Code Solution:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity RecentMax is
  port (
    clk : in std_logic;
    reset : in std_logic;
    InputSignal : in std_logic_vector(15 downto 0);
    Max : out std_logic_vector(15 downto 0)
    );
end RecentMax;

architecture RTL of RecentMax is
-- declarations
signal counter : integer;
signal RecentMax : signed(15 downto 0);

begin
-- behaviour
  process(clk)
  begin
    if(rising_edge(clk)) then
      if(reset = '1') then
        -- reset values
        counter <= 0;
        RecentMax <= to_signed(0, 16);
      else
      -- do stuff
      if counter = 0 then
        Max <= std_logic_vector(RecentMax);
        counter <= counter + 1;
        RecentMax <= to_signed(0, 16);
      else
        if signed(InputSignal) > RecentMax then
          RecentMax <= signed(InputSignal);
        end if;
        if counter >= 511 then
          counter <= 0;
        else
          counter <= counter + 1;
        end if;
      end if;  
    end if;
  end if;
end process;
end RTL;

The code you provide is incomplete. `counter ` is not declared anywhere. — JHBonarius, Sep 13 '17 at 08:29
You will not likely find a solution using significantly less resources, nor less clock cycles. How would you expect to find the maximum of 512 values in less then 512 cycles without using massive amounts of logic? Algorithmically, I mean. — JHBonarius, Sep 13 '17 at 08:33
Thanks, I completed the code, I accidentally copied an incomplete version. — SomeRandomPhysicist, Sep 13 '17 at 13:28

More resource efficient way to get the maximum of the last 512 values

2 Answers2