Alternative method for creating low clock frequencies in VHDL

Question

In the past I asked a question about resets, and how to divide a high clock frequency down to a series of lower square-wave clock frequencies, where the outputs are harmonics of one another, e.g. the first output is 10 Hz, the second is 20 Hz, etc.

I received several really helpful answers recommending what appears to be the convention of using a clock enable pin to create lower frequencies.

An alternative has since occurred to me: using an n-bit number that is constantly incremented, and taking the most significant x bits of the number as the clock outputs, where x is the number of outputs.

It works in synthesis for me - but I'm curious to know - as I've never seen it mentioned anywhere online or on SO, am I missing something that means it's actually a terrible idea and I'm simply creating problems for later?

I'm aware that the limitations on this are that I can only produce frequencies that are the input frequency divided by a power of 2, and so most of the time it will only approximate the desired output frequency (but will still be of the right order). Is this limitation the only reason it isn't recommended?

Thanks very much!

David

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
library UNISIM;
use UNISIM.VComponents.all;
use IEEE.math_real.all;

ENTITY CLK_DIVIDER IS
    GENERIC(INPUT_FREQ : INTEGER;       --Can only divide the input frequency by a power of 2
            OUT1_FREQ  : INTEGER
    );
    PORT(SYSCLK  : IN  STD_LOGIC;
         RESET_N : IN  STD_LOGIC;
         OUT1    : OUT STD_LOGIC;       --Actual divider is  2^(ceiling[log2(input/freq)])
         OUT2    : OUT STD_LOGIC);      --Actual output is input over value above
END CLK_DIVIDER;

architecture Behavioral of Clk_Divider is
    constant divider      : integer                             := INPUT_FREQ / OUT1_FREQ;
    constant counter_bits : integer                             := integer(ceil(log2(real(divider))));
    signal counter        : unsigned(counter_bits - 1 downto 0) := (others => '0');
begin
    proc : process(SYSCLK)
    begin
        if rising_edge(SYSCLK) then
            counter <= counter + 1;
            if RESET_N = '0' then
                counter <= (others => '0');
            end if;
        end if;
    end process;
    OUT1 <= counter(counter'length - 1);
    OUT2 <= not counter(counter'length - 2);
end Behavioral;

Morten Zilmer · Accepted Answer · 2015-12-16T13:37:09.990

Functionally the two outputs OUT1 and OUT2 can be used as clocks, but that method of making clocks does not scale and is likely to cause problems in the implementation, so it is a bad habit. However, it is of course important to understand why this is so.

The reason it does not scale is that every signal used as a clock in an FPGA must be distributed through a dedicated clock net, where the latency and skew are well-defined, so that all flip-flops and memories on each clock are updated synchronously. The number of such clock nets is very limited, usually in the range of 10 to 40 in an FPGA device, and restrictions on their use and location typically make it even more critical to plan the use of clock nets. So clock nets usually have to be reserved for truly asynchronous clocks, where there is no alternative to using a clock net.

The reason it is likely to cause problems is that clocks created from bits of a counter have no guaranteed timing relation. So if data has to be moved between these clock domains, additional constraints are required for synchronization, in order to be sure that the Clock Domain Crossing (CDC) is handled correctly. This is done through constraints for synthesis and/or Static Timing Analysis (STA), and is usually a little tricky to get right, so using a design methodology that simplifies STA is a habit that saves design time.

So in designs where it is possible to use a common clock, and then generate synchronous clock enable signals, this should be the preferred approach. For the specific design above, a clock enable can be generated simply by detecting the '0' to '1' transition of the relevant counter bit and asserting the clock enable for the single cycle in which the transition is detected. Then a single clock net can be used, together with two clock enables such as CE1 and CE2, and no special STA constraints are required.
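
A minimal sketch of such a clock enable generator could look like the following. The entity name ce_gen, the generic COUNTER_BITS and the signal names are assumptions made for illustration, not part of the original design. For simplicity both enables pulse on the '0' to '1' transition of their counter bit, so the inversion applied to OUT2 in the question is not mirrored here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ce_gen is
  generic (
    COUNTER_BITS : integer range 2 to 64 := 24);  -- hypothetical width
  port (
    sysclk  : in  std_logic;
    reset_n : in  std_logic;                      -- synchronous, active low
    ce1     : out std_logic := '0';               -- pulse on rising MSB
    ce2     : out std_logic := '0');              -- pulse on rising MSB-1
end ce_gen;

architecture rtl of ce_gen is
  signal counter : unsigned(COUNTER_BITS - 1 downto 0) := (others => '0');
  signal msb_d   : std_logic := '0';  -- MSB delayed by one clock cycle
  signal msb1_d  : std_logic := '0';  -- MSB-1 delayed by one clock cycle
begin
  process (sysclk)
  begin
    if rising_edge(sysclk) then
      if reset_n = '0' then
        counter <= (others => '0');
        msb_d   <= '0';
        msb1_d  <= '0';
        ce1     <= '0';
        ce2     <= '0';
      else
        counter <= counter + 1;
        msb_d   <= counter(COUNTER_BITS - 1);
        msb1_d  <= counter(COUNTER_BITS - 2);
        -- Enable is high for exactly one sysclk cycle when the bit
        -- is '1' now and was '0' in the previous cycle.
        ce1 <= counter(COUNTER_BITS - 1) and not msb_d;
        ce2 <= counter(COUNTER_BITS - 2) and not msb1_d;
      end if;
    end if;
  end process;
end rtl;
```

Downstream logic then stays on sysclk and only updates its registers in cycles where ce1 (or ce2) is '1', so everything remains in a single clock domain.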

Just for clarification - what do you mean by it doesn't scale? — davidhood2, Dec 16 '15 at 13:25

score 2 · Answer 2 · edited May 23 '17 at 12:25

Morten already pointed out the theory in his answer. With the aid of two examples, I will demonstrate the problems you encounter when using a generated clock instead of clock enables.

Clock Distribution

At first, one must take care that a clock arrives at (almost) the same time at all destination flip-flops. Otherwise, even a simple shift register with 2 stages like this one would fail:

process(clk_gen)
begin
  if rising_edge(clk_gen) then
    tmp <= d;
    q   <= tmp;
  end if;
end process;

The intended behavior of this example is that q gets the value of d after two rising edges of the generated clock clk_gen. If the generated clock is not buffered by a global clock buffer, then the delay will be different for each destination flip-flop because the clock will be routed via the general-purpose routing. Thus, the behavior of the shift register can be described as follows with some explicit delays:

library ieee;
use ieee.std_logic_1164.all;
entity shift_reg is
  port (
    clk_gen : in  std_logic;
    d       : in  std_logic;
    q       : out std_logic);
end shift_reg;

architecture rtl of shift_reg is
  signal ff_0_q : std_logic := '0';  -- output of flip-flop 0
  signal ff_1_q : std_logic := '0';  -- output of flip-flop 1
  signal ff_0_c : std_logic;    -- clock input of flip-flop 0
  signal ff_1_c : std_logic;    -- clock input of flip-flop 1
begin  -- rtl

  -- different clock delay per flip-flop if general-purpose routing is used
  ff_0_c <= transport clk_gen after  500 ps;
  ff_1_c <= transport clk_gen after 1000 ps;

  -- two closely packed registers with clock-to-output delay of 100 ps
  ff_0_q <= d      after 100 ps when rising_edge(ff_0_c);
  ff_1_q <= ff_0_q after 100 ps when rising_edge(ff_1_c);

  q <= ff_1_q;
end rtl;

The following test bench just feeds in a '1' at input d, so that q should be '0' after one clock edge and '1' after two clock edges.

library ieee;
use ieee.std_logic_1164.all;

entity shift_reg_tb is
end shift_reg_tb;

architecture sim of shift_reg_tb is
  signal clk_gen : std_logic;
  signal d       : std_logic;
  signal q       : std_logic;
begin  -- sim
  DUT: entity work.shift_reg port map (clk_gen => clk_gen, d => d, q => q);

  WaveGen_Proc: process
  begin
    -- Note: registers inside DUT are initialized to zero
    d       <= '1';                     -- shift in '1'
    clk_gen <= '0';
    wait for 2 ns;
    clk_gen <= '1';                     -- just one rising edge
    wait for 2 ns;
    assert q = '0' report "Wrong output" severity error;
    wait;
  end process WaveGen_Proc;
end sim;

However, the simulation waveform shows that q already becomes '1' after the first clock edge (at 3.1 ns), which is not the intended behavior. That's because FF 1 already sees the new value from FF 0 when the clock arrives there.

This problem can be solved by distributing the generated clock via a clock tree which has a low skew. To access one of the clock trees of the FPGA, one must use a global clock buffer, e.g., BUFG on Xilinx FPGAs.
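
For comparison, here is a minimal sketch of the same two-stage shift register written with a clock enable instead of a generated clock, as recommended in Morten's answer. The entity name shift_reg_ce and the ports clk and ce are assumptions for illustration: clk stands for the original clock that is already on a clock tree, and ce is expected to be high for one clk cycle per intended shift.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity shift_reg_ce is
  port (
    clk : in  std_logic;   -- original clock (assumed name)
    ce  : in  std_logic;   -- clock enable, one clk cycle wide (assumed name)
    d   : in  std_logic;
    q   : out std_logic);
end shift_reg_ce;

architecture rtl of shift_reg_ce is
  signal tmp : std_logic := '0';  -- first stage
  signal q_i : std_logic := '0';  -- second stage
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Both stages are clocked by clk; they only shift when ce = '1'.
      if ce = '1' then
        tmp <= d;
        q_i <= tmp;
      end if;
    end if;
  end process;

  q <= q_i;
end rtl;
```

Because both flip-flops share the low-skew original clock, no extra clock buffer is needed for the slower rate.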

Data Handover

The second problem is the handover of multi-bit signals between two clock domains. Let's assume we have 2 registers with 2 bits each. Register 0 is clocked by the original clock and register 1 is clocked by the generated clock. The generated clock is already distributed by clock tree.

Register 1 just samples the output from register 0. But now, the different wire delays for both register bits in between play an important role. These have been modeled explicitly in the following design:

library ieee;
use ieee.std_logic_1164.all;
library unisim;
use unisim.vcomponents.all;

entity handover is
  port (
    clk_orig : in  std_logic;                      -- original clock
    d        : in  std_logic_vector(1 downto 0);   -- data input
    q        : out std_logic_vector(1 downto 0));  -- data output
end handover;

architecture rtl of handover is
  signal div_q   : std_logic := '0';    -- output of clock divider
  signal bufg_o  : std_logic := '0';    -- output of clock buffer
  signal clk_gen : std_logic;           -- generated clock
  signal reg_0_q : std_logic_vector(1 downto 0) := "00";  -- output of register 0
  signal reg_1_d : std_logic_vector(1 downto 0);     -- data input  of register 1
  signal reg_1_q : std_logic_vector(1 downto 0) := "00";  -- output of register 1
begin  -- rtl

  -- Generate a clock by dividing the original clock by 2.
  -- The 100 ps delay is the clock-to-output time of the flip-flop.
  div_q <= not div_q after 100 ps when rising_edge(clk_orig);

  -- Add global clock-buffer as well as mimic some delay.
  -- Clock arrives at (almost) same time on all destination flip-flops.
  clk_gen_bufg : BUFG port map (I => div_q, O => bufg_o);
  clk_gen <= transport bufg_o after 1000 ps;

  -- Sample data input with original clock
  reg_0_q <= d after 100 ps when rising_edge(clk_orig);

  -- Different wire delays between register 0 and register 1 for each bit
  reg_1_d(0) <= transport reg_0_q(0) after  500 ps;
  reg_1_d(1) <= transport reg_0_q(1) after 1500 ps;

  -- All flip-flops of register 1 are clocked at the same time due to clock buffer.
  reg_1_q <= reg_1_d after 100 ps when rising_edge(clk_gen);
  q <= reg_1_q;
end rtl;

Now, just feed in the new data value "11" via register 0 with this testbench:

library ieee;
use ieee.std_logic_1164.all;

entity handover_tb is
end handover_tb;

architecture sim of handover_tb is
  signal clk_orig : std_logic := '0';
  signal d        : std_logic_vector(1 downto 0);
  signal q        : std_logic_vector(1 downto 0);
begin  -- sim
  DUT: entity work.handover port map (clk_orig => clk_orig, d => d, q => q);

  WaveGen_Proc: process
  begin
    -- Note: registers inside DUT are initialized to zero
    d <= "11";
    clk_orig <= '0';
    for i in 0 to 7 loop                -- 4 clock periods
      wait for 2 ns;
      clk_orig <= not clk_orig;
    end loop;  -- i
    wait;
  end process WaveGen_Proc;
end sim;

As can be seen in the simulation output, the output of register 1 first takes the intermediate value "01" at the rising edge of the generated clock at around 3.1 ns, because the input of register 1 (reg_1_d) is still changing when that edge occurs. The intermediate value was not intended and can lead to undesired behavior. The correct value appears only after another rising edge of the generated clock.

To solve this issue, one can use:

special codes, where only one bit flips at a time, e.g., Gray code, or
cross-clock FIFOs, or
handshaking with the help of single control bits.
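
As an illustration of the first option, the following is a minimal sketch of a counter with a registered Gray-coded output. The entity name gray_counter and the generic WIDTH are assumptions for illustration. Note that Gray coding only helps for values that change by one step at a time (such as counters or FIFO pointers), not for arbitrary data like the "00" to "11" change in the example above, and the receiving clock domain would still sample the value through its own registers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity gray_counter is
  generic (
    WIDTH : positive := 2);  -- hypothetical width
  port (
    clk  : in  std_logic;
    gray : out std_logic_vector(WIDTH - 1 downto 0));
end gray_counter;

architecture rtl of gray_counter is
  signal bin    : unsigned(WIDTH - 1 downto 0) := (others => '0');
  signal gray_r : std_logic_vector(WIDTH - 1 downto 0) := (others => '0');
begin
  process (clk)
    variable nxt : unsigned(WIDTH - 1 downto 0);
  begin
    if rising_edge(clk) then
      nxt := bin + 1;
      bin <= nxt;
      -- Binary to Gray conversion: g = b xor (b >> 1).
      -- The result is registered, so exactly one output bit changes
      -- per clock edge and no combinational glitches leave the domain.
      gray_r <= std_logic_vector(nxt xor shift_right(nxt, 1));
    end if;
  end process;

  gray <= gray_r;
end rtl;
```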
