The handling of clocks within your chip and within your simulation environment requires the same type of care you take in doing a board design. In particular, clock skew must always be smaller than the smallest propagation delay.
In an RTL simulation environment, all signal delays are measured in delta cycles (the default delay for any signal assignment that does not use an after clause). Going through a port does not incur any delta cycles. However, every assignment to a signal causes a delta cycle delay.
One method to ensure successful data transfer is to keep all clocks in the design delta cycle aligned where they are used. The simplest way to do this is to make sure that no block assigns to the clock it uses. Hence, do not do any of the following:
LocalClk <= PortClk ; -- each assignment causes a delta cycle of clock skew
GatedClk <= Clk and Enable ; -- clock gates are bad. See alternative below
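To make the delta cycle skew concrete, here is a minimal simulation sketch. The entity and signal names (DeltaSkewDemo, Reg1, Reg2Good, Reg2Bad) are hypothetical and chosen only for illustration. It builds a two-stage shift register twice: once with both stages on Clk, and once with the second stage on a clock that has gone through a signal assignment.

```vhdl
library ieee ;
use ieee.std_logic_1164.all ;

-- Hypothetical demo: all names here are illustrative assumptions
entity DeltaSkewDemo is
end entity DeltaSkewDemo ;

architecture Sim of DeltaSkewDemo is
  signal Clk      : std_logic := '0' ;
  signal LocalClk : std_logic := '0' ;
  signal D        : std_logic := '1' ;  -- constant input to the first stage
  signal Reg1     : std_logic := '0' ;
  signal Reg2Good : std_logic := '0' ;
  signal Reg2Bad  : std_logic := '0' ;
begin
  -- Clock with rising edges at 5, 15, 25 and 35 ns, then stops
  ClkProc : process
  begin
    for i in 1 to 4 loop
      Clk <= '0' ; wait for 5 ns ;
      Clk <= '1' ; wait for 5 ns ;
    end loop ;
    wait ;
  end process ;

  -- The assignment that causes the problem: LocalClk lags Clk by one delta cycle
  LocalClk <= Clk ;

  -- First stage, and a second stage on the same clock
  process (Clk)
  begin
    if rising_edge(Clk) then
      Reg1     <= D ;
      Reg2Good <= Reg1 ;  -- samples the old value of Reg1
    end if ;
  end process ;

  -- Second stage on the skewed clock
  process (LocalClk)
  begin
    if rising_edge(LocalClk) then
      Reg2Bad <= Reg1 ;  -- Reg1 has already updated in this delta cycle
    end if ;
  end process ;

  -- Report the register values just after the first rising edge (5 ns)
  Check : process
  begin
    wait for 6 ns ;
    report "Reg1=" & std_logic'image(Reg1) &
           " Reg2Good=" & std_logic'image(Reg2Good) &
           " Reg2Bad=" & std_logic'image(Reg2Bad) ;
    wait ;
  end process ;
end architecture Sim ;
```

After the first rising edge, Reg1 is '1' and Reg2Good is still '0', as expected for a shift register. Reg2Bad is already '1': LocalClk rises in the same delta cycle in which Reg1 takes its new value, so the second stage captures the new data one clock early.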
Generally, we rarely use clock gates, and then only when they are an approved part of our methodology (usually not for FPGAs). Instead of gating clocks in your design, use data path enables:
process (Clk)
begin
  if rising_edge(Clk) then
    if Enable = '1' then
      Q <= D ;
    end if ;
  end if ;
end process ;
Other methodologies for keeping clocks delta cycle aligned also exist.