01signal: The art of Timing Closure

This page belongs to a series of pages about timing. The previous pages explained a few basic topics: The theory behind timing calculations and the clock period timing constraint. A few timing reports were also shown and explained. Now it's time to talk about one of the important purposes of this knowledge: Solving timing problems.

Introduction

The greatest struggle of the FPGA tools is to achieve the requirements of the timing constraints. This is hopefully successful, but sometimes it fails. And when it fails, it's us humans who carry the duty to find out why it failed, and to fix it. This task has a name: We call it the Timing Closure. And it's not an easy task.

Why is timing closure difficult? The thing is that the tools have a place and route algorithm that attempts to use the FPGA's resources in an optimal way. Usually, this algorithm begins with placing the logic elements on the FPGA without too much effort. Then an iterative process begins: The tools go through all the paths, and find the paths that fail to meet the timing constraints. To rectify these failures, corrective measures are taken on these paths. Most notably, the logic elements are moved to different positions on the FPGA, and the routing is adjusted. As for more advanced corrective measures, each FPGA tool has its own methods.

When all paths meet the timing constraints, the implementation is considered finished. But the implementation can also end because the tools failed to reach this goal, and consequently gave up the attempt. In this situation, we get what the tools have achieved when the efforts stopped. This result isn't necessarily optimal: There might be paths in the implementation that could have been improved, but the tools were busy fixing something else. And when that failed, the tools gave up without trying to fix other things. It's like the tools were saying: "There's no point wasting time fixing an implementation if it's going to fail anyway".

Our task as FPGA designers is to look at this sub-optimal result, and find the reason why the goals of the timing constraints weren't achieved.

The algorithms get better over time. When there's a common reason for a failure to achieve the timing constraints, the next version of the software will have a specific solution for that situation. So when the tools fail, there is usually a good reason.

So we look at what the tools achieved, and ask ourselves: Why did the tools fail? Did we request something that is impossible? Even more important, did we request something unnecessary? Maybe the obstacle that caused the tools to fail is something we don't even need. Or maybe the optimization algorithm didn't work so well? Sometimes it's just a matter of bad luck: The initial placement of the logic elements can be so bad that the subsequent attempts to improve the performance are doomed to fail.

No matter what the problem is, finding the reason for the failure is similar to the work of a detective who investigates a crime scene: The facts are in front of us, but the reason is often hidden. Most of these facts can be found in the timing reports, but the clues don't give themselves away easily. The question that one must always ask is what is wrong, unusual or abnormal in the timing report. Like the detective who tries to find the criminal, the goal is to find the detail that leads to the problem.

But in order to find what is abnormal, you must know what is normal: For example, what is a normal delay for a net with a certain fan-out? How many logic levels are normal in order to implement a certain logic function? The answers to questions of this sort are different from one FPGA to another. It's therefore necessary to gain experience by reading and understanding the timing reports, even when everything is fine. You must know what a timing report looks like when everything is OK in order to find the places where the timing report indicates a problem. In case you wonder why I went into the fine details in the previous pages, that's one of the reasons.

The Critical Path

When the tools fail to achieve the timing constraints, it means that there is at least one path that has a negative slack. The path with the most negative slack is called the Critical Path. This name reflects the usual strategy for timing closure: Focusing on the Critical Path is often the way to solve a timing problem. But I shall demonstrate below that this strategy can also be a waste of time.

When the timing constraints are achieved, the Critical Path is the path with the minimal slack. This path is often not interesting, because the tools don't attempt to improve paths with a positive slack. So if the worst path has a positive slack, it can be a coincidence that this path turned out to be the worst one.

But if the slack is positive and nearly zero (say, less than 0.2 ns), this can indicate that it was difficult to make this path achieve the timing constraints. Critical paths of this sort can be seen as warnings that these paths can cause trouble in the future (in particular when the FPGA becomes filled with more logic, so the tools' efforts are diverted to other paths).

The timing report usually contains a limited number of critical paths for each clock. The default of most FPGA tools is to display a few critical paths even if their slack is positive (i.e. when the timing constraints were achieved). This is the recommended setting.

An example of a critical path

I shall begin with an example of a Critical Path's analysis. For this example, the Verilog code is as follows:

   reg [24:0] calc, result;
   reg [11:0] x, y, z;

   always @(posedge clk)
     begin
        calc <= x * y + z;
        result <= calc;
     end

In this example, @clk's frequency is 250 MHz, and no PLL is used to generate this clock. Also suppose that @x, @y and @z are registers that are synchronous with @clk. The Verilog code that assigns values to these registers is not shown, because it's irrelevant.

The timing constraints were not achieved when attempting this code on Vivado. In the timing report this was the Critical Path:

Slack (VIOLATED) :        -0.239ns  (required time - arrival time)
  Source:                 x_reg[1]__0_replica_2/C
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Destination:            calc_reg[23]/D
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Path Group:             clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            4.000ns  (clk rise@4.000ns - clk rise@0.000ns)
  Data Path Delay:        4.180ns  (logic 1.642ns (39.282%)  route 2.538ns (60.718%))
  Logic Levels:           7  (CARRY8=4 LUT3=1 LUT4=1 LUT6=1)
  Clock Path Skew:        -0.087ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    3.176ns = ( 7.176 - 4.000 ) 
    Source Clock Delay      (SCD):    3.864ns
    Clock Pessimism Removal (CPR):    0.601ns
  Clock Uncertainty:      0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Total Input Jitter      (TIJ):    0.000ns
    Discrete Jitter          (DJ):    0.000ns
    Phase Error              (PE):    0.000ns
  Clock Net Delay (Source):      2.032ns (routing 0.396ns, distribution 1.636ns)
  Clock Net Delay (Destination): 1.748ns (routing 0.365ns, distribution 1.383ns)

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        0.000     0.000 r  
    AG12                                              0.000     0.000 r  clk (IN)
                         net (fo=0)                   0.000     0.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.738     0.738 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.105     0.843    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.049     0.892 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.839     1.731    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.101     1.832 r  clk_IBUF_BUFG_inst/O
    X2Y0 (CLOCK_ROOT)    net (fo=106, routed)         2.032     3.864    clk_IBUF_BUFG
    SLICE_X54Y54         FDRE                                         r  x_reg[1]__0_replica_2/C
  -------------------------------------------------------------------    -------------------
    SLICE_X54Y54         FDRE (Prop_HFF2_SLICEL_C_Q)
                                                      0.137     4.001 r  x_reg[1]__0_replica_2/Q
                         net (fo=21, routed)          0.371     4.372    x[1]_repN_2
    SLICE_X56Y53         LUT6 (Prop_E6LUT_SLICEL_I1_O)
                                                      0.219     4.591 r  calc[23]_i_101/O
                         net (fo=2, routed)           0.550     5.141    calc[23]_i_101_n_0
    SLICE_X54Y57         CARRY8 (Prop_CARRY8_SLICEL_DI[5]_CO[7])
                                                      0.228     5.369 r  calc_reg[23]_i_30/CO[7]
                         net (fo=1, routed)           0.030     5.399    calc_reg[23]_i_30_n_0
    SLICE_X54Y58         CARRY8 (Prop_CARRY8_SLICEL_CI_O[1])
                                                      0.163     5.562 r  calc_reg[23]_i_22/O[1]
                         net (fo=3, routed)           0.351     5.913    calc_reg[23]_i_22_n_14
    SLICE_X56Y57         LUT3 (Prop_C6LUT_SLICEL_I1_O)
                                                      0.146     6.059 r  calc[23]_i_26/O
                         net (fo=3, routed)           0.240     6.299    calc[23]_i_26_n_0
    SLICE_X55Y58         LUT4 (Prop_A6LUT_SLICEM_I0_O)
                                                      0.089     6.388 r  calc[23]_i_7/O
                         net (fo=1, routed)           0.407     6.795    calc[23]_i_7_n_0
    SLICE_X53Y57         CARRY8 (Prop_CARRY8_SLICEM_DI[2]_O[4])
                                                      0.308     7.103 r  calc_reg[23]_i_2/O[4]
                         net (fo=1, routed)           0.538     7.641    P[20]
    SLICE_X54Y56         CARRY8 (Prop_CARRY8_SLICEL_S[4]_O[7])
                                                      0.352     7.993 r  calc_reg[23]_i_1/O[7]
                         net (fo=1, routed)           0.051     8.044    P0_out[23]
    SLICE_X54Y56         FDRE                                         r  calc_reg[23]/D
  -------------------------------------------------------------------    -------------------

                         (clock clk rise edge)        4.000     4.000 r  
    AG12                                              0.000     4.000 r  clk (IN)
                         net (fo=0)                   0.000     4.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.515     4.515 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.066     4.581    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.034     4.615 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.722     5.337    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.091     5.428 r  clk_IBUF_BUFG_inst/O
    X2Y0 (CLOCK_ROOT)    net (fo=106, routed)         1.748     7.176    clk_IBUF_BUFG
    SLICE_X54Y56         FDRE                                         r  calc_reg[23]/C
                         clock pessimism              0.601     7.777    
                         clock uncertainty           -0.035     7.741    
    SLICE_X54Y56         FDRE (Setup_HFF_SLICEL_C_D)
                                                      0.063     7.804    calc_reg[23]
  -------------------------------------------------------------------
                         required time                          7.804    
                         arrival time                          -8.044    
  -------------------------------------------------------------------
                         slack                                 -0.239
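
As a sanity check, the header and the bottom lines of this report can be tied together by hand. The following is only a sketch of the arithmetic, using the numbers exactly as they are printed above. T denotes the 4.000 ns requirement, and the term called "setup entry" is the 0.063 ns on the FDRE setup line, taken with the sign that the report prints. These symbol names are mine, not Vivado's.

```latex
\begin{align*}
t_{\text{arrival}} &= \text{SCD} + t_{\text{data}} = 3.864 + 4.180 = 8.044\ \text{ns} \\
\text{skew} &= \text{DCD} - \text{SCD} + \text{CPR} = 3.176 - 3.864 + 0.601 = -0.087\ \text{ns} \\
\text{slack} &= T + \text{skew} - t_{\text{uncertainty}} + t_{\text{setup entry}} - t_{\text{data}} \\
             &= 4.000 - 0.087 - 0.035 + 0.063 - 4.180 = -0.239\ \text{ns}
\end{align*}
```

Subtracting the printed arrival time from the printed required time gives 7.804 - 8.044 = -0.240 ns. The difference of 0.001 ns from the reported slack is presumably due to rounding of the displayed values.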

The slack of this path was –0.239 ns, so it's a slight failure to meet the timing constraints. The first thing to examine is the beginning and end of this path: We look at the "Source" and "Destination" in the report's header, and find x_reg and calc_reg. So the source of the problem is clearly this part:

calc <= x * y + z;

This should not be a surprise, because it's the only meaningful operation in the Verilog code. In a real-life scenario, it's not as obvious what part of the logic caused the problem.

It's also evident from the timing report that there is a large number of logic levels: 7. The combinatorial path is too long. In other words, there is too much to be done between two clock edges of @clk.

But what is the actual reason for the failure? Maybe the fact that the routing is responsible for 61% of the delay? Recall that there's a rule of thumb that the routing delay is usually 40% of the total delay. So maybe we should try to make the FPGA tools perform better? That is however unlikely to be a successful solution, because the tools usually work hard before giving up the attempts to achieve the timing constraints of a path.

Trying to change the logic function between @x and @calc is equally futile: The multiplication is essential, so there's no way to put something simpler instead.

I shall present other possible approaches to solving a problem like this on the next page. But going through a list of techniques will not help in this case. This simple example demonstrates that sometimes we need to think like detectives.

There is no substitute for your brain

The first question to ask when reading a timing report is what's abnormal about it. In this example, the answer is that the combinatorial path consists of only slices: Virtually all FPGAs have designated arithmetic units (DSPs, ALUs, the names vary) which are used when a multiplication is requested. In fact, multiply and add is the most common function in such designated logic. So the simplest solution in most cases is to make the tools use a designated arithmetic unit. The timing report for this solution is shown at the bottom of this page.

But the real question we should ask is why slices were used instead of a designated arithmetic unit. The most common reason is that all of the FPGA's available arithmetic units are already used by some other part of the design. In this case, the necessary change may not be related to the critical path at all: Maybe there's a need to free a few arithmetic units by removing some amount of logic from the design. Another possibility could be to instruct the tools to allocate these arithmetic units differently among the different parts of the design.

In this example, slices were used instead of arithmetic units because I wanted that to happen: I deliberately turned off the usage of arithmetic units (with one of Vivado's synthesizer parameters: I set max_dsp to zero). But this doesn't make this example artificial. Sometimes incorrect parameters are used with the FPGA tools, leading to exactly this kind of situation. In fact, it's sometimes correct to deliberately not use arithmetic units, because they are more needed somewhere else in the design.
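
Regarding the possibility to allocate arithmetic units differently among parts of the design: With Vivado, one way to do this is a synthesis attribute in the Verilog code itself. The following is a minimal sketch, assuming Vivado's use_dsp attribute as described in its synthesis user guide. The module name @mac_dsp_hint and the port list are hypothetical, and other FPGA tools have their own attributes for this purpose, so check the documentation of your tools before relying on this.

```verilog
// Sketch only: use_dsp is a Vivado synthesis attribute. Other tools use
// different attribute names. The module name mac_dsp_hint is hypothetical.
module mac_dsp_hint(
   input             clk,
   input [11:0]      x, y, z,
   output reg [24:0] result
);
   // "yes" asks the synthesizer to map the arithmetic that drives this
   // register to a designated arithmetic unit. "no" asks it to use
   // slices instead, which keeps the unit free for another part of
   // the design.
   (* use_dsp = "yes" *) reg [24:0] calc;

   always @(posedge clk)
     begin
        calc <= x * y + z;
        result <= calc;
     end
endmodule
```

An attribute like this is a request, not a guarantee, so the utilization report and the timing report should still be checked afterwards.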

So the easy solution was to use designated arithmetic units. But what if we must use slices? Once again, the solution is indirect. Recall that the problematic part was:

calc <= x * y + z;

But note that this comes immediately afterwards:

result <= calc;

If @calc is used only in this line, and not anywhere else, it's possible to split the calculation into two stages. This technique is often called pipelining. So the Verilog code changes to this:

   reg [24:0] calc, result;
   reg [11:0] x, y, z, z_d;

   always @(posedge clk)
     begin
        z_d <= z;

        calc <= x * y;
        result <= calc + z_d;
     end

In this solution, @calc is given the result of the multiplication only. The value of @z is added to @calc only on the next stage. More precisely, the plus operation is between @calc and @z_d, because this operation happens one clock cycle later. So the value of @result is exactly like before.

It was easy to solve the problem this way because @result was just a delayed copy of @calc in the original Verilog code. In real life, we're usually not this lucky.
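
The claim that @result is exactly like before can be checked with a short simulation. The following is a sketch of such a check, not part of the original design: The module names @mac_orig, @mac_pipe and @tb are hypothetical wrappers around the two versions of the code shown above, and the random stimulus is just one arbitrary choice.

```verilog
// Hypothetical wrapper around the original code
module mac_orig(input clk, input [11:0] x, y, z,
                output reg [24:0] result);
   reg [24:0] calc;

   always @(posedge clk)
     begin
        calc <= x * y + z;
        result <= calc;
     end
endmodule

// Hypothetical wrapper around the pipelined code
module mac_pipe(input clk, input [11:0] x, y, z,
                output reg [24:0] result);
   reg [24:0] calc;
   reg [11:0] z_d;

   always @(posedge clk)
     begin
        z_d <= z;
        calc <= x * y;
        result <= calc + z_d;
     end
endmodule

// Testbench: feeds both modules with the same random inputs and
// compares their outputs on every clock cycle.
module tb;
   reg         clk = 0;
   reg [11:0]  x, y, z;
   wire [24:0] r_orig, r_pipe;
   integer     i, errors;

   mac_orig u_orig(.clk(clk), .x(x), .y(y), .z(z), .result(r_orig));
   mac_pipe u_pipe(.clk(clk), .x(x), .y(y), .z(z), .result(r_pipe));

   always #2 clk = ~clk; // 4 ns period, like the 250 MHz clock

   initial begin
      errors = 0;
      for (i = 0; i < 1000; i = i + 1) begin
         @(negedge clk);
         // Skip the first cycles, when the registers are still unknown
         if (i > 2 && r_orig !== r_pipe)
           errors = errors + 1;
         x = $random;
         y = $random;
         z = $random;
      end
      if (errors == 0)
        $display("PASS");
      else
        $display("FAIL: %0d mismatches", errors);
      $finish;
   end
endmodule
```

Both modules produce x * y + z two clock cycles after the inputs are applied, so the simulation is expected to report no mismatches. A check like this says nothing about timing, of course. It only confirms that the pipelined version didn't change the function.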

Note that the critical path involved only @calc and @x. @z wasn't even mentioned in the path. So the purpose of this manipulation is to reduce the burden of the arithmetic operation. Or more precisely, to reduce the number of logic levels.

Recall that the critical path is the worst path after the execution of an optimization algorithm. This algorithm doesn't ask about the reason for the problem. Rather, it tries to improve the paths that have a negative slack. So even though the solution to the problem requires a manipulation of @z, the path that is related to @z wasn't the critical path. This was just a coincidence, but this happens a lot.

The timing report with the critical path of this solution is also shown at the bottom of this page. It shows that the number of logic levels was reduced from 7 to 6. As a result, the data path delay was reduced by 0.715 ns, which was more than enough to meet the timing constraints.

The lesson to learn from this example is that the critical path isn't always the direct reason for the problem. It's still correct to ask why this path failed, but the solution can be somewhere else. Keep in mind that each FPGA tool has its own utilities that provide information which can help find the root cause of the problem. It's worth the time to explore these utilities and read their documentation.

Avoid the problem early rather than solving it later

Much of the work on timing closure can be avoided if the logic design is done correctly from the beginning. This requires a continuous awareness of the fact that a logic design is not software. The purpose of the Verilog code is not to produce the correct result during the simulation. What really matters is the output generated by the synthesizer from the Verilog code.

A good logic design starts with thinking through how the logic should best accomplish its purpose. This includes identifying the potential obstacles that are related to timing.

Less experienced FPGA designers often develop Verilog code by trial and error. The simulation is used to see if the logic works as expected, and corrections are made gradually until the output of the simulation is correct. The result can be logic that can't be used on hardware: The Verilog code has to be rewritten completely in order to achieve the timing constraints.

It's important to think ahead about the combinatorial paths that the Verilog code generates. The idea is to look at each register, and follow the combinatorial path to its end. Recall that a combinatorial path always begins at a register and ends at a register.

Let's look at this example:

reg [15:0] a, b;
wire [16:0] x, y;
reg [33:0] z;

assign x = a + 2;
assign y = b + 3;

always @(posedge clk)
  z <= x * y;

Regarding @a: When this register changes, the combinatorial path reaches @x as a first stage. But @x is not a register. @x is updated by virtue of a continuous assignment. Hence the path continues to @z. So the logic carries out two significant operations in this combinatorial path: An arithmetic addition and an arithmetic multiplication. Is this too much? Is there a need to split this into two clock cycles, by virtue of pipelining? That depends on the clock frequency and which FPGA is used.
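
If the answer is that this is too much, one possible way to split the path is to turn @x and @y into registers. The following is only a sketch of this idea: The module name @add_then_mult and the ports @a_in and @b_in are assumptions that were added to make the example self-contained, because the code that assigns values to @a and @b isn't shown above.

```verilog
// Hypothetical module, for illustration only
module add_then_mult(
   input             clk,
   input [15:0]      a_in, b_in,
   output reg [33:0] z
);
   reg [15:0] a, b;
   reg [16:0] x, y; // Registers now, not wires

   always @(posedge clk)
     begin
        a <= a_in;
        b <= b_in;

        // First stage: The paths from @a and @b end here
        x <= a + 2;
        y <= b + 3;

        // Second stage: The paths from @x and @y contain only the
        // multiplication
        z <= x * y;
     end
endmodule
```

With this change, each combinatorial path contains only one arithmetic operation. The price is that @z is updated one clock cycle later than in the original code, so the logic that uses @z must be adjusted accordingly.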

Another important factor is how difficult it is to make the combinatorial paths shorter. Sometimes a long combinatorial path is inevitable. But when it's easy to improve the timing of a path, do that even if there are much worse paths in the design: If there are a few problematic paths in a design, the tools can often achieve the timing constraints by concentrating the efforts on these paths. It helps a lot when it's easy to meet the timing requirements of the other paths.

So there are no rules for what is allowed or disallowed in order to obtain a design that achieves the timing constraints. It requires experience with FPGA design in order to make correct decisions in this field. Experience with the specific FPGA tools is also important. The only rule that is always correct is: When it's possible to improve the timing by a simple change in the Verilog code, do it. Don't be lazy, and make this change from the beginning. Always have timing in mind.

Writing logic that is fast

As already mentioned, the goal is to achieve short combinatorial paths between registers: The logic functions that calculate the next value of registers should be simple. In other words, a small number of logic levels should be required to implement these logic functions.

Our task as FPGA designers is to look at the Verilog code and assess how complicated the logic functions will be. This requires knowledge on how the synthesizer translates Verilog into logic elements (LUTs and other logic primitives). This knowledge is acquired through experience, which is partly gained by analyzing the timing reports. To make things even more difficult, this translation is different from one FPGA to another. So it's not an easy task to write Verilog code that results in fast logic.

If you're an FPGA novice, it's recommended to spend the time to look at the results of the implementation for the sake of learning. The timing report shows examples of how the logic design is broken down into simple logic elements. The FPGA tools also offer other utilities for viewing the low-level logic elements.

In addition, there are a few simple rules that can help:

Pipelining: Be generous with registers. When possible, split logic's tasks into small pieces, and insert a register after each step. FPGAs have a lot of flip-flops (often there is one flip-flop next to each LUT), so this insertion of registers will not increase the utilization level of the FPGA. The only reason to avoid pipelining is when it complicates the design too much.
When if-then-else is used, avoid chaining many "else" clauses. When a "case" statement can be used instead, it's usually better. An "else" clause often requires a logic function that ensures that everything before this clause is false. Multiple "else" clauses can therefore require several logic levels.
Avoid unnecessary resets. In particular, a synchronous reset adds a small amount of complexity to the logic function. Both types of resets add to the difficulty of routing, because these are signals that must reach many logic elements. There is a separate series of pages on this topic.
Don't create huge state machines. There is no hard limit on how many states are OK. But if the number of states goes beyond 20, you should consider restructuring your design. Also, be sure that the synthesizer uses one-hot encoding for large state machines (by default, most synthesizers do). This helps to generate fast logic.
An extra register after a RAM is usually better. Let's look at this example of a RAM that is created implicitly:
```
reg [7:0] array[0:127];
reg [7:0] val;
reg [6:0] addr;

always @(posedge clk)
  val <= array[addr];
```
This Verilog code is correct, but note that @val is the synchronous output of the RAM. So when there is a rising clock edge, the operation of the RAM begins, and @val is updated only when the value from the array has been obtained. Hence @val has a relatively large clock-to-output delay (compared with a flip-flop). Accordingly, there is an inherent disadvantage to paths that begin at @val. This can be fixed by adding an extra register:
```
reg [7:0] array[0:127];
reg [7:0] val_d, mem_out;
reg [6:0] addr;

always @(posedge clk)
  begin
    mem_out <= array[addr];
    val_d <= mem_out;
  end
```
Note that this is not functionally equivalent: @mem_out is the output of the RAM here. Only one clock later is this output copied into @val_d, so it's not an exact replacement for @val. But @val_d is a real register, with a low clock-to-output delay. In many FPGAs, this extra register is part of the block RAMs, so there is no waste of flip-flops. Did I mention to never worry about wasting flip-flops?
Unfortunately, adding such a register often complicates the design considerably. When this is the case, it's better not to add this extra register, but rather attempt to keep the combinatorial path from @val short.
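
To illustrate the rule above about chained "else" clauses, here is the same selection written in both ways. This is a minimal sketch with a hypothetical module name and hypothetical signal names. In an example this small, the synthesizer may well produce the same logic for both versions. The difference tends to matter when the chain is long or the conditions are complicated.

```verilog
// Hypothetical module, for illustration only
module sel_example(
   input            clk,
   input [1:0]      sel,
   input [7:0]      a, b, c, d,
   output reg [7:0] q_chain, q_case
);
   // Chained "else" clauses: Each branch implies that all the
   // conditions before it are false.
   always @(posedge clk)
     if (sel == 2'd0)
       q_chain <= a;
     else if (sel == 2'd1)
       q_chain <= b;
     else if (sel == 2'd2)
       q_chain <= c;
     else
       q_chain <= d;

   // The same selection with a "case" statement: The branches are
   // mutually exclusive by construction.
   always @(posedge clk)
     case (sel)
       2'd0: q_case <= a;
       2'd1: q_case <= b;
       2'd2: q_case <= c;
       default: q_case <= d;
     endcase
endmodule
```

The way to know what the synthesizer actually did with each version is to look at the number of logic levels in the timing report.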

Two extra timing reports

I've promised two timing reports in the section called "There is no substitute for your brain" above. I've put them here, and not where they're mentioned, because they are long and not fully relevant.

Note that each of these two timing reports is the critical path for the relevant scenario. Hence the path doesn't start and end at the same registers as the path that is shown above.

The first timing report relates to the first Verilog code example. Unlike the timing report above, the tools were allowed to use designated arithmetic units. The result is that the timing constraints were easily achieved.

This timing report was generated for a Kintex UltraScale FPGA. In this family of FPGAs, a designated arithmetic unit is called a DSP48E2. Note that the path starts and ends on the same DSP48E2 unit. The logic delay is therefore 100%.

Slack (MET) :             1.406ns  (required time - arrival time)
  Source:                 calc_reg/DSP_A_B_DATA_INST/CLK
                            (rising edge-triggered cell DSP_A_B_DATA clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Destination:            calc_reg/DSP_OUTPUT_INST/ALU_OUT[10]
                            (rising edge-triggered cell DSP_OUTPUT clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Path Group:             clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            4.000ns  (clk rise@4.000ns - clk rise@0.000ns)
  Data Path Delay:        2.445ns  (logic 2.445ns (100.000%)  route 0.000ns (0.000%))
  Logic Levels:           4  (DSP_ALU=1 DSP_M_DATA=1 DSP_MULTIPLIER=1 DSP_PREADD_DATA=1)
  Clock Path Skew:        -0.010ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    3.392ns = ( 7.392 - 4.000 ) 
    Source Clock Delay      (SCD):    4.096ns
    Clock Pessimism Removal (CPR):    0.694ns
  Clock Uncertainty:      0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Total Input Jitter      (TIJ):    0.000ns
    Discrete Jitter          (DJ):    0.000ns
    Phase Error              (PE):    0.000ns
  Clock Net Delay (Source):      2.264ns (routing 0.756ns, distribution 1.508ns)
  Clock Net Delay (Destination): 1.964ns (routing 0.696ns, distribution 1.268ns)

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        0.000     0.000 r  
    AG12                                              0.000     0.000 r  clk (IN)
                         net (fo=0)                   0.000     0.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.738     0.738 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.105     0.843    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.049     0.892 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.839     1.731    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.101     1.832 r  clk_IBUF_BUFG_inst/O
    X2Y1 (CLOCK_ROOT)    net (fo=80, routed)          2.264     4.096    calc_reg/CLK
    DSP48E2_X11Y34       DSP_A_B_DATA                                 r  calc_reg/DSP_A_B_DATA_INST/CLK
  -------------------------------------------------------------------    -------------------
    DSP48E2_X11Y34       DSP_A_B_DATA (Prop_DSP_A_B_DATA_DSP48E2_CLK_A2_DATA[9])
                                                      0.302     4.398 r  calc_reg/DSP_A_B_DATA_INST/A2_DATA[9]
                         net (fo=1, routed)           0.000     4.398    calc_reg/DSP_A_B_DATA.A2_DATA<9>
    DSP48E2_X11Y34       DSP_PREADD_DATA (Prop_DSP_PREADD_DATA_DSP48E2_A2_DATA[9]_A2A1[9])
                                                      0.182     4.580 r  calc_reg/DSP_PREADD_DATA_INST/A2A1[9]
                         net (fo=1, routed)           0.000     4.580    calc_reg/DSP_PREADD_DATA.A2A1<9>
    DSP48E2_X11Y34       DSP_MULTIPLIER (Prop_DSP_MULTIPLIER_DSP48E2_A2A1[9]_U[10])
                                                      0.994     5.574 f  calc_reg/DSP_MULTIPLIER_INST/U[10]
                         net (fo=1, routed)           0.000     5.574    calc_reg/DSP_MULTIPLIER.U<10>
    DSP48E2_X11Y34       DSP_M_DATA (Prop_DSP_M_DATA_DSP48E2_U[10]_U_DATA[10])
                                                      0.164     5.738 r  calc_reg/DSP_M_DATA_INST/U_DATA[10]
                         net (fo=1, routed)           0.000     5.738    calc_reg/DSP_M_DATA.U_DATA<10>
    DSP48E2_X11Y34       DSP_ALU (Prop_DSP_ALU_DSP48E2_U_DATA[10]_ALU_OUT[10])
                                                      0.803     6.541 r  calc_reg/DSP_ALU_INST/ALU_OUT[10]
                         net (fo=1, routed)           0.000     6.541    calc_reg/DSP_ALU.ALU_OUT<10>
    DSP48E2_X11Y34       DSP_OUTPUT                                   r  calc_reg/DSP_OUTPUT_INST/ALU_OUT[10]
  -------------------------------------------------------------------    -------------------

                         (clock clk rise edge)        4.000     4.000 r  
    AG12                                              0.000     4.000 r  clk (IN)
                         net (fo=0)                   0.000     4.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.515     4.515 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.066     4.581    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.034     4.615 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.722     5.337    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.091     5.428 r  clk_IBUF_BUFG_inst/O
    X2Y1 (CLOCK_ROOT)    net (fo=80, routed)          1.964     7.392    calc_reg/CLK
    DSP48E2_X11Y34       DSP_OUTPUT                                   r  calc_reg/DSP_OUTPUT_INST/CLK
                         clock pessimism              0.694     8.086    
                         clock uncertainty           -0.035     8.050    
    DSP48E2_X11Y34       DSP_OUTPUT (Setup_DSP_OUTPUT_DSP48E2_CLK_ALU_OUT[10])
                                                     -0.104     7.946    calc_reg/DSP_OUTPUT_INST
  -------------------------------------------------------------------
                         required time                          7.946    
                         arrival time                          -6.541    
  -------------------------------------------------------------------
                         slack                                  1.406

The second timing report relates to the second example of Verilog code. In this example, the situation has been improved by virtue of pipelining:

Slack (MET) :             0.433ns  (required time - arrival time)
  Source:                 y_reg[1]__0/C
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Destination:            calc_reg[23]/D
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Path Group:             clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            4.000ns  (clk rise@4.000ns - clk rise@0.000ns)
  Data Path Delay:        3.465ns  (logic 1.653ns (47.706%)  route 1.812ns (52.294%))
  Logic Levels:           6  (CARRY8=4 LUT4=2)
  Clock Path Skew:        -0.129ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    3.373ns = ( 7.373 - 4.000 ) 
    Source Clock Delay      (SCD):    4.040ns
    Clock Pessimism Removal (CPR):    0.538ns
  Clock Uncertainty:      0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Total Input Jitter      (TIJ):    0.000ns
    Discrete Jitter          (DJ):    0.000ns
    Phase Error              (PE):    0.000ns
  Clock Net Delay (Source):      2.208ns (routing 0.756ns, distribution 1.452ns)
  Clock Net Delay (Destination): 1.945ns (routing 0.696ns, distribution 1.249ns)

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        0.000     0.000 r  
    AG12                                              0.000     0.000 r  clk (IN)
                         net (fo=0)                   0.000     0.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.738     0.738 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.105     0.843    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.049     0.892 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.839     1.731    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.101     1.832 r  clk_IBUF_BUFG_inst/O
    X2Y1 (CLOCK_ROOT)    net (fo=121, routed)         2.208     4.040    clk_IBUF_BUFG
    SLICE_X54Y88         FDRE                                         r  y_reg[1]__0/C
  -------------------------------------------------------------------    -------------------
    SLICE_X54Y88         FDRE (Prop_EFF2_SLICEL_C_Q)
                                                      0.138     4.178 r  y_reg[1]__0/Q
                         net (fo=25, routed)          0.505     4.683    y[1]
    SLICE_X53Y91         LUT4 (Prop_B6LUT_SLICEM_I0_O)
                                                      0.150     4.833 r  calc[7]_i_28/O
                         net (fo=1, routed)           0.344     5.177    calc[7]_i_28_n_0
    SLICE_X53Y89         CARRY8 (Prop_CARRY8_SLICEM_DI[2]_CO[7])
                                                      0.424     5.601 r  calc_reg[7]_i_9/CO[7]
                         net (fo=1, routed)           0.043     5.644    calc_reg[7]_i_9_n_0
    SLICE_X53Y90         CARRY8 (Prop_CARRY8_SLICEM_CI_O[0])
                                                      0.122     5.766 r  calc_reg[23]_i_30/O[0]
                         net (fo=3, routed)           0.402     6.168    calc_reg[23]_i_30_n_15
    SLICE_X51Y88         LUT4 (Prop_C5LUT_SLICEL_I0_O)
                                                      0.169     6.337 r  calc[15]_i_8/O
                         net (fo=1, routed)           0.437     6.774    calc[15]_i_8_n_0
    SLICE_X51Y92         CARRY8 (Prop_CARRY8_SLICEL_DI[1]_CO[7])
                                                      0.422     7.196 r  calc_reg[15]_i_1/CO[7]
                         net (fo=1, routed)           0.030     7.226    calc_reg[15]_i_1_n_0
    SLICE_X51Y93         CARRY8 (Prop_CARRY8_SLICEL_CI_O[7])
                                                      0.228     7.454 r  calc_reg[23]_i_1/O[7]
                         net (fo=1, routed)           0.051     7.505    calc_reg[23]_i_1_n_8
    SLICE_X51Y93         FDRE                                         r  calc_reg[23]/D
  -------------------------------------------------------------------    -------------------

                         (clock clk rise edge)        4.000     4.000 r  
    AG12                                              0.000     4.000 r  clk (IN)
                         net (fo=0)                   0.000     4.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.515     4.515 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.066     4.581    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.034     4.615 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.722     5.337    clk_IBUF
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.091     5.428 r  clk_IBUF_BUFG_inst/O
    X2Y1 (CLOCK_ROOT)    net (fo=121, routed)         1.945     7.373    clk_IBUF_BUFG
    SLICE_X51Y93         FDRE                                         r  calc_reg[23]/C
                         clock pessimism              0.538     7.910    
                         clock uncertainty           -0.035     7.875    
    SLICE_X51Y93         FDRE (Setup_HFF_SLICEL_C_D)
                                                      0.063     7.938    calc_reg[23]
  -------------------------------------------------------------------
                         required time                          7.938    
                         arrival time                          -7.505    
  -------------------------------------------------------------------
                         slack                                  0.433

The improvement isn't as dramatic as with the dedicated arithmetic unit: The slack is 0.433 ns here, compared with 1.406 ns in the first report, where the DSP48E2 was used. But it's still good enough to meet the timing constraint.
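The figures in the second timing report can be cross-checked with a few lines of arithmetic. The following is only a sketch: the numbers are copied by hand from the report above, the variable names are made up for this illustration, and the formulas are the ones that the report itself prints (for the clock path skew and the clock uncertainty) or implies (for the slack). Because the report shows values rounded to three decimals, the results are expected to agree with it only to within about 0.001 ns.

```python
import math

# Values copied from the second timing report, all in ns.
requirement = 4.000         # clock period (clk rise@4.000ns - clk rise@0.000ns)
source_clock_delay = 4.040  # SCD: clock edge reaching y_reg[1]__0/C
dest_clock_arrival = 7.373  # next clock edge reaching calc_reg[23]/C
data_path_delay = 3.465     # logic 1.653 + route 1.812
clock_pessimism = 0.538     # CPR
setup_adjust = 0.063        # Setup_HFF_SLICEL_C_D row, added as printed
tsj, tij, dj, pe = 0.071, 0.000, 0.000, 0.000

# Clock Uncertainty: ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
clock_uncertainty = (math.sqrt(tsj**2 + tij**2) + dj) / 2 + pe

# Clock Path Skew: DCD - SCD + CPR, where DCD = 7.373 - 4.000
dest_clock_delay = dest_clock_arrival - requirement
skew = dest_clock_delay - source_clock_delay + clock_pessimism

# Arrival time: the data leaves the source flip-flop at SCD and then
# travels through the data path.
arrival_time = source_clock_delay + data_path_delay

# Required time: the destination clock edge, corrected by the pessimism
# removal, the clock uncertainty and the setup row of the report.
required_time = (dest_clock_arrival + clock_pessimism
                 - clock_uncertainty + setup_adjust)

slack = required_time - arrival_time

print(f"clock uncertainty: {clock_uncertainty:.4f} ns")
print(f"clock path skew:   {skew:.4f} ns")
print(f"arrival time:      {arrival_time:.4f} ns")
print(f"required time:     {required_time:.4f} ns")
print(f"slack:             {slack:.4f} ns")
```

The same check on the first report is shorter, because only its last rows are needed: 7.946 - 6.541 gives 1.405, while the report prints 1.406. This is the same kind of rounding difference: The tools calculate with more digits than they print.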

This concludes the general discussion about timing closure. The next page suggests several practical strategies on this topic.