01signal.com

Metastability and the basics of crossing clock domains

This page is the second of three in a series on clock domains.

Scope

As already mentioned in the previous page, if a path crosses clock domains of related clocks, no special treatment is required by the logic. The design tools, on the other hand, must be made to ensure that the synchronous element at the destination samples the data signal with legal setup and hold times. Or simply put, timing constraints must be applied. So this case is handled as if there were no clock domain crossing at all, in the sense that the path is timed.

However, if the clocks are unrelated, there must be a resynchronization mechanism implemented in logic: Some piece of Verilog code must be dedicated to solving this issue. This page goes through the basics of this mechanism.

Use a FIFO if you can

It's only fair to begin with how to avoid this headache altogether, for the benefit of those who have the possibility to escape. Namely:

Dual-clock FIFOs, which are generated by the FPGA vendor's tools, are definitely the safest way to cross clock domains. This is the most common solution, and the fact that it's so widely used is by itself a reason to trust it.

In particular, if you're new to FPGAs, there's a good reason to prefer a FIFO, even if you'll need to bend your own design a bit for that purpose. Using a FIFO might consume slightly more resources than hand-written logic. Note however that a block RAM is usually not required if the FIFO is shallow. So the difference in resources isn't necessarily worth the risk of messing up.

Often the FIFO is the natural solution for crossing clock domains, in particular when there's a stream of data going from one functional unit to another. But even when it's not all that natural, it's often possible to reorganize the logic somewhat to make it fit. For example, if logic in one clock domain needs to notify logic in another clock domain about some event, this can be done by encoding a message word and pushing it into a shallow FIFO. A solution of this sort isn't only less bug-prone, but it's also likely to result in a more organized and readable design.
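As a rough sketch of the event notification idea, the write side could look something like the module below. Everything in it is made up for illustration: the module name, the two event inputs, the 8-bit message codes and the fifo_* port names, which merely mimic the write interface that vendor-generated dual-clock FIFOs typically have. It also assumes that the FIFO's full flag may be used combinationally to gate the write enable, which should be checked against the documentation of the FIFO in use.

```verilog
// Illustration only: All names and the message encoding are assumptions,
// not part of any vendor's FIFO interface. All signals are synchronous
// to the source domain's clock, which is also the FIFO's write clock.
module event_notifier
  (
   input        event_a,    // High for one clock cycle when event A occurs
   input        event_b,    // High for one clock cycle when event B occurs
   input        fifo_full,  // The FIFO's full flag
   output       fifo_wr_en, // The FIFO's write enable
   output [7:0] fifo_din    // The FIFO's data input: The message word
   );

   localparam [7:0] MSG_EVENT_A = 8'd1;
   localparam [7:0] MSG_EVENT_B = 8'd2;

   // Write one word per event. An event that occurs while the FIFO is
   // full is lost, and so is event B if it coincides with event A.
   assign fifo_wr_en = (event_a || event_b) && !fifo_full;
   assign fifo_din = event_a ? MSG_EVENT_A : MSG_EVENT_B;
endmodule
```

On the other side, logic in the destination clock domain reads a word whenever the FIFO isn't empty, and acts according to its value. The clock domain crossing itself is left entirely to the FIFO.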

How to connect and use FIFOs is discussed extensively in this series of pages.

Resynchronizing a single bit signal

If a FIFO doesn't solve your problem, it's time to roll up our sleeves and implement the resynchronization logic for a safe clock crossing. The rest of this page is limited to crossing clock domains with a signal that is a single bit wide, as this is the cornerstone for any resynchronization logic. And there's plenty to say about this seemingly simple case.

The next page in this series discusses how to cross clock domains with wider signals, based upon the metastability guard technique that is presented below.

On metastability

Before moving on to solutions, it's important to understand what happens when the timing requirements of a flip-flop are violated. In other words, when the signal at its data input isn't stable during the time span that starts tsu before the clock edge and ends thold after it. The clock edge in question is of course the one at the flip-flop's clock input, which causes it to sample the data input.

Clearly, if the data changes during this time span, the output of the flip-flop after this clock edge is unpredictable. But it's even worse than that: It may take the flip-flop significantly longer than its usual clock-to-output time to present a legal '0' or '1' at its output. Even though the electronic design of a flip-flop is made to hold it stable on one of these two possible outputs, '0' or '1', the violation of the timing requirements can make it stay in an unstable condition for a short while. Or, to use the official term, the flip-flop is metastable.

I won't skip the cliché image with the ball standing at the tip of a hill, which is often used to depict the unstable situation that the flip-flop can reach:

[Figure: Ball on tip of hill, illustrating metastability]

Metastability is a much bigger deal than just the uncertainty of what the flip-flop eventually lands on, in particular if the flip-flop's output drives more than one destination: Because landing on either a '0' or '1' takes more time than usual, it might arrive too late at the destination flip-flops. As a result, some flip-flops may sample the wobbling signal as '0' and others as '1', possibly bringing the logic to an otherwise impossible state, and black magic behavior is just around the corner. So a possibly metastable flip-flop should never drive more than one logic element.

To tackle metastability, it would have been nice to know how long it takes for the flip-flop to fall into one of its stable states. Unfortunately, there's no definite answer to that. It's exactly like asking how long the ball will stay at the top of the hill: It depends on a lot of factors, and odds are that random vibrations will eventually make it fall down this way or the other. Same with a flip-flop's metastability: It will be brought out of this state by virtue of random noise in the electronic circuits, or whatever else.

So in theory, a flip-flop can remain in a metastable condition forever, but in reality it will fall into one of its stable states after a short while. The time it stays in the metastable condition is a random variable, and plenty of experiments and simulations have been made in the attempt to estimate it. However, none of these gives a figure that can be relied upon in general, since the behavior depends on the silicon manufacturing technique, temperature, the noise level from crosstalk signals and whatnot.

So once again, even though it would have been convenient to have a specified maximal time that the flip-flop never exceeds in the metastable state, there is no such figure. It's not even possible to come up with a general ballpark figure, because faster silicon manufacturing processes make flip-flops that tend to leave the metastable state faster.

Instead, here's the idea to get used to: When you cross clock domains with unrelated clocks, there's always a chance that some flip-flop will remain metastable longer than the design was made to tolerate, and something will go wrong as a result of that. The only thing we can do as designers is to reduce this chance, in order to achieve an MTBF (Mean Time Between Failures) one can live with.

Luckily, there's a well-established technique to achieve exactly that, which brings us to the next topic.

The metastability guard

To make a long story short, let's revisit the first code example from the previous page. Assuming that @clk1 and @clk2 are unrelated clocks, the common resynchronization logic for obtaining a stable @bar from @foo is

reg foo, bar, bar_metaguard;

always @(posedge clk1)
  foo <= !foo;

always @(posedge clk2)
  begin
    bar_metaguard <= foo;
    bar <= bar_metaguard;
  end

As its name implies, @bar_metaguard is a metastability guard. The idea is that the timing requirements of the flip-flop that implements @bar_metaguard are occasionally violated as it samples @foo, so it may become metastable. As it's expected to recover from this quickly, it will be stable soon enough to meet the setup timing requirement for @bar. As a result, @bar can be used reliably inside @clk2's domain.

If this explanation sounds fluffy, it's indeed so. It actually contradicts what I wrote about metastability above. I'll come back to that further below. But for now, let's stick to the common practice, which is to add the metastability guard as shown above, and not worry about it anymore. And truth be told, I've never heard of anyone having problems with this.

Those who want to be extra safe add more guard registers. So a double metastability guard goes something like this:

reg foo, bar, bar_metaguard_a, bar_metaguard_b;

always @(posedge clk1)
  foo <= !foo;

always @(posedge clk2)
  begin
    bar_metaguard_a <= foo;
    bar_metaguard_b <= bar_metaguard_a;
    bar <= bar_metaguard_b;
  end

@bar_metaguard_a is the first metastability guard. If we're unlucky, it remains metastable too long, and @bar_metaguard_b's timing requirements are violated, and the latter may become metastable on the next clock cycle. But this time the metastability phase is brief, we hope: Lightning never strikes in one place twice, they say. But of course @bar_metaguard_b can also remain metastable so as to violate @bar's timing. But what are the chances? Remember that the goal is to obtain a reasonable MTBF.
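Since the two examples above differ only in the number of guard registers, they can be folded into one parameterized module. The following is a minimal sketch of that idea, not a vendor primitive: the module name, the port names and the GUARDS parameter are all made up for this example. The guard registers keep the _metaguard naming of the examples above, and the output register is deliberately named differently, because it isn't a guard.

```verilog
// Sketch only: The module name, port names and GUARDS are assumptions.
// GUARDS = 1 corresponds to the single metastability guard above, and
// GUARDS = 2 to the double one.
module metaguard_sync #(parameter GUARDS = 1) // Must be 1 or larger
  (
   input      clk2,     // Destination domain's clock
   input      async_in, // From the unrelated clock domain (e.g. foo)
   output reg sync_out  // For use in clk2's domain (e.g. bar)
   );

   reg [GUARDS-1:0] sync_metaguard; // The chain of guard registers
   integer          i;

   always @(posedge clk2)
     begin
        // The first guard samples the asynchronous input, so this is
        // the register that is most likely to become metastable
        sync_metaguard[0] <= async_in;

        // Each additional guard samples the one before it
        for (i = 1; i < GUARDS; i = i + 1)
          sync_metaguard[i] <= sync_metaguard[i-1];

        // The output register samples the last guard
        sync_out <= sync_metaguard[GUARDS-1];
     end
endmodule
```

With this module, the double metastability guard would be instantiated with something like metaguard_sync #(.GUARDS(2)) foo_sync (.clk2(clk2), .async_in(foo), .sync_out(bar));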

To summarize: If you're looking for the cookbook answer on how to cross a clock domain with a single-bit signal, this is it. One or two metastability guards will do the job nicely. If you're looking for a failproof solution, unfortunately there is none. And if you want to reduce the probability for mishaps to the extent possible, read on.

Timing analysis of metastability

So now let's go back to the single metastability guard example above, and ask why I was so sure that @bar_metaguard recovers from metastability quickly enough. The answer, as already mentioned, is that there's no reason to be sure.

Nevertheless, let's make a bit of a timing analysis. Both @bar_metaguard and @bar are clocked by the same clock. Assuming that no special timing constraints have been assigned to these specifically, the tools will apply the clock period timing constraint for @clk2 on the path between them. In other words, the tools ensure that in the absence of metastability, @bar samples its input signal with legal timing.

But the whole point with @bar_metaguard was that it can get metastable every now and then. So the timing calculation is insufficient: Effectively, the time that the flip-flop spends in the metastable state is an addition to its clock-to-output time. In other words, metastability means that the output of the flip-flop becomes stable later than its specified delay. So if the path between @bar_metaguard and @bar has a near-zero slack (i.e. its propagation delay is good enough to meet timing, but not by much), metastability may very well bring the total path delay above the allowed limit, and cause a timing violation on @bar's input. Of course this applies to the worst case in temperature, voltages and manufacturing process, and yet: The fact that the tools apply the regular timing constraint on the metastability guard's output means that it doesn't ensure the necessary timing requirements.
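The reasoning above can be summed up in a rough inequality. The symbols are shorthand chosen for this sketch, not figures from any datasheet: T_clk2 is the period of @clk2, t_co is @bar_metaguard's regular clock-to-output time, t_path is the propagation delay from @bar_metaguard to @bar, t_su is the setup time of @bar's flip-flop, and t_meta is the extra time that @bar_metaguard spends in the metastable state. Clock skew and jitter are ignored for simplicity.

```latex
\begin{align}
  % Condition for @bar to sample a stable value
  t_{co} + t_{meta} + t_{path} + t_{su} &\le T_{clk2} \\
  % Hence the time in the metastable state that can be tolerated
  t_{meta} &\le T_{clk2} - \left( t_{co} + t_{path} + t_{su} \right)
\end{align}
```

The right-hand side of the second line is the slack of this path in the absence of metastability. This is why a near-zero slack leaves almost no room for the flip-flop to recover.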

Luckily, this is easy to fix. Or improve, I should say. For example, if all metastability guard registers have the *_metaguard suffix, a single timing constraint can be written to require that all paths from them get some extra time for stabilizing. For Vivado, such a constraint would say:

set_max_delay -from [ get_cells -hier -filter {name=~*_metaguard*} ] 0.75

This simply means that any path that starts at a metastability guard (pinpointed by its suffix) has 0.75 ns to reach its destination. This is the minimum figure I managed to get without failing the timing constraint on a specific FPGA I tried this on. YMMV. The way to reach this figure is to pick a lower number until timing fails on this constraint, and then increase it as necessary to achieve a very small positive slack (say, 0.2 ns). This forces the tools to do their very best on these paths.

In Vivado's interpretation of set_max_delay, it's equivalent to requesting a clock period of 0.75 ns (1333 MHz) for these paths. So if, for example, the actual clock frequency is 250 MHz (a period of 4 ns), meeting this constraint leaves a margin of at least 4 – 0.75 = 3.25 ns. Hence if the metastability guards are in a metastable state for up to 3.25 ns, no timing violation will occur, thanks to this set_max_delay constraint.

This is of course better than no assurance at all, but is 3.25 ns enough? Is it much? Is it enough to ensure a virtually failsafe crossing of clock domains?

As already said, the answer to this depends on several factors. Judging by experiments that have been published, my hunch is that it's extremely unlikely to see a flip-flop remain in metastable condition for even as long as 1 ns on any FPGA that you and I will use. But it's difficult to say anything solid on this matter.

But having this constraint, and hence pushing the tools to do their best, puts your FPGA design in line with, or better than, everyone else's. So in the spirit of Golden Rule #4 (don't push your luck), put yourself in a position where, if it fails for you, a whole lot of other people will have a reason to complain.

Another takeaway from this discussion on timing is that if the clock's frequency is high in terms of the targeted FPGA, the double metastability guard might be a good idea: As mentioned above, metastability consumes the path's slack, which shrinks along with the clock period.

Finally, let's ask how come virtually everyone ignores this whole timing issue, and yet nobody complains about problems.

So the first explanation is that there's a good chance that the tools will place the metastability guard very close physically to the flip-flop it feeds on the FPGA's logic fabric. Hence the propagation delay between these two (e.g. @bar_metaguard to @bar) is very short in terms of that FPGA anyhow. This tight placement is however not ensured without explicit constraining, as discussed above.

Inside FPGA vendors' official FIFOs, pairs of this sort are often found on the same slice, and regardless, official FIFOs can be trusted to have this issue sorted out.

Another thing is that as FPGAs become faster, and higher clock frequencies become possible, it turns out that the flip-flops also get out of metastability quicker. So the time it takes to recover from metastability plus the propagation delay, still remains significantly shorter than the clock period of the sampling clock.

So it's no surprise that one can get away with just adding the metastability guard and calling it a day. That's however not an excuse for being lazy about adding a constraint.
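As for the tight placement mentioned above, Xilinx's tools offer the ASYNC_REG attribute, which marks registers as belonging to a synchronization chain. According to Vivado's documentation, registers that carry this attribute and are directly connected to each other are placed close together. The sketch below assumes Vivado. The module and port names are made up for the example, and other vendors' tools have their own attributes for this purpose, if any. This attribute comes in addition to the timing constraint, not instead of it.

```verilog
// Vivado-specific sketch: ASYNC_REG is a Xilinx attribute. The module
// and port names are assumptions made for this example.
module metaguard_async_reg
  (
   input  clk2, // Destination domain's clock
   input  foo,  // From the unrelated clock domain
   output bar   // For use in clk2's domain
   );

   // The attribute is set on all registers of the chain, so that the
   // tools treat them as a group
   (* ASYNC_REG = "TRUE" *) reg bar_metaguard;
   (* ASYNC_REG = "TRUE" *) reg bar_reg;

   always @(posedge clk2)
     begin
        bar_metaguard <= foo;
        bar_reg <= bar_metaguard;
     end

   assign bar = bar_reg;
endmodule
```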

Limitations on the clock frequencies

I haven't said much so far about the frequencies of the clocks in either clock domain. For example, what happens if the clock in the source domain (@clk1) is faster than the destination's (@clk2)?

Since there's no expectation on any synchronization between the clocks, it doesn't matter if the source frequency is higher or lower per se. But if the source signal (@foo) toggles too fast, it might go back and forth before it had any chance to be noticed on the destination's clock domain. So if all transitions must be visible at the destination, the source clock's period must be slightly longer (i.e. have a lower frequency) than the destination clock's.

How much longer? Not by very much. Or if you insist on an accurate answer: That depends primarily on the sampling flip-flop's timing requirements. If its input signal toggles tsu before that flip-flop's sampling clock, it's on the edge of being sampled properly, so it must be sampled on the next clock cycle to avoid being missed. To guarantee that, the input signal must remain steady until thold after the next sampling clock. So the source clock's period must be longer than the destination clock's period by tsu + thold + 2tj, where tj is the uncertainty in the clock's period (jitter). It's counted twice, once for the source clock and a second time for the destination's.

So tsu + thold + 2tj is what I meant with "slightly longer" clock period for the source clock.
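Written as a formula, with T_clk1 and T_clk2 denoting the periods of @clk1 and @clk2 (these two symbols are introduced here for convenience), the requirement for a signal like @foo, which may toggle on every cycle of @clk1, is as follows. The numbers in the example are invented just to show the arithmetic, and aren't taken from any real FPGA.

```latex
\[
  T_{clk1} > T_{clk2} + t_{su} + t_{hold} + 2 t_{j}
\]

% Example with invented figures:
% T_{clk2} = 10 ns, t_{su} = t_{hold} = 0.1 ns, t_{j} = 0.05 ns
\[
  T_{clk1} > 10 + 0.1 + 0.1 + 2 \cdot 0.05 = 10.3 \ \mathrm{ns}
\]
```

So with these made-up figures, a 100 MHz destination clock can catch every transition as long as the source clock runs at roughly 97 MHz or slower.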

This concludes the second page in this series. The next page shows techniques for passing data across clock domains.

Copyright © 2021-2022. All rights reserved. (42e6e8c4)