
Electronic Exorcism: Why FPGAs sometimes behave as if they are possessed

When things get weird

FPGAs are, generally speaking, extremely reliable devices. However, like any piece of technology, they won't do well if they are used incorrectly.

Unfortunately, the available FPGA design tools are quite unhelpful in guiding us humans away from mistakes, which too often leads to unreliable and unpredictable behavior. More specifically, the FPGA tools guarantee reliable operation only if certain design practices are followed, and there's an underlying assumption that we know what we're doing. Unlike software compilers, which refuse to generate executable code if there's an error in the source code, FPGA tools just go "here's your bitstream, try it out if you like, and by the way, here are 200 warnings, and if you understand what warning #143 means, you'll realize that there's a serious problem to be fixed".

It's not unusual for people working with FPGAs to feel that the electronics has been possessed by some magic force, and that nothing makes sense. A problem may appear and disappear as a result of completely unrelated changes, and there's no rational explanation in sight. Many properly educated engineers tend to adopt nonsense theories about FPGAs in an attempt to find some explanation for this weird behavior.

This is "Black Magic Mode": When sensible people stop believing that there's a rational explanation to their problem, and instead look for a fix that is based upon their experience. Everything works fine when the device is really cold? Fine, put a huge heatsink. The problem appears only with some boards and not others? OK, test each board, and throw away those that don't. And so on.

What it looks like

This is a partial list of issues that can drive engineers into an irrational mindset. It always goes "everything works fine, except when..."

This often comes with a belief that the electronics itself is flawed and doesn't live up to its datasheet. Conspiracy theories about how a huge company ships flawed devices aren't unusual.

So this isn't just a bug. Indeed, bugs can drive one crazy, and yet they tend to have some level of repeatability, and they surely don't appear and disappear because of the air conditioning. I've never heard a software engineer blame a bug on the computer itself (even though on very rare occasions this is actually the case). By contrast, mistakes in an FPGA design can definitely cause malfunctions at the hardware level. From there, the distance to blaming the whole world is short.

There is a logical explanation. Really.

And it's within reach. Not necessarily easily.

Having worked as a freelancer in this field for quite a while, and having fixed situations like this every now and then, believe me when I tell you: Except for extremely rare cases, the FPGA is just fine, and the problem is probably in the bitstream.

The bad news is that quite often, there's more than one flaw in the FPGA design that may be the reason for the visible problem. Consequently, there might be a lot of things to rectify. Quite a few times, I've been asked to fix an "almost working" FPGA design, only to soon realize that I'm facing the unrealistic expectation that "this little problem" will be fixed quickly. Stabilizing a design often means working a lot without any apparent progress.

One way or another, there's no choice. On this page, I'll try to bring back rational thinking, among other things by listing possible reasons for what appears to be a ghost in the electronics.

Of course, the best way is to avoid this situation to begin with. Following the Golden Rules that I've listed on a different page is a good start.

Why things get weird

Well, the short answer is that something is wrong with the FPGA design. And when that is the case, there are in principle two possibilities. The relatively fortunate possibility is when there's a visible and steady malfunction of whatever the FPGA is supposed to do. This is like a software bug: Find it, fix it, see that it works after the fix, done.

The less fortunate possibility is that the FPGA works, but it's mostly a lucky coincidence. Hence when some conditions change, the FPGA suddenly stops working properly, and then maybe goes back to working OK. But why would anything change?

So here's the point: An FPGA is a piece of electronics, and as such it's subject to manufacturing tolerances. Even more important, when the temperature of the silicon changes, so does the speed at which the transistors switch state and signals propagate. Changes in the supply voltage also influence how quickly things happen inside the FPGA.

Hence if the device gets a bit warmer or cooler, a signal can arrive at a flip-flop slightly later or earlier relative to the clock. This alone can make the flip-flop miss the signal it should have sampled, or the other way around. This applies even to local heating, because some adjacent (and possibly unrelated) logic on the die becomes more or less active.

Likewise, FPGA devices have manufacturing tolerances. Even though all devices meet certain test requirements, some may have faster silicon than others. This is why a design might work on one FPGA and not on another.

These slight shifts in when the signals toggle can make the difference between something that works perfectly and a catastrophe. Therefore, deviations in manufacturing, temperature and voltages can make a visible difference.

So how is an FPGA reliable? When the FPGA design is made correctly, the FPGA development tools make sure everything always works as defined. Or more precisely, on any FPGA that has passed the manufacturing tests and is used within the requirements in the datasheet — which primarily boils down to ambient temperature within range, cooling as necessary, and correct voltages at the FPGA's pins.

However, if the required FPGA design practices aren't followed, the tools don't ensure proper operation either. This means that factors which shouldn't make any difference turn into the make-or-break of the entire board, and no weirdness is out of reach.

At the risk of repeating myself, here are a few examples:

This is worth repeating over and over again: When the FPGA design is done properly, none of this ever happens. Or at least extremely rarely. Few people realize how reliable electronics is when the datasheets are read and followed, and the FPGA is used correctly as well.

But I guess this preaching is a bit too late for those reading this page — the problem is already there. So based upon my own experience, I've listed some common reasons for a seemingly haunted FPGA. If you're facing such a problem, there's a good chance it's one of these.

Reason #1: Timing

To a large extent, the tools ensure the stable operation of the FPGA by meeting the timing constraints that you have given to them. This is the deal between you and the tools: You express the timing requirements accurately, and the tools make sure they are achieved on any FPGA you may use, under the device's full range of temperatures and voltages.

It's not unusual for the timing constraints to be just a single clock period constraint, stating the reference clock's frequency. This might be good enough, but if you've just copied this line from some other design and, hey, it works, there's a good reason for a review.

Actually, this is about getting the timing right in general. And it's not a trivial task, not even for the most experienced FPGA designer. It means being sure that every single signal path in the design is subject to a constraint that ensures that the flip-flop at the end always samples the signal right. Except for those paths that don't need constraining.

So the first thing to check: Did the design meet the timing constraints? This is really basic, but since most tools generate a bitstream regardless, FPGA newbies can trip over this simple one.

The next thing is to review the timing constraints. There's a separate page discussing such a checkup, but to make a long story short: Do you understand exactly what they mean? Do they mean exactly what they should? If there are selective timing constraints — that is, they cover some paths specifically with filter conditions — do they indeed target the correct paths?

And then, the timing report should be read carefully. Once again, this separate page elaborates on this topic.

Another thing to look at is crossing clock domains. Are clock domains crossed unsafely, possibly due to lack of attention to which logic relies on which clock? And if clock domain crossings are done with anything other than vendor-supplied FIFOs, are they done safely?
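
For reference, here's a minimal sketch of the classic way to move a single-bit, slowly changing signal into another clock domain: two cascaded flip-flops that run on the destination clock. The signal names are made up for this example, and this technique is suitable only for a single bit; multi-bit values and pulses require other methods, such as FIFOs or handshakes.

reg sync_stage, sync_out;

// @async_in arrives from another clock domain. @sync_out is safe to
// use in @clk's domain, a couple of clock cycles later.
always @(posedge clk)
  begin
    sync_stage <= async_in;   // This register may go metastable
    sync_out   <= sync_stage; // This one is (practically) always stable
  end

Note that the tools usually need to be told about such synchronizers as well, typically with a synthesis attribute and/or a timing exception on the path into the first register; the details depend on the vendor's tools.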

Reason #2: Improper resets

It may not seem related, but if the logic's initial state isn't ensured, it might very well be the source of a touch of black magic.

The rule is simple: If you haven't given resets and the logic's wakeup a serious thought, odds are that you got it wrong.

In particular, if your view of resets is to write clauses like

   always @(posedge clk or negedge resetn)
     if (!resetn)
       the_reg <= 0;
     else
       the_reg <= [ ... ] ;

without explicitly taking care of what happens when @resetn is deasserted, you should definitely look at this page.

Either way, it's a good idea to check that there are resets where they should be, and that they do their job properly. What this means is discussed in a short series of pages on this topic.
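
To illustrate what taking care of the reset's deassertion can mean, here's a minimal sketch of a common reset bridge: the reset is asserted asynchronously, but released synchronously with @clk, so that all flip-flops leave the reset state on a well-defined clock cycle. The signal names are made up for this example.

reg [1:0] reset_sync;
wire      resetn_internal;

// Assert reset immediately when @resetn_pin goes low, but release it
// only on a rising edge of @clk, two clock cycles after @resetn_pin
// has gone high again:
always @(posedge clk or negedge resetn_pin)
  if (!resetn_pin)
    reset_sync <= 2'b00;
  else
    reset_sync <= { reset_sync[0], 1'b1 };

assign resetn_internal = reset_sync[1];

@resetn_internal can then be used as the reset of the logic that runs on @clk, since its deassertion is aligned with this clock.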

Reason #3: Clocking

The quality of clocks is possibly the most underrated topic in digital design. It's often like "yep, it toggles, let's use it as a clock".

FPGA logic should be clocked with a clock that is known to be stable and to have adequate jitter performance. No less important, the physical connection of the clock to the FPGA must be stable and reliable.

So whatever is used as @clk in a statement like

always @(posedge clk)

must be treated with great care. Ideally, this clock originates in a dedicated clock generator device (oscillator), which ensures a stable and low-jitter clock. Generally speaking, it's better to use this external clock as a reference to a PLL, rather than to connect it directly to logic. This holds true even if the PLL doesn't change the clock's frequency.

This is because it allows monitoring the PLL's lock detector, and holding the logic that depends on the PLL's output in reset when the PLL goes out of lock. Doing so significantly reduces the chances for problems if the reference clock has stability issues (in particular soon after powerup).
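
As a sketch of this idea, the PLL's lock indication can simply be folded into the reset source, and the result treated like any other asynchronous reset, i.e. passed through a reset bridge like the one shown in the section about resets, clocked by the PLL's output clock. The names @pll_locked and @resetn_pin are made up for this example, and the polarity and behavior of the lock detector vary between FPGA families, so the vendor's documentation applies.

// Active-low reset source: asserted when the board-level reset is
// asserted, or whenever the PLL reports that it isn't locked.
wire unsafe_resetn;

assign unsafe_resetn = resetn_pin & pll_locked;

This @unsafe_resetn then goes into the reset bridge, so the logic wakes up only when the PLL is stable, and goes back into reset if the PLL loses lock.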

However, in some situations the PLL may appear to cause unnecessary resets, as it loses lock sporadically. The quick fix appears to be removing the PLL and connecting the external clock directly to the logic, after which everything seems to work fine. In a case like this, odds are that there's an issue with the reference clock. Removing the PLL doesn't solve the problem, but rather pushes it into the logic fabric, possibly opening the door to a black magic situation.

But clocks from dedicated oscillators are really the easy case. It gets worse with other sources.

Clocks that are created by a processor or by one of its peripherals should be used carefully, if at all: They may halt or produce illegal waveforms momentarily, possibly because their related hardware registers are written to by software. Such brief transients may not be visible when examining the clock with an oscilloscope, but can nevertheless cause weird malfunctions.

Another common source of problems is poor handling of source-synchronous clock / data inputs, i.e. when an external device supplies a clock signal and one or more data signals, so that the data is synchronous with the clock. Typically, the data signals may toggle only on the clock's rising edges (or only on its falling edges).

A common, but rather dangerous, solution is to use the source-synchronous clock directly to drive logic in the FPGA. This is problematic partly because such clocks are often not intended to be used as continuous clocks, and may therefore stop momentarily or generate spurious pulses.

Source-synchronous interfaces are often connected to the FPGA through a physical connector, e.g. when the data source is a camera that is connected to the main board through a cable. Connectors are usually reliable, but even the loss of physical contact for a nanosecond due to vibration can be enough to result in an illegal toggle of a clock signal. This can happen with data signals as well, of course, but that is usually less significant, in particular when the data source is a camera. However, when such a clock signal drives logic directly, a one-nanosecond pulse can definitely stir up a mess.

The best solution for source-synchronous clock / data interfaces is synchronous sampling of both the clock and the data with a safe and significantly faster clock. Preferably, this is done with the dedicated registers adjacent to the I/O pins. This way, both the clock and the data are treated by the FPGA's logic as data, and the clock's transitions are detected as a register that changes its value: Plain synchronous logic, which runs on the faster clock, detects positive or negative edges of the input clock (whichever applies) simply by noting that the relevant register's value was '0' on the previous cycle and '1' on the current one (or vice versa). When such a transition is detected, this logic passes the samples of the data inputs on for processing.

The clear advantage of this method is that no matter what happens with the clock signal, the FPGA's logic continues to rely on a safe clock. It's up to the logic that detects the edges to respond adequately if the data-related clock goes crazy.
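
As a rough sketch of this technique, assuming an 8-bit data bus that toggles on the source clock's rising edges only (the names @src_clk_in and @src_data_in are made up for this example):

reg [1:0] clk_samples;
reg [7:0] data_sample;
reg [7:0] captured_data;
reg       captured_valid;

always @(posedge clk)
  begin
    // Sample the source-synchronous clock and data like plain data
    // inputs, preferably with the registers inside the I/O blocks:
    clk_samples <= { clk_samples[0], src_clk_in };
    data_sample <= src_data_in;

    // A '0' followed by a '1' on the sampled clock means that a
    // rising edge of the source clock was caught:
    captured_valid <= (clk_samples == 2'b01);

    if (clk_samples == 2'b01)
      captured_data <= data_sample;
  end

Which data sample goes along with the detected edge depends on the interface's timing, so in a real design this alignment (as well as the required ratio between @clk and the source clock) has to be worked out carefully.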

This technique is possible for relatively low source clock frequencies (say, up to 200-300 MHz, if DDR sampling is employed, depending on the FPGA's speed).

For faster sources, the preferred solution is to feed a PLL with the source's clock, and have the PLL's output clock the logic. As mentioned above, the logic should be reset when the PLL indicates that it's not locked. This is likely to be the solution for a different reason as well: When the frequency is too high for synchronous sampling as just suggested, odds are that the only way to ensure proper sampling is to find the correct timing by phase-shifting the sampling clock and detecting an error-free sampling range while the data source is toggling. This technique requires involving a PLL anyhow.

Reason #4: Violating the rules of RTL design

Proper Verilog (or VHDL) code for synthesis has to follow some strict rules, in particular the RTL (Register Transfer Level) paradigm. Among other things, this means that any storage element (e.g. a flip-flop) changes its value only as a result of a clock edge. The only exception to this is an asynchronous reset, which can't be just any signal.

When a synthesizer encounters Verilog code that violates these rules, it usually plays along regardless, and generates some logic that might or might not implement what a simulation would have presented. In some cases, the synthesized logic appears to implement the expected behavior, but it may fail randomly.

For example, consider this wrong design for a counter between 0 and 14:

reg [3:0] counter;
wire      reset_cnt;

assign reset_cnt = (counter == 15); // This is so wrong!

always @(posedge clk or posedge reset_cnt)
  if (reset_cnt)
    counter <= 0;
  else
    counter <= counter + 1;

The horrible mistake is to use @reset_cnt as an asynchronous reset.

But let's begin by explaining how this should have worked, and actually will work in simulation: @counter is incremented on rising edges of @clk, but when it reaches the value 15, @reset_cnt is asserted and resets @counter back to zero asynchronously. Hence when @counter is sampled with @clk, it displays the values 0 to 14, as it should.

In hardware, this might or might not work. The problem is that @reset_cnt is a combinatoric expression of @counter. So when @counter changes value from 7 to 8, the logic that calculates @reset_cnt might see @counter as 15 briefly. This is because 7 is 0111 in binary, and 8 is 1000. So if bit 3 has the shortest propagation delay to the logic that calculates @reset_cnt, this signal might be asserted briefly. As a result, @counter will sometimes count from 0 to 14, and sometimes from 0 to 7. Temperature and other unrelated factors can influence this wiggling logic.

This explanation of why this example is wrong is greatly simplified, however. The tools are free to implement combinatoric logic in the most obscure ways, so virtually anything can happen between clock edges. The only thing that the tools guarantee is that the signals at the flip-flops' inputs are stable so as to meet the required setup and hold specifications.
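
For comparison, here's a sketch of how the same 0-to-14 counter can be written in proper RTL style: the comparison is made synchronously, so no signal acts as an asynchronous reset, and brief intermediate values of the combinatoric logic are harmless (in a real design, @counter also needs a known initial value, as discussed in the section about resets).

reg [3:0] counter;

always @(posedge clk)
  if (counter == 14)     // Wrap around synchronously after 14
    counter <= 0;
  else
    counter <= counter + 1;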

So unless the logic design follows the rules for RTL design strictly, weird things can definitely happen.

Reason #5: Temperature and power supplies

This isn't a common reason, and it's easy to check. Nevertheless, it can be the root cause of odd problems.

Quite naturally, if the temperature of the FPGA's silicon is outside the allowed range, nothing is guaranteed to work. Overheating due to insufficient thermal planning, or fans struggling with dust, are the most common reasons.

As for power supplies, they might produce a faulty output for various reasons. A simple check with an oscilloscope often reveals whether the voltage is within the specified range. Note, however, that the voltage should stay within this range at all times. It's not enough that the average voltage is correct; neither the natural noise of switching power supplies nor incidental spikes may exceed the limits.

Note that even though a 1 μs long spike may seem innocent, it's tens to hundreds of clock cycles inside the FPGA, so it's a significant period of time during which the FPGA is fed with an incorrect voltage. Preferably, measure the voltage at the decoupling capacitors close to the FPGA, to see the voltage that actually arrives. Also set the oscilloscope's trigger at the upper and lower voltage limits, and make sure that the oscilloscope doesn't trigger on these. Brief spikes are easily missed on the oscilloscope's display, but the trigger will catch them.

Sometimes power supply issues are a direct result of poor design. Many power supply modules have a minimal load current that is often overlooked. If this minimal current isn't drawn from the power supply module, it may become unstable and produce a voltage that doesn't meet its spec, or even worse, it might oscillate occasionally.

Another common mistake is to use a switching power supply where a linear regulator is necessary. In particular, some low-jitter clock oscillators require a very clean supply. If such an oscillator is driven by a noisy power supply, this noise is reflected as jitter on the clock output. If a Gigabit transceiver is driven by this clock (e.g. for PCIe, USB 3.x, fiber optics, etc.), this often results in an unreliable data link.

Likewise, when DDR memories are part of the design, a reference voltage supply is required. This voltage is used by both the FPGA and the DDR memories as the threshold between a logic '0' and '1' on the wires that go between these two. If this voltage is generated by a switching power supply, odds are that the supply's noise will make it harder, or even impossible, to transmit data error-free between the FPGA and the memories.

Reason #6: Are you kidding me?

Sometimes, the reason for the black magic situation is a flaw so big that one wonders how anything worked at all. For example, when a signal on the PCB is completely disconnected from the relevant FPGA pin, and yet it keeps driving the correct value into the FPGA by virtue of crosstalk or parasitic capacitance.

This tends to happen in particular with clocks, because they are often routed all over the board, and the fact that they are periodic signals improves their chances of getting across well enough to appear OK.

So by all means, take an oscilloscope and check all clocks as close as possible to the FPGA. If there's an AC coupling capacitor for the clock, it's a good place to probe, in particular because you may discover that the capacitor is missing.

Reason #7: Plain bugs

Or more precisely: The design was never made to work. At no point did anyone sit down and figure out how the logic is ensured to do its job. Instead, the code was written incrementally by trial and error, partly with simulations and partly against hardware. This process stopped when things appeared to work fine, but looking at the code, it seems like a miracle: Because it has been patched so many times to fix just that little thing, it's impossible to follow what's going on, let alone make changes.

I've put this reason last, because it's not really a black magic thing. It's just a very annoying bug. Nevertheless, this is the most common reason FPGA projects get stuck.

If you still think it's the FPGA's fault

Sometimes it's not your fault. There might be a bug in the FPGA itself or its vendor's software. This happens by far less often than people tend to blame the FPGA vendor, but on rare occasions, this is indeed the case.

Because of the natural temptation to blame someone else, do yourself a favor, and don't wrap up the exorcism session by blaming the FPGA, unless you have one of these two, or both:

If you do wrap up without any of these and manage to somehow work around the problem, there's a good chance you'll meet it again later.

The best example I have of an FPGA bug was a long time ago, with Xilinx's Virtex-4 hardware FIFO — that is, a dual-clock FIFO which had the FIFO-related logic implemented directly in silicon (as opposed to in logic fabric).

The data flow through this FIFO got stuck every now and then. After some investigation, it turned out that the FIFO held both its empty and full signals asserted after running for a while. This is an illegal condition, unless the FIFO is held in reset, which it wasn't. So after making absolutely sure that I was observing the correct signals, I closed the case and went for FIFOs that are implemented in logic fabric instead.

A few months later, I found an errata record on these FIFOs, which I wouldn't have understood had I not known about the problem beforehand. But after reading the description very carefully, I could conclude that it confirmed my observation.

Summary

When the FPGA appears to defy the laws of nature, it's tempting to adopt explanations that depart from common sense. It's nevertheless important to look for a rational explanation — and quite often this explanation can be found without the need for superpowers.

Hunting down the reason might however require a thorough review of the design, which isn't necessarily a bad thing. Frustrating as such a hunt may be, it may contribute significantly to the design's quality regardless.

Copyright © 2021-2022. All rights reserved.