
Electronic Exorcism: Why FPGAs sometimes behave as if they are possessed

When things get weird

FPGAs are, generally speaking, extremely reliable devices. However, like any piece of technology, they won't do well if they are used incorrectly.

Unfortunately, the available FPGA design tools are quite unhelpful in guiding us humans away from mistakes, which too often leads to unreliable and unpredictable behavior. More specifically, the FPGA tools guarantee reliable operation only if certain design practices are followed, and there's an underlying assumption that we know what we're doing. Unlike software compilers, which refuse to generate executable code if there's an error in the source code, FPGA tools just go "here's your bitstream, try it out if you like, and by the way, here are 200 warnings, and if you understand what warning #143 means, you'll realize that there's a serious problem to be fixed".

It's not unusual for people working with FPGAs to feel that the electronics has been possessed by some magic force, and that nothing makes sense. A problem may appear and disappear as a result of completely unrelated changes, and there's no rational explanation in sight. Many properly educated engineers tend to adopt nonsense theories about FPGAs in an attempt to find some explanation for this weird behavior.

This is "Black Magic Mode": When sensible people stop believing that there's a rational explanation to their problem, and instead look for a fix that is based upon their experience. Everything works fine when the device is really cold? Fine, put a huge heatsink. The problem appears only with some boards and not others? OK, test each board, and throw away those that don't. And so on.

What it looks like

This is a partial list of issues that can drive engineers into an irrational mindset. It always goes "everything works fine, except when..."

This often comes with a belief that the electronics itself is flawed and doesn't live up to its datasheet. Conspiracy theories about how a huge company ships flawed devices aren't unusual.

So this isn't just a bug. Indeed, bugs can drive one crazy, and yet they tend to have some level of repeatability, and they surely don't appear and disappear because of the air conditioning. I've never heard a software engineer blame a bug on the computer itself (even though on very rare occasions this is actually the case). By contrast, mistakes in an FPGA design can definitely cause malfunctions at the hardware level. From there, the distance to blaming the whole world is short.

There is a logical explanation. Really.

And it's within reach. Not necessarily easily.

Having worked as a freelancer in this field for quite a while, and having fixed situations like this every now and then, believe me when I tell you: Except for extremely rare cases, the FPGA is just fine, and the problem is probably in the bitstream.

The bad news is that quite often, there's more than one flaw in the FPGA design that may be the reason for the visible problem. Consequently, there might be a lot of things to rectify. Quite a few times, I've been asked to fix an "almost working" FPGA design, only to soon realize that I'm facing the unrealistic expectation that "this little problem" will be fixed quickly. Stabilizing a design often means working a lot without any apparent progress.

One way or another, there's no choice. On this page, I'll try to bring back rational thinking, among other things by listing possible reasons for what appears to be a ghost in the electronics.

Of course, the best way is to avoid this situation to begin with. Following the Golden Rules that I've listed on a different page is a good start.

Why things get weird

Well, the short answer is that something is wrong with the FPGA design. And when that is the case, there are in principle two possibilities. The relatively fortunate possibility is when there's a visible and steady malfunction of whatever the FPGA is supposed to do. This is like a software bug: Find it, fix it, see that it works after the fix, done.

The less fortunate possibility is that the FPGA works, but it's mostly a lucky coincidence. Hence when some conditions change, the FPGA suddenly stops working properly, and then maybe goes back to working OK. But why would anything change?

So here's the point: An FPGA is a piece of electronics, and as such it's subject to manufacturing tolerances. Even more important, when the temperature of the silicon changes, so does the speed at which the transistors switch state and signals propagate. Changes in the supply voltage also influence how quickly things happen inside the FPGA.

Hence if the device gets a bit warmer or cooler, a signal can arrive at a flip-flop slightly later or earlier relative to the clock. This alone can make the flip-flop miss the signal it should have sampled, or the other way around. This applies even to local heating, because some adjacent (and possibly unrelated) logic on the die becomes more or less active.

Likewise, FPGA devices have manufacturing tolerances. Even though all devices meet certain test requirements, some may have faster silicon than others. This is why a design might work on one FPGA and not on another.

These slight shifts in when the signals toggle can make the difference between something that works perfectly and a catastrophe. Therefore, deviations in manufacturing, temperature and voltages can make a visible difference.

So how is an FPGA reliable? When the FPGA design is made correctly, the FPGA development tools make sure everything always works as defined. Or more precisely, on any FPGA that has passed the manufacturing tests and is used within the requirements in the datasheet — which primarily boils down to ambient temperature within range, cooling as necessary, and correct voltages at the FPGA's pins.

However, if the required FPGA design practices aren't followed, the tools don't ensure proper operation either. This means that factors which shouldn't make any difference turn into the make-or-break of the entire board, and no weirdness is out of reach.

At the risk of repeating myself, here are a few examples:

This is worth repeating over and over again: When the FPGA design is done properly, none of this ever happens. Or at least extremely rarely. Few people realize how reliable electronics is when the datasheets are read and followed, and the FPGA is used correctly as well.

But I guess this preaching is a bit too late for those reading this page — the problem is already there. So based upon my own experience, I've listed some common reasons for a seemingly haunted FPGA. If you're facing such a problem, there's a good chance it's one of these.

Reason #1: Timing

To a large extent, the tools ensure the stable operation of the FPGA by meeting the timing constraints that you have given to them. This is the deal between you and the tools: You express the timing requirements accurately, and the tools make sure they are achieved on any FPGA you may use, under the device's full range of temperatures and voltages.

It's not unusual for the timing constraints to be just a single clock period constraint, stating the reference clock's frequency. This might be good enough, but if you've just copied this line from some other design and, hey, it works, there's a good reason for a review.

Actually, this is about getting the timing right in general. And it's not a trivial task, not even for the most experienced FPGA designer. It means being sure that every single signal path in the design is subject to a constraint that ensures that the flip-flop at the end always samples the signal right. Except for those paths that don't need constraining.

So the first thing to check: Did the design meet the timing constraints? This is really basic, but since most tools generate a bitstream regardless, FPGA newbies can trip over this simple one.

The next thing is to review the timing constraints. There's a separate page discussing such a checkup, but to make a long story short: Do you understand exactly what they mean? Do they mean exactly what they should? If there are selective timing constraints — that is, they cover some paths specifically with filter conditions — do they indeed target the correct paths?

And then, the timing report should be read carefully. Once again, this separate page elaborates on this topic.

Another thing to look at is crossing clock domains. Are clock domains crossed unsafely, possibly due to lack of attention to which logic relies on which clock? And if clock domain crossings are done with anything other than vendor-supplied FIFOs, are they done safely?
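
For reference, here's a minimal sketch of the classic way to move a single-bit, slowly changing signal into another clock domain: two cascaded flip-flops that run on the destination clock. The signal names are made up for this example, and this technique is suitable only for a single bit; multi-bit values and pulses require other methods, such as FIFOs or handshakes.

reg sync_stage, sync_out;

// @async_in arrives from another clock domain. @sync_out is safe to
// use in @clk's domain, a couple of clock cycles later.
always @(posedge clk)
  begin
    sync_stage <= async_in;   // This register may go metastable
    sync_out   <= sync_stage; // This one is (practically) always stable
  end

Note that the tools usually need to be told about such synchronizers as well, typically with a synthesis attribute and/or a timing exception on the path into the first register; the details depend on the vendor's tools.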

Reason #2: Improper resets

It may not seem related, but if the logic's initial state isn't ensured, it might very well be the source of a touch of black magic.

The rule is simple: If you haven't given resets and the logic's wakeup a serious thought, odds are that you got it wrong.

In particular, if your view of resets is to write clauses like

   always @(posedge clk or negedge resetn)
     if (!resetn)
       the_reg <= 0;
     else
       the_reg <= [ ... ] ;

without explicitly taking care of what happens when @resetn is deasserted, you should definitely look at this page.

Either way, it's a good idea to check that there are resets where they should be, and that they do their job properly. What this means is discussed in a short series of pages on this topic.
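
To illustrate what taking care of the reset's deassertion can mean, here's a minimal sketch of a common reset bridge: the reset is asserted asynchronously, but released synchronously with @clk, so that all flip-flops leave the reset state on a well-defined clock cycle. The signal names are made up for this example.

reg [1:0] reset_sync;
wire      resetn_internal;

// Assert reset immediately when @resetn_pin goes low, but release it
// only on a rising edge of @clk, two clock cycles after @resetn_pin
// has gone high again:
always @(posedge clk or negedge resetn_pin)
  if (!resetn_pin)
    reset_sync <= 2'b00;
  else
    reset_sync <= { reset_sync[0], 1'b1 };

assign resetn_internal = reset_sync[1];

@resetn_internal can then be used as the reset of the logic that runs on @clk, since its deassertion is aligned with this clock.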

Reason #3: Clocking

The quality of clocks is possibly the most underrated topic in digital design. It's often like "yep, it toggles, let's use it as a clock".

FPGA logic should be clocked with a clock that is known to be stable and to have adequate jitter performance. No less important, the physical connection of the clock to the FPGA must be stable and reliable.

So whatever is used as @clk in a statement like

always @(posedge clk)

must be treated with great care. Ideally, this clock originates in a dedicated clock generator device (oscillator), which ensures a stable and low-jitter clock. Generally speaking, it's better to use this external clock as a reference to a PLL, rather than to connect it directly to logic. This holds true even if the PLL doesn't change the clock's frequency.

This is because it allows monitoring the PLL's lock detector, and holding the logic that depends on the PLL's output in reset when the PLL goes out of lock. Doing so significantly reduces the chances for problems if the reference clock has stability issues (in particular soon after powerup).
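
As a sketch of this idea, the PLL's lock indication can simply be folded into the reset source, and the result treated like any other asynchronous reset, i.e. passed through a reset bridge like the one shown in the section about resets, clocked by the PLL's output clock. The names @pll_locked and @resetn_pin are made up for this example, and the polarity and behavior of the lock detector vary between FPGA families, so the vendor's documentation applies.

// Active-low reset source: asserted when the board-level reset is
// asserted, or whenever the PLL reports that it isn't locked.
wire unsafe_resetn;

assign unsafe_resetn = resetn_pin & pll_locked;

This @unsafe_resetn then goes into the reset bridge, so the logic wakes up only when the PLL is stable, and goes back into reset if the PLL loses lock.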

However, in some situations the PLL may appear to cause unnecessary resets, as it loses lock sporadically. The quick fix appears to be removing the PLL and connecting the external clock directly to the logic, after which everything seems to work fine. In a case like this, odds are that there's an issue with the reference clock. Removing the PLL doesn't solve the problem, but rather pushes it into the logic fabric, possibly opening the door to a black magic situation.

But clocks from dedicated oscillators are really the easy case. It gets worse with other sources.

Clocks that are created by a processor or by one of its peripherals should be used carefully, if at all: They may halt or produce illegal waveforms momentarily, possibly because their related hardware registers are written to by software. Such brief transients may not be visible when examining the clock with an oscilloscope, but can nevertheless cause weird malfunctions.

Another common source of problems is poor handling of source-synchronous clock / data inputs, i.e. when an external device supplies a clock signal and one or more data signals, so that the data is synchronous with the clock. Typically, the data signals may toggle only on the clock's rising edges (or only on its falling edges).

A common, but rather dangerous, solution is to use the source-synchronous clock directly to drive logic in the FPGA. This is problematic partly because such clocks are often not intended to be used as continuous clocks, and may therefore stop momentarily or generate spurious pulses.

Source-synchronous interfaces are often connected to the FPGA through a physical connector, e.g. when the data source is a camera that is connected to the main board through a cable. Connectors are usually reliable, but even the loss of physical contact for a nanosecond due to vibration can be enough to result in an illegal toggle of a clock signal. This can happen with data signals as well, of course, but that is usually less significant, in particular when the data source is a camera. However, when such a clock signal drives logic directly, a one-nanosecond pulse can definitely stir up a mess.

The best solution for source-synchronous clock / data interfaces is synchronous sampling of both the clock and the data with a safe and significantly faster clock. Preferably, this is done with the dedicated registers adjacent to the I/O pins. This way, both the clock and the data are treated by the FPGA's logic as data, and the clock's transitions are detected as a register that changes its value: Plain synchronous logic, which runs on the faster clock, detects positive or negative edges of the input clock (whichever applies) simply by noting that the relevant register's value was '0' on the previous cycle and '1' on the current one (or vice versa). When such a transition is detected, this logic passes the samples of the data inputs on for processing.

The clear advantage of this method is that no matter what happens with the clock signal, the FPGA's logic continues to rely on a safe clock. It's up to the logic that detects the edges to respond adequately if the data-related clock goes crazy.
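
As a rough sketch of this technique, assuming an 8-bit data bus that toggles on the source clock's rising edges only (the names @src_clk_in and @src_data_in are made up for this example):

reg [1:0] clk_samples;
reg [7:0] data_sample;
reg [7:0] captured_data;
reg       captured_valid;

always @(posedge clk)
  begin
    // Sample the source-synchronous clock and data like plain data
    // inputs, preferably with the registers inside the I/O blocks:
    clk_samples <= { clk_samples[0], src_clk_in };
    data_sample <= src_data_in;

    // A '0' followed by a '1' on the sampled clock means that a
    // rising edge of the source clock was caught:
    captured_valid <= (clk_samples == 2'b01);

    if (clk_samples == 2'b01)
      captured_data <= data_sample;
  end

Which data sample goes along with the detected edge depends on the interface's timing, so in a real design this alignment (as well as the required ratio between @clk and the source clock) has to be worked out carefully.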

This technique is possible for relatively low source clock frequencies (say, up to 200-300 MHz, if DDR sampling is employed, depending on the FPGA's speed).

For faster sources, the preferred solution is to feed a PLL with the source's clock, and have the PLL's output clock the logic. As mentioned above, the logic should be reset when the PLL indicates that it's not locked. This is likely to be the solution for a different reason as well: When the frequency is too high for synchronous sampling as just suggested, odds are that the only way to ensure proper sampling is to find the correct timing by phase-shifting the sampling clock and detecting an error-free sampling range while the data source is toggling. This technique requires involving a PLL anyhow.

Reason #4: Violating the rules of RTL design

Proper Verilog (or VHDL) code for synthesis has to follow some strict rules, in particular the RTL (Register Transfer Level) paradigm. Among other things, this means that any storage element (e.g. a flip-flop) changes its value only as a result of a clock edge. The only exception to this is an asynchronous reset, which can't be just any signal.

When a synthesizer encounters Verilog code that violates these rules, it usually plays along regardless, and generates some logic that might or might not implement what a simulation would have presented. In some cases, the synthesized logic appears to implement the expected behavior, but it may fail randomly.

For example, consider this wrong design for a counter between 0 and 14:

reg [3:0] counter;
wire      reset_cnt;

assign reset_cnt = (counter == 15); // This is so wrong!

always @(posedge clk or posedge reset_cnt)
  if (reset_cnt)
    counter <= 0;
  else
    counter <= counter + 1;

The horrible mistake is to use @reset_cnt as an asynchronous reset.

But let's begin by explaining how this should have worked, and actually will work in simulation: @counter is incremented on rising edges of @clk, but when it reaches the value 15, @reset_cnt is asserted and resets @counter back to zero asynchronously. Hence when @counter is sampled with @clk, it displays the values 0 to 14, as it should.

In hardware, this might or might not work. The problem is that @reset_cnt is a combinatoric expression of @counter. So when @counter changes value from 7 to 8, the logic that calculates @reset_cnt might see @counter as 15 briefly. This is because 7 is 0111 in binary, and 8 is 1000. So if bit 3 has the shortest propagation delay to the logic that calculates @reset_cnt, this signal might be asserted briefly. As a result, @counter will sometimes count from 0 to 14, and sometimes from 0 to 7. Temperature and other unrelated factors can influence this wiggling logic.

This explanation of why this example is wrong is greatly simplified, however. The tools are free to implement combinatoric logic in the most obscure ways, so virtually anything can happen between clock edges. The only thing that the tools guarantee is that the signals at the flip-flops' inputs are stable so as to meet the required setup and hold specifications.
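
For comparison, here's a sketch of how the same 0-to-14 counter can be written in proper RTL style: the comparison is made synchronously, so no signal acts as an asynchronous reset, and brief intermediate values of the combinatoric logic are harmless (in a real design, @counter also needs a known initial value, as discussed in the section about resets).

reg [3:0] counter;

always @(posedge clk)
  if (counter == 14)     // Wrap around synchronously after 14
    counter <= 0;
  else
    counter <= counter + 1;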

So unless the logic design follows the rules for RTL design strictly, weird things can definitely happen.

Reason #5: Temperature and power supplies

This isn't a common reason, and it's easy to check. Nevertheless, it can be the root cause of odd problems.

Quite naturally, if the temperature of the FPGA's silicon is outside the allowed range, nothing is guaranteed to work. Overheating due to insufficient thermal planning, or fans struggling with dust, are the most common reasons.

As for power supplies, they might produce a faulty output for various reasons. A simple check with an oscilloscope often reveals whether the voltage is within the specified range. Note, however, that the voltage should stay within this range at all times. It's not enough that the average voltage is correct; neither the natural noise of switching power supplies nor incidental spikes may exceed the limits.

Note that even though a 1 μs long spike may seem innocent, it's tens to hundreds of clock cycles inside the FPGA, so it's a significant period of time during which the FPGA is fed with an incorrect voltage. Preferably, measure the voltage at the decoupling capacitors close to the FPGA, to see the voltage that actually arrives. Also set the oscilloscope's trigger at the upper and lower voltage limits, and make sure that the oscilloscope doesn't trigger on these. Brief spikes are easily missed on the oscilloscope's display, but the trigger will catch them.

Sometimes power supply issues are a direct result of poor design. Many power supply modules have a minimal load current that is often overlooked. If this minimal current isn't drawn from the power supply module, it may become unstable and produce a voltage that doesn't meet its spec, or even worse, it might oscillate occasionally.

Another common mistake is to use a switching power supply where a linear regulator is necessary. In particular, some low-jitter clock oscillators require a very clean supply. If such an oscillator is driven by a noisy power supply, this noise is reflected as jitter on the clock output. If a Gigabit transceiver is driven by this clock (e.g. for PCIe, USB 3.x, fiber optics, etc.), this often results in an unreliable data link.

Likewise, when DDR memories are part of the design, a reference voltage supply is required. This voltage is used by both the FPGA and the DDR memories as the threshold between a logic '0' and '1' on the wires that go between these two. If this voltage is generated by a switching power supply, odds are that the supply's noise will make it harder, or even impossible, to transmit data error-free between the FPGA and the memories.

Reason #6: Are you kidding me?

Sometimes, the reason for the black magic situation is a flaw so big that one wonders how anything worked at all. For example, when a signal on the PCB is completely disconnected from the relevant FPGA pin, and yet it keeps driving the correct value into the FPGA by virtue of crosstalk or parasitic capacitance.

This tends to happen in particular with clocks, because they are often routed all over the board, and the fact that they are periodic signals improves their chances of getting across well enough to appear OK.

So by all means, take an oscilloscope and check all clocks as close as possible to the FPGA. If there's an AC coupling capacitor for the clock, it's a good place to probe, in particular because you may discover that the capacitor is missing.

Reason #7: Plain bugs

Or more precisely: The design was never made to work. At no point did anyone sit down and figure out how the logic is ensured to do its job. Instead, the code was written incrementally by trial and error, partly with simulations and partly against hardware. This process stopped when things appeared to work fine, but looking at the code, it seems like a miracle: Because it has been patched so many times to fix just that little thing, it's impossible to follow what's going on, let alone make changes.

I've put this reason last, because it's not really a black magic thing. It's just a very annoying bug. Nevertheless, this is the most common reason FPGA projects get stuck.

If you still think it's the FPGA's fault

Sometimes it's not your fault. There might be a bug in the FPGA itself or its vendor's software. This happens by far less often than people tend to blame the FPGA vendor, but on rare occasions, this is indeed the case.

Because of the natural temptation to blame someone else, do yourself a favor, and don't wrap up the exorcism session by blaming the FPGA, unless you have one of these two, or both:

If you do wrap up without any of these and manage to somehow work around the problem, there's a good chance you'll meet it again later.

The best example I have of an FPGA bug was a long time ago, with Xilinx's Virtex-4 hardware FIFO — that is, a dual-clock FIFO which had the FIFO-related logic implemented directly in silicon (as opposed to in logic fabric).

The data flow through this FIFO got stuck every now and then. After some investigation, it turned out that the FIFO held both its empty and full signals asserted after running for a while. This is an illegal condition, unless the FIFO is held in reset, which it wasn't. So after making absolutely sure that I was observing the correct signals, I closed the case and went for FIFOs that are implemented in logic fabric instead.

A few months later, I found an errata record on these FIFOs, which I wouldn't have understood had I not known about the problem beforehand. But after reading the description very carefully, I could conclude that it confirmed my observation.

Summary

When the FPGA appears to defy the laws of nature, it's tempting to adopt explanations that depart from common sense. It's nevertheless important to look for a rational explanation — and quite often this explanation can be found without the need for superpowers.

Hunting down the reason might however require a thorough review of the design, which isn't necessarily a bad thing. Frustrating as such a hunt may be, it may contribute significantly to the design's quality regardless.

Copyright © 2021-2022. All rights reserved.