Channel: Cadence Blogs

We Live on a Radioactive Planet Bombarded by Cosmic Rays

We live on a radioactive planet bombarded by cosmic rays. As a result, high-energy neutrons are a fact of life. They can cause errors known as single-event effects, usually abbreviated to SEE. SEE cause unpredictable system behavior and threaten safety and reliability. SEE generally result from nuclear decay or from atmospheric particles accelerated towards the earth by cosmic rays. But they can also be caused by radioactive decay of packaging materials, or of stray impurity atoms left on the die, where the energy is a lot lower but the distances are very short. These lower-energy particles include thermal neutrons, and also the alpha particles (helium nuclei) that result from radioactive decay.

As we move down the process node treadmill, new challenges often appear that we didn't really have to worry about before. These challenges often need to be addressed at several different levels: the process, the cell libraries, the design, and the EDA tools that we use.

One well-known example of this type of problem is metal migration (electromigration). If the current through a narrow metal wire is too high, the current actually moves the metal atoms. This creates a narrower neck, which raises the current density at that point, a positive feedback that makes the problem worse. Eventually the neck gets so narrow that the wire goes open circuit, and the chip fails. We address this at many levels. At the process level, we design the metal to be able to carry a high current (except in DRAMs, where we do everything we can to keep the metal cost down). At the design level, we need to make sure that we do current analysis. We need EDA tools to perform the analysis and allow us to address hot spots. Each process node typically makes the problem worse. For example, did you know that at 16nm a large buffer is no longer in spec if it drives minimum-width metal? Metal migration also used to be purely a power and clock problem, not a signal problem.
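To make the idea of current analysis concrete, here is a minimal Python sketch of the simplest possible check: compute the average current density in each wire and flag the ones above a limit. The wire names, dimensions, currents, and the limit are all made-up values for illustration, not figures from any real process, and real EDA tools apply far more detailed rules (for example, separate average, RMS, and peak limits that depend on temperature).

```python
def current_density(current_ma: float, width_um: float, thickness_um: float) -> float:
    """Average current density (mA per square micron) through a rectangular wire."""
    return current_ma / (width_um * thickness_um)


def find_hot_spots(wires: dict, limit_ma_per_um2: float) -> list:
    """Return the names of wires whose current density exceeds the limit."""
    return [
        name
        for name, (current_ma, width_um, thickness_um) in wires.items()
        if current_density(current_ma, width_um, thickness_um) > limit_ma_per_um2
    ]


# Hypothetical wires: name -> (current in mA, width in um, thickness in um).
wires = {
    "clk_trunk": (2.0, 0.20, 0.10),      # wide wire, high current
    "sig_min_width": (0.5, 0.04, 0.10),  # big buffer driving minimum-width metal
    "sig_wide": (0.5, 0.10, 0.10),       # same current, wider wire
}

LIMIT_MA_PER_UM2 = 110.0  # hypothetical limit, for illustration only

hot_spots = find_hot_spots(wires, LIMIT_MA_PER_UM2)
print(hot_spots)
```

With these invented numbers, the only wire flagged is the signal wire at minimum width. It carries a quarter of the clock trunk's current, but through a fifth of the cross-section.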
Another problem like this that is starting to become a real issue is designing systems that are resilient to soft errors caused by radiation-induced SEE. There can also be effects that damage the chip so that it will not recover without being powered off or, perhaps, will never recover: single-event latchup, single-event gate rupture, and single-event burnout.

The first companies to take this problem seriously were the high-end networking and server vendors. You may not like your smartphone crashing, but companies like Cisco and their customers really hate big routers going down. They were the companies that first came up with requirements in this area, and those requirements get pushed down the semiconductor manufacturing stack. The automotive requirements in ISO 26262 mean that systems have to be more reliable than the underlying silicon is capable of delivering, and SEE are one of the big issues that have to be addressed. It is okay if your mobile phone reboots; your anti-lock brakes cannot.

Like the metal migration issue, the problem needs to be addressed at multiple levels. The materials used in manufacture need to be analyzed, not just in the fab but also the packaging material, bumps, and solder. But cosmic rays are a fact of life, so even with the best materials there is still a risk of SEE. How big a risk depends on the design of the cells (flops and memories that can be flipped into the wrong state) and on the layout of the design itself. I have seen a theory that many of the random "blue screen of death" occurrences on early PCs were not software errors but SEE on chips, at a time when nobody was even aware of SEE being an issue. Famously, there was even a lot of trace uranium in semiconductor packages back in the early '80s.
Just as we can accelerate metal migration by raising the temperature, we can analyze a product by putting it in a more radioactive environment, in particular by bombarding it with accelerated neutrons. That immediately raises the question of how you accelerate neutrons, since they don't respond to electric fields. (A neutron walks into a bar. "How much for a beer?" "For you, no charge.") The answer is that you accelerate deuterium nuclei and smash them into a metal plate, where a lot of them come apart. You can then deflect the remaining protons with an electric field, and you are left with just the neutrons.

However, while testing real systems in neutron beams is great for in-depth reliability analysis, it is pretty useless during the design phase, where we need tools to analyze the problem before tapeout and manufacture, when we can still do something about it.

At one level, this is a problem that gets worse from one process generation to the next, but not quite in the way that you might think. The same design implemented in a more advanced process is more resistant to SEE (see the above graph). That's because it is physically smaller, both at the level of a cell like a flip-flop and at the level of the whole die, so with the same particle flux it is less likely to get hit. Memories, though, suffer from the fact that when they do get hit, more than one bit can be affected (see the diagram at the start of the post). It is important to understand how the ECC used interacts with adjacent multi-bit failures.

In practice, the problem does get worse from one process generation to the next because companies do not implement the same design at a smaller die size. The die size remains roughly the same and more functionality is crammed in. So there are more flops and more memory bits to be hit, even if the probability of any individual one getting an error is declining. SOI processes, most notably FD-SOI at advanced nodes, seem to be more resistant.
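The scaling argument above (fewer upsets per bit, but many more bits on the die) is simple arithmetic, and a few lines of Python make it concrete. This is only a sketch under stated assumptions: the per-bit rates and bit counts are invented, the units are arbitrary, and treating every bit as independent and equally vulnerable is a simplification.

```python
def chip_upset_rate(per_bit_rate: float, n_bits: int) -> float:
    """Expected upsets per unit time for n_bits independent, identical bits."""
    return per_bit_rate * n_bits


# Invented numbers in arbitrary units, for illustration only.
OLD_PER_BIT = 1.0e-3  # per-bit upset rate in the older process
NEW_PER_BIT = 0.5e-3  # assumed: each bit is half as likely to be hit when smaller

old_chip = chip_upset_rate(OLD_PER_BIT, 1_000_000)

# Same design shrunk to the new process: same bit count, lower per-bit rate.
same_design_shrunk = chip_upset_rate(NEW_PER_BIT, 1_000_000)

# What happens in practice: same die size, so (say) four times as many bits.
same_die_size = chip_upset_rate(NEW_PER_BIT, 4_000_000)

print(old_chip, same_design_shrunk, same_die_size)
```

Under these assumptions, the shrunk design sees half the upsets of the original, but the chip that keeps the same die size and fills it with four times as many bits sees twice as many, even though every individual bit got more robust.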
If you are a designer, especially if your designs go into products that require high reliability (medical, automotive, internet infrastructure, etc.), then you need to start worrying about the possibility of SEE. The end customers (automotive companies, cloud infrastructure companies, router and base station companies, etc.) are starting to have specifications for SEE reliability that will then get driven down into the supply chain. Even the most casual observer of the semiconductor industry has now heard of ISO 26262, even if they don't understand the details. I attended IRPS in Monterey and will write some future posts on reliability. This post is the introductory background reading.
