Episode V: WID & D2D
When bad things happen to good chips
Next week begins the first in a lengthy string of conferences – 7 to be exact, over the course of the next 8 weeks. Apart from DATE in Munich, most of these conferences will be taking place in and around Silicon Valley. Over the next 8 weeks, there are going to be enough panels, keynotes, sessions, tutorials, exhibits, demonstrations, press conferences, backpacks, proceedings, and PowerPoint slides to sink a ship. If you like to learn stuff, this upcoming February and March will be the best 8 weeks of your life. If, however, you tend to suffer from sensory overload, these will be the very worst 8 weeks of your life.

So, do you have a game plan for survival? Do you have a strategy for not being overwhelmed by it all? Most importantly, do you know how to manage the Meeting Monster so that you'll actually end up knowing more in 8 weeks than you do right now?

I only throw out these questions – fraught with anxiety and/or opportunity as they may be – because this article is about a conference that happened way back last November, and a phone call with iRoC's Michael Buehler-Garcia and Damien Chardonnereau that happened just this week. Never mind what I'll have learned 8 weeks from now, I'm still looking back trying to grapple with what I may have learned 8 weeks ago – specifically at IEEE's ICCAD in November in San Jose – and what I learned during the iRoC phone call.

****************************

Cosmic Rays are not our friends ...

Michael Buehler-Garcia is the Vice President, Worldwide Marketing and Business Development, at iRoC. The iRoC message is that with the introduction of nanometer technologies, "new types of faults are appearing and existing EDA tools are reaching limits in their capacity to predict the physical behavior of new semiconductor technologies" – and all of this is making chips more and more vulnerable to "internal defect and external aggressions."
This is pretty interesting stuff – and struck me as closely linked to what I heard last November at ICCAD. So here's what I learned during a Q&A with Michael and Damien. By the way, Star Trek – The Movie was the only way to prepare for this interview.

Q: What are external aggressions?

Michael & Damien – At iRoC, we see external aggressions as those outside forces, environmental actions that are outside of the chip environment, that impact the in-field reliability of semiconductor chip designs and performance. To compare and contrast – a classic reliability problem is a gate failing due to a manufacturing defect, but that's an internal problem. External aggression, on the other hand, is a bit flip that happens due to an alpha particle from a package or a neutron strike from a cosmic ray.

These days the package is less of a problem than it has been in the past as it's being made with cleaner materials. Packaging materials used to have a lot of boron – BPSG, which stands for borophosphosilicate glass, was used up until around 2000. But, it was discovered by people at TI that impurities from boron packaging materials were bombarding the chip with alpha particles, and causing soft errors. Actually, even the soldering compounds can be a source of problems in a chip. When BPSG was moved out of common usage, alpha particles weren't so much the problem – then neutron strikes became the dominant factor in soft errors.

Cosmic rays from solar flares have been here forever, but now they're becoming an 'external force' impacting reliability of chips. The message here is that you can talk about a lot of reliability issues having to do with fabs, quality assurance managers, and engineers who work to remove those issues, but external aggression issues are in a completely different category.

Q: What's a soft error?

Michael & Damien – A soft error is an error that's caused by one of these particles.
It's not like a defect that stays around and damages a chip, such as one that causes physical damage to a gate or a transistor. A soft error damages the state of the transistor at that moment, up to and including the latch-up. With a soft error, when you power down and back up, the error is gone.

Q: How can people in the chip business predict how these particle strikes are going to impact their products?

Michael & Damien – Well there's the hard way, but we're trying to propose an easier way. The hard way is that you develop a nuclear database. There are one or two military organizations and some industrial consortia that have done that. Then, you translate that data into TCAD models and run a SPICE-level simulation analysis on your chip. But that all takes a long time.

At iRoC, we're proposing developing models for soft errors that would work at synthesis speeds, and that could be utilized at the design level. We have access to various databases, we've developed statistical models, and we've developed actual test chips to get to this point. We're taking our wafers and testing them in our neutron beam shuttles. We're making these services available to our customers as well.

Q: What are neutron beam shuttles?

Michael & Damien – There are approximately 10 neutron beam shuttles in the world, most of them linked to various kinds of nuclear research. We're partnering with the ones at Los Alamos, at TRIUMF in Vancouver, at TSL in Sweden, and at Delft in The Netherlands. Our customers can have us test their wafers in one or more of these neutron beam facilities.

The reason we've partnered with four different locations is that each one of them is characterized by a different neutron beam flux. For instance, the beam at Los Alamos is a full-spectrum beam, whereas the beam at TSL is more characterized by the energy level of that beam. It's a mono-energy beam that lets you determine at what energy level the gate will flip.
If you've got a soft error on your chip because it's been struck by a cosmic ray, you may not care at what energy level that error occurred. However, if you know the energy level that's problematic for your product, you can decide to only place your device in an operational environment where there won't be problems.

Q: So beyond these neutron shuttle options that you're making available to customers, what's iRoC providing by way of design tools?

Michael & Damien – So, what we're releasing this week is a product that represents our commitment to developing math models that can be used by designers during the design phase of developing their product. It's a software platform where the designers can do the interactions using the models, and then look at the results to help make decisions.

Already designers today look at power, speed, and area trade-offs in sorting out a design. And, of course, at 130 nanometers, they're also looking at yield trade-offs along with the power, speed, and area. What we're suggesting is that there's now a 5th element in design and that's reliability. This is going to be one of the most important considerations in design going forward.

Q: So I know that you've got some kind of research operations going on in the Alps. What's that all about?

Michael & Damien – On the top of Jungfrau in Switzerland – besides our engineers having a chance to do a lot of skiing, they're also looking at the normal radiation that's impacting chips due to cosmic rays. The neutron beam at Los Alamos, for instance, is giving you approximately 10 million times the normal, naturally occurring radiation that you find in the environment. It's important to look at the results of that beam, but it's creating failures that aren't realistic for the normal life of a real device. So the concept at Jungfrau is to use the exposure to cosmic rays at an elevation where it's 11 times as intense as what you would observe at sea level.
The results, for instance, of 10 minutes in the beam at Los Alamos for a device are about the same as for 6 months of exposure to cosmic rays on the top of Jungfrau. So far, we've done the experiment on three different devices and we're continuing to work on producing more data.

Q: Isn't there some iRoC work also going on in a tunnel beneath the Alps?

Michael & Damien – What we're doing in the tunnel is to compare the results in that environment with the results on the top of Jungfrau. In the tunnel, you take the neutron flux down to a hundred-thousandth of what the exposure is at sea level. So in that environment, all the soft errors really are only due to alpha particles and not to neutrons.

The experiments in the tunnel have an even longer time frame than on top of the mountain. We can see some statistically significant results after 3 months on Jungfrau, but it can take over 6 months to get statistically significant results in the tunnel.

The whole idea here is to develop data for modeling that's about post-design reliability in the chip. The thought is to come up with more rapid tests and to find applications for your chips where these soft errors won't matter. Soft errors are going to occur – but you can choose to locate your chips where they won't impact reliability.

Our solutions could have the hardware flag the software and say, "I've suffered an attack here." Then the software that runs across the device could offer up an answer to the transient problem. You can trade off between area and speed on the device to develop crucial soft error protection.

Q: Do you find this whole subject to be pretty tough for people to get a handle on?

Michael & Damien – Yes, this is very complex, and it's really tough for a lot of people to understand.

Michael – When I was at PDF Solutions, I struggled to explain errors that were at the nanometer level. Now I've come to a company where the explanation challenges are even worse.
Now we're working on problems at the nuclear level and relating them to the RTL part of the design – that's basically what we're trying to do here at iRoC. [Laughing] Of course, as I've told you in the past, certainly my background as an engineering graduate of ASU – Arizona State University – helps to explain my own personal success in explaining all of this to customers!

Q: Do you think that engineering students in school today should be moving even closer to the physics of semiconductors and transistors, rather than moving farther away as some people say they are?

Michael & Damien – It depends on the type of designer you're talking about. We tend to use the term designer in a generic sense. In the fabless world, for instance, it's the designer who understands the chip. But at IDMs, designers may be the people who are building the chip and then telling the systems guys how to use it. We need to distinguish between system-level designers and chip-level designers who are focused on the physical implementation of the chip. That's the person who needs to understand the implementation, the physics of the device, and the yield issues in semiconductors.

The EDA industry is always looking for ways to expand. Companies like iRoC and PDF Solutions are companies that don't build chips, but try to make the chips that other people are building better. This is where the EDA industry should be looking to expand their business opportunities. It's a whole new field and the size of the market hasn't even been defined yet. We're talking here about wafer starts, wafer teams, even some insurance-level types of risks that companies who build chips are starting to look at. We're involved in all of that.

We believe that if you've got mission critical data on your chip, then the stuff we're talking about here is totally a matter of the reliability of your device when a cosmic ray strike occurs.

[Editor's Note: Beam me up, Scotty!]
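
For readers who want to play with the exposure figures quoted in the interview, here is a minimal sketch in Python that converts test time in one environment into the equivalent time in another by scaling with relative neutron flux. The factors are the approximate ones the iRoC team quoted (about 11 times sea level on Jungfrau, about a hundred-thousandth of sea level in the tunnel, about 10 million times natural levels in the Los Alamos beam). Treating the Los Alamos figure as relative to sea level is an assumption, and the names RELATIVE_FLUX and equivalent_time are hypothetical, made up for this illustration rather than taken from any iRoC tool.

```python
# Approximate neutron flux relative to sea level, using the round numbers
# quoted in the interview.
# Assumption: the "10 million times" Los Alamos figure is relative to sea level.
RELATIVE_FLUX = {
    "sea_level": 1.0,
    "jungfrau": 11.0,
    "tunnel": 1e-5,
    "los_alamos_beam": 1e7,
}


def equivalent_time(duration, source, target):
    """Return the time in `target` that gives the same neutron exposure
    (flux multiplied by time) as `duration` spent in `source`.

    The result is in the same units as `duration`.
    """
    return duration * RELATIVE_FLUX[source] / RELATIVE_FLUX[target]


# Three months on Jungfrau expressed as months at sea level.
print(equivalent_time(3, "jungfrau", "sea_level"))
```

This scaling only counts neutrons. It ignores the energy spectrum of each source, which the iRoC team points out differs from one beam facility to another, so the numbers it produces are rough guides at best.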
****************************

Nature itself is not our friend ...

Mid-morning on November 10th there was a session at ICCAD that addressed variability in chips, how to predict that variability, and what to do about it. The operative terms there were:

WID – variations within the die
D2D – variations from die to die

These are not complex concepts. Mounds of traditional EDA marketing collateral notwithstanding, the universe is an imperfect place, and although you may think you're laying down layer upon layer of flawless silicon in a wafer – short of an intentional impurity here and there – and row upon row of flawless transistors, gates, vias, and interconnects – the fact is that nature is not your friend. Irregularities creep into even the most carefully assembled batch of silicon and circuitry, and just like cars and spouses, every once in a while a lemon comes rolling off the assembly line at the other end.

The question, therefore, is how to spot the lemons and – better yet – how to predict the lemons before they're even manufactured. For advanced students, the additional question is how to repair and salvage a lemon if you've been so clever as to spot it as it tries to blend innocuously into the crowd.

One of the presentations made on the morning of November 10th at ICCAD was titled "Process and Environmental Variation Impacts on ASIC Timing" and was authored by Paul Zuchowski, Peter Habitz, Jerry Hayes, and Jeffery Oppold of the IBM Systems & Technology Group. This 30-minute presentation distinguished between systematic variations and random variations, and discussed the timing impact in ASICs of those different types of variations. The presentation suggested that not all types of variations are equal, and that modeling the systematic variations on a chip is to be recommended in advance of manufacturing whenever possible and/or practical.
The IBM team lamented, however, that the existing circuit characterizations and static timing analysis (STA) flows are inadequate to the task at hand. The authors said that nasty, unwanted systematic variables might correlate to a die, a wafer, or even (heaven forbid) to an entire wafer lot. Such variables might include things like mean channel length or mean threshold voltages across the entity under consideration. Alternatively, nasty, unwanted random variables might include things like a specific channel length, a specific threshold voltage, or a specific level of metal resistivity at a specific location on a specific die. Random variables are all about WID; systematic variables are all about D2D. None of these variables are people your kids are going to want to bring home to meet the family.

Having set the stage, the IBM team then described a possible strategy for characterizing both systematic and random variations, WID and D2D. Their suggestions were pretty complex, but the audience seemed to think it was pretty spellbinding stuff.

Following the IBM presentation, a guy from Intel also had 30 minutes at the open mike. Samie Samaan from Intel's Desktop Products Group, Circuit Technology, gave a presentation titled "The Impact of Device Parameter Variations on the Frequency and Performance of VLSI Chips."

His info and conclusions went something like this: Parameters vary from die to die within the same wafer. Parameters also vary from lot to lot. Within the die, parameters vary from transistor to transistor, or from one metal wire to another. Meanwhile, silicon-inherent variations can be partly systematic and partly random – both D2D and WID. And, such variations (heaven forbid) can persist throughout the lifetime of the chip/package/product.
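
The D2D and WID split that both talks leaned on can be illustrated with a small Monte Carlo sketch in Python. This is a toy model under stated assumptions, not the IBM or Intel method: each simulated die gets one delay offset shared by all of its gates (D2D), each gate gets its own independent offset (WID), both are drawn from normal distributions, and the nominal gate delay, the two sigmas, and the path length are made-up numbers. The function name simulate_path_delays is hypothetical.

```python
import random
import statistics


def simulate_path_delays(num_dies, gates_per_path, nominal_ps=10.0,
                         sigma_d2d_ps=1.0, sigma_wid_ps=1.0, seed=0):
    """Return one path delay (in ps) per simulated die.

    All numeric defaults are illustrative assumptions, not measured data.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    delays = []
    for _ in range(num_dies):
        # D2D: a single offset shared by every gate on this die.
        die_offset = rng.gauss(0.0, sigma_d2d_ps)
        # WID: an independent offset for each gate along the path.
        path_delay = sum(
            nominal_ps + die_offset + rng.gauss(0.0, sigma_wid_ps)
            for _ in range(gates_per_path)
        )
        delays.append(path_delay)
    return delays


delays = simulate_path_delays(num_dies=2000, gates_per_path=20)
print("mean path delay (ps):", round(statistics.mean(delays), 1))
print("std dev of path delay (ps):", round(statistics.pstdev(delays), 1))
```

In this model the shared D2D offset adds up along the path while the independent WID offsets partly cancel. With the numbers above, the expected standard deviation of the path delay is sqrt(20^2 + 20), or about 20.5 ps, whereas the WID term on its own would contribute only sqrt(20), or about 4.5 ps.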
Samaan offered that there might be design strategies that could get these kinds of nightmares under control – especially those design-phase related problems linked to tool inaccuracies, designer errors, node coupling, temperature, or Vdd. Samaan said that D2D variations are potentially manageable, particularly as next-generation process nodes mature. Here's a direct quote: "Even if control is not good, timing validation, micro-architecture, & circuit design could lower impact of process extremes." He also noted that random WID variations can be "small with a manageable impact on frequency."

The last presentation in that morning session at ICCAD was titled "Variability in Sub-100nm SRAM Designs" and was authored by Ray Heald and Ping Wang from Sun Microsystems. These guys asked – What's variable on DSM SRAM designs? – and then they told us: line edge roughness, and transistor widths and lengths due to optical and etch limitations.

Expounding on the bad news, they said, "The time available for sensing a memory cell's state is usually less than that necessary to achieve full logic level swing." Their conclusions: "Variation in device characteristics is becoming an acute problem for SRAM design. Circuit-wide process variations of 2 to 3 are typical and will not decrease with future technologies. Within the die, local mismatch parameter variations are not decreasing as quickly as the parameters themselves. The number of SRAM cells and other circuit blocks which must work on a chip is growing rapidly. Traditional mean and corner simulations must be supplemented with a number of statistical simulations to assure circuits will work and yield at the expected speed and reliability. Repair of slow cells must be made possible by design."

Did you get that last part? "Repair … must be made possible by design."

****************************

So I think you see where this is leading.
Everything that the fellows from IBM, Intel, Sun, and iRoC are talking about – all of it represents the overlap of the reality of the natural world with the idealism of design, the theoretical nature of materials science and the foibles of human error. All of this leads to what will definitely be the uber-topic over the next 8 weeks at:

DesignCon

The uber-topic that should be the focus of your strategy, as you go from conference to conference and presentation to presentation, is the tools and methodologies that can address flaws and variations – be they transient, systematic, random, statistical, probabilistic, or otherwise. Dealing with those variations – the difference between what the designer thinks the design will be doing and what the design really is doing once it's been manufactured and becomes operational – is what's going to make or break everybody in this business going forward as geometries move from 90 nanometers to 65 nanometers and beyond.

If you're listening over the next 8 weeks, you'll be hearing all about it!

****************************

Coming soon – Episode VI: DFM Tools Strike Back!

****************************
Peggy Aycinena owns and operates EDA Confidential. She can be reached at peggy@aycinena.com