Recently, I wrote about Robert Lang's presentation on Computational Origami. He was a real paper master. The CTO of AMD is Mark Papermaster, although I think he is more of a silicon master than a paper master. At SEMI's Strategic Materials Conference, he gave the opening keynote, The Future of Semiconductors—Moore's Law Plus.

At one level, there is an almost unlimited demand for compute power. This is driven by two things: first, users' desire for better experiences, such as face recognition and VR; and second, datacenters, where increased compute power means fewer servers are needed to deliver the same capability. But as everyone knows by now, while Moore's Law has not stopped from a technical point of view, some aspects of it have slowed considerably. We can double the number of transistors from node to node, but not at constant price. And we can build more and more cores into our processors, but power constraints stop us from pushing the clock frequency up much. That has led to the graph Mark showed. We get more transistors, and lower power, but not enough lower power to keep the frequency gains coming: dynamic power scales roughly as CV²f, and once supply voltage stopped dropping with each node, frequency had to flatten to stay within the power budget. So the red dots go up, and the green dots flatline.

Moore's Law+

We are now in what Mark calls the era of Moore's Law+, where we need to use other techniques to deliver what users require. There are still some legs left in traditional Moore's Law, with the impending introduction of EUV, new transistor architectures (such as gate-all-around), new interconnect materials (ruthenium?), and so on. This looks like it will give us a route to 3nm and maybe a little further, before something radically different is required. But there are major cost challenges: mask costs are increasing 2-3X (around 50 masks at 16/14nm, going up to 80 masks at 7nm without EUV). Die size is also increasing, at least in AMD's markets for CPUs and GPUs (it seems to be decreasing in mobile, judging by Apple's recent A11). GPUs, in particular, are approaching the 800mm² reticle limit. However, big dies bring their own yield problems.

AMD's recent EPYC processor is actually a multi-chip module. Instead of containing a single 32-core die, it contains four 8-core dies (each die is built from two four-core core complexes, labeled CCX). This has a theoretical cost reduction of 41%. It is theoretical since AMD never built the big 32-core die, but fab yield models are pretty good.
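As a back-of-the-envelope illustration of why the yield models favor splitting a big die into chiplets, here is a minimal sketch in Python. It assumes a simple Poisson defect model and purely illustrative numbers (a die area roughly that of an 8-core EPYC die, an invented defect density); AMD's real yield models and fab data are not public, and the 41% figure also has to absorb the packaging overhead that this sketch ignores.

```python
import math

DEFECT_DENSITY = 0.1   # defects per cm^2 -- assumed, illustrative only
SMALL_DIE_AREA = 2.13  # cm^2, roughly one 8-core EPYC die
BIG_DIE_AREA = 4 * SMALL_DIE_AREA  # the hypothetical monolithic 32-core die

def poisson_yield(area_cm2: float, d0: float) -> float:
    """Fraction of dies with zero killer defects under a Poisson model."""
    return math.exp(-area_cm2 * d0)

# Silicon cost per good die is proportional to area / yield: the total
# silicon consumed divided by the number of working dies you get back.
cost_big = BIG_DIE_AREA / poisson_yield(BIG_DIE_AREA, DEFECT_DENSITY)
cost_small = 4 * SMALL_DIE_AREA / poisson_yield(SMALL_DIE_AREA, DEFECT_DENSITY)

print(f"Monolithic 32-core die yield: {poisson_yield(BIG_DIE_AREA, DEFECT_DENSITY):.1%}")
print(f"8-core die yield:             {poisson_yield(SMALL_DIE_AREA, DEFECT_DENSITY):.1%}")
print(f"Silicon saving from 4 small dies: {1 - cost_small / cost_big:.1%}")
```

With these made-up numbers the saving comes out somewhat above 41%; the point is the shape of the argument (yield falls off exponentially with die area), not the exact figure.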
This is an example of using advanced packaging to extend Moore's Law (often given the catchy name More than Moore). Mark predicts more of this, with chiplets available for package-level systems, stacked dies (especially high-bandwidth memory), and increasing interconnect density as µbump and flip-chip technologies improve. As Mark put it:

What used to be board-level integration is moving into the package

One key technology to make this feasible is the interconnect fabric. AMD's Infinity Fabric can be used within a package, for a CPU to communicate with a GPU, or for multiple CPUs in a cluster to communicate. This makes it possible to scale from chiplets to solutions involving many chips, all with the same silicon.

Memory

In my post about the SEMI Strategic Materials Conference, I said that Dave Hemker of Lam Research was kept awake at night by DRAM. It is reaching the point where scaling is getting really hard, and there is no equivalent in sight to the 3D solution that rescued NAND flash. Mark is clearly having sleepless nights, too. DRAM density improvements are slowing, and the imminent requirement for double patterning is likely to kick costs up. Meanwhile, in his end markets, server system DRAM requirements keep increasing.

There are some new memory technologies on the horizon, such as magnetic (MRAM), resistive (RRAM), and phase-change. Memory dies can also be stacked (HBM2 being the current incarnation). These higher levels of integration deliver both higher performance and lower power. DIMMs run around 8pJ/bit with a bandwidth of 50-200GB/s. Going 2.5D, putting the processor on an interposer next to a stack of HBM memories, gets down to about 3.5pJ/bit with a bandwidth of 250-1000GB/s. Full 3D, stacking the memory dies on top of the processor die, gets below 1pJ/bit with a bandwidth of up to 1TB/s. I'm not sure how the thermal challenges work out for that last arrangement, since the hot processor die is covered up by the cool DRAM dies. I've seen presentations by IBM in the past that reckon you have to put the processor on top of the memory so that you can heatsink it (but then the processor connections have to pass through the memory dies, so they can't be standard HBM dies).
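To get a feel for what those energy-per-bit figures mean in watts, here is a trivial sketch that multiplies energy per bit by bit rate. The pJ/bit and bandwidth numbers are the ones Mark quoted; pairing each technology with the top of its bandwidth range is my assumption.

```python
# name: (energy in pJ/bit, bandwidth in GB/s) -- figures from the keynote
OPTIONS = {
    "DIMM (board-level)": (8.0, 200),
    "2.5D (interposer)":  (3.5, 1000),
    "3D (stacked)":       (1.0, 1000),
}

for name, (pj_per_bit, gb_per_s) in OPTIONS.items():
    bits_per_s = gb_per_s * 1e9 * 8          # GB/s -> bits/s
    watts = pj_per_bit * 1e-12 * bits_per_s  # pJ/bit -> J/bit, times bit rate
    print(f"{name:19s} {watts:5.1f} W at {gb_per_s} GB/s")
```

Note that the 2.5D option draws more total watts here (28W versus 12.8W) only because it is moving five times the data; the win is in energy per bit, which is what matters at datacenter scale.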
On the subject of bandwidth, and not just for memory, silicon photonics is coming. It is already here at the level of boxes in the datacenter; it will move onto boards, and eventually inside the packages.

Software and Standards

The final element of Moore's Law Plus is software. Two big ones for AMD are:

- HSA (the Heterogeneous System Architecture Foundation), of which AMD is a founding member
- Radeon Open Compute (ROCm), with support for x86, Arm, POWER, and existing CUDA applications

Also important are open interconnect standards for heterogeneous accelerators (protocols and firmware stacks), such as CCIX (pronounced "see-six").

Summary

Putting it all together gives a breakthrough in performance. This system, based around the EPYC chips discussed earlier, delivers 1 petaflop in a single rack at full single precision; at half precision, it is 2 PFLOPS. Previously, the Department of Energy would have required a whole room of equipment to get this kind of performance.

So the big three Moore's Law Plus enablers are:

- Chiplets and multi-chip modules
- Heterogeneous accelerated computing
- An open ecosystem (software and communication technologies)

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.