Some weeks ago DARPA organized a summit at the Palace of Fine Arts in San Francisco. The first day consisted of a workshop and some other presentations, including one by Cadence's Tom Beckley. Since Tom's presentation was very similar to what he presented at CDNLive Japan (which I talked about in CDNLive Japan: The Fourth Industrial Revolution and the Third Dimension), I won't go over the same ground again. There was more than enough new material anyway.

John Hennessy

The second day opened with John Hennessy's keynote. I arrived just as he started, so I missed how he was introduced. He was on the PCAST Semiconductor Working Group, for one thing (see yesterday's post PCAST: The President's Council of Advisors on Science and Technology). He just received (along with Dave Patterson) this year's Turing Award. He was President of Stanford for years. Now he is Chairman of Alphabet, the parent company that owns Google, Waymo, and others.

John talked about how general-purpose computer architecture has run out of steam, for different reasons, but at around the same time as Moore's Law is running out of steam. We started with simple microprocessors that ran one instruction per clock. Then we moved into the instruction-level parallelism era, where the hardware hid what was going on and the programmer didn't have to worry. Next, due to power considerations, processor manufacturers switched to multicore, making the programmer responsible for identifying parallelism and encapsulating it in threads. However, Amdahl's law limits the effectiveness of this in most situations. For example, try running a program on 64 cores when just 1% of the code is serial and cannot be parallelized. You will end up with only about a 36X speedup, so you pay for 64 cores but get 36.

Worse, as we ramp up the number of cores, we run into the problem of dark silicon: thermal limits mean we can't turn them all on at once. For example, future 96-core processors running at 4.9GHz will dissipate 295W. If you have a more reasonable limit of only(!) 165W, then only 54 cores can be active (roughly in proportion to the power budget: 96 x 165/295 is about 54). There is not a lot of point in putting a lot of identical cores on a chip if you can't turn them all on.

The only thing left is domain-specific architectures (DSAs), sometimes just called accelerators. There are two reasons for this. One is simply that if you can't run all the cores at once, at least make them do different things. But the main reason is that computer architects have run out of ideas to speed up general-purpose code. The current way of doing things is very wasteful of power, since all the circuitry required to schedule instructions and do branch prediction (and mis-prediction) uses about 20 times as much power as the instruction being scheduled.

John had the same example, of a matrix multiply, that Dave Patterson talked about in his DAC keynote earlier this year (I think it came from the joint presentation that Hennessy and Patterson put together for their Turing Award lecture). For more on Dave Patterson's keynote, see DAC Wednesday: Denali, Patterson on Architecture, Rowen on Deep Learning, Analog Reliability Lunch, Bagpipes. Just switching from Python to C gives you a 47X speedup. Using multicore and parallel loops gives roughly another 8X (for a total so far of 366X). Optimizing memory handling gets you to 6,727X, and using x86 SIMD (vector) instructions gets you to 62,000 times faster than the original code. Of course, this is cheating a little, since matrix multiply is a sort of best case.
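Going back to the Amdahl's law point, the arithmetic is easy to sanity-check yourself. This is my own illustration, not a slide from the keynote; the idealized formula gives a slightly higher number than the ~36X quoted above, which presumably folds in some additional overhead:

```python
# Amdahl's law: speedup = 1 / (s + (1 - s) / N), where s is the serial
# fraction of the program and N is the number of cores.
def amdahl_speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

print(amdahl_speedup(0.01, 64))    # ~39X with 1% serial code on 64 cores
print(amdahl_speedup(0.01, 1024))  # ~91X -- piling on cores stops helping quickly
```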
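To get a feel for where the bottom of that speedup ladder comes from, here is a minimal sketch (again my own, not the benchmark from the keynote) comparing a naive pure-Python matrix multiply against NumPy's BLAS-backed multiply, which already exploits blocked memory access and SIMD under the hood. The exact ratio depends on the machine, but the gap is dramatic:

```python
# Minimal illustration of the pure-Python vs. optimized-library gap for
# matrix multiply. Not the keynote benchmark; ratios vary by machine.
import time
import numpy as np

N = 256  # keep small -- the pure-Python version is very slow
A = np.random.rand(N, N)
B = np.random.rand(N, N)

def naive_matmul(a, b):
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

start = time.perf_counter()
naive_matmul(A.tolist(), B.tolist())
t_python = time.perf_counter() - start

start = time.perf_counter()
A @ B  # BLAS-backed: blocked for the memory hierarchy and vectorized
t_blas = time.perf_counter() - start

print(f"pure Python: {t_python:.2f}s, NumPy/BLAS: {t_blas:.4f}s, "
      f"ratio ~{t_python / t_blas:.0f}X")
```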
But John's point was to look at how much inefficiency is in the system just to make the code easy to write. John said that DSAs are not magic, but they make more effective use of parallelism for a specific domain, and more effective use of memory bandwidth. They can also move to SIMD, which is inherently more efficient. Indeed, the next day, Bill Dally, NVIDIA's Chief Scientist, would show that by doing "really big instructions like matrix multiply" the overhead of using a programmable processor was only 30%. What that means is that even if you built specialized fixed-function hardware without the programmability, the most you could save is 30%. DSAs really can be that close to optimal. They can get even closer to optimal when you throw away unnecessary accuracy: it turns out most neural nets don't need 32-bit floating point, they are just fine with 8-bit fixed point.

DSAs require targeting high-level operations (matrices, sparse matrices, TensorFlow graphs, and so on) to the architecture, instead of starting with C or Python and trying to recover them from the code. There are some challenges in writing the code at the right level to still retain architectural independence, which is obviously very desirable. One of the key insights of RISC (which is what Hennessy and Patterson invented) was that people were not going to write assembly code any more; it would all be high-level languages and compilers. A similar shift is going on here with domain-specific languages. People are not going to write C any more, they are going to write domain-specific code and run it on optimized architectures.

John talked briefly about the PCAST Semiconductor Working Group. He said that the most important recommendation is that the only way forward is to run faster: reinvent the future. Information technology is, today, the most important economic and security asset for any nation, combining hardware, software, and creativity to compete by running faster.

William Chappell

Dr. William Chappell is the Director of DARPA's Microsystems Technology Office and the head of the Electronics Resurgence Initiative (ERI). He presented the initiative shortly after John Hennessy's opening presentation, calling it a "love letter to Moore's Law." Indeed, the agenda itself is sprinkled with quotes from Gordon Moore, starting with this December 2017 one, which Bill opened his presentation with:

Since my 1965 paper that ERI references, what has actually happened in the intervening 52 years is far beyond anything I contemplated. It is a testimony to the many engineers and scientists that the industry has surmounted apparent roadblocks that looked to be the end of transistor scaling.

He said that we are at a unique moment in time, in terms of cost, abstraction, foreign investment, and rising stakes. Computer Science departments are exploding, but EE departments are...not. They are flat. China is investing $150B, which is a staggering number that has never been seen in the history of the world. [To put that in perspective, I just looked up the cost of the NASA moon missions, and it was $107B in 2016 dollars.] It is a clear attempt to move the center of the electronics industry eastwards [or, from San Francisco, westwards]. The US government has a particular problem, with no access to leading-edge processes for semiconductor fabrication and state-of-the-art assembly, due to "our own regulations." I'm assuming this means things like fanout 3D packaging.
However, I had assumed that they had access to 14nm at GlobalFoundries' Fab 8 in New York, but apparently that is too entangled with Samsung, since it is their process.

Another problem that everyone from DARPA talks about is that military designs are all about design cost. The production volume might be one per jet fighter, so if they can get the design cost way down, even at the expense of an order of magnitude increase in manufacturing cost, that is a good deal. So they are very focused on design cost. In fact, Bill went back to Gordon Moore's paper, which included the quote:

Perhaps newly devised design automation procedures could translate from logic diagram to technological realization without any special engineering.

Bill talked about some specific programs:

N-ZERO: a program for intelligent but unattended sensors in the 10nW range. "Trillions of devices will not exist if we have trillions of batteries to charge."

SHIELD: the art of the small (<100µm²), at less than 1¢ per chip, but with an anti-counterfeiting perspective.

ReImagine: reconfigurable imaging, multi-mission imaging in a single camera.

L2M: Lifelong Learning Machines. In-field learning, detecting surprises that were never seen during training.

SC2: the Spectrum Collaboration Challenge. This is a Grand Challenge: how do you impart edge intelligence without a spectrum manager, so that all the devices just "get along"? There is a bit of background on the issues here in my post from last year GOMAC: A Conference that Starts with the National Anthem.

SSITH: System Security Integrated Through Hardware and Firmware. This is what it sounds like, combining hardware and software to construct systems that are secure against external software attack.

Bill wrapped up by pointing out that Moore's Law has been running for 50 years, but now we need to find the next exponential. He emphasized that "we have congressional support" and mentioned the PCAST report that I covered in PCAST: The President's Council of Advisors on Science and Technology.

More Programs

Following that were presentations on several more programs:

CHIPS: Common Heterogeneous Integration and Intellectual Property Reuse Strategies. Since Cadence is the prime on this program, I will cover it in detail later in the week.

3DSoC: Three-Dimensional Monolithic System-on-Chip.

FRANC: Foundations Required for Novel Compute.

IDEA and POSH: Intelligent Design of Electronic Assets, and Posh Open Source Hardware.

IDEA is where the Cadence MAGESTIC program fits in, and David White presented it. For more details, see my post from that day, Cadence is MAGESTIC, which includes some of what David presented (since I had access to the presentation the week before). Also part of IDEA, Andrew Kahng of UCSD presented OpenROAD (Foundations and Realization of Open, Accessible Design), which aims to put together a workable design flow using only open-source design tools. As it happens, back when I last worked for Cadence in the early 2000s, both Andrew and I were on the Cadence Technology Advisory Board (TAB), so I know him well. I will write about OpenROAD later in the week too.

So look for more posts about the summit in the rest of the week.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.