The Linley Halloween Processor Conference

Halloween saw the latest Linley Fall Processor Conference. Only a few years ago, processors were fairly boring. Intel and Arm would regularly come out with new designs, and some specialized processor families like Tensilica would update their offerings. I think that three things have changed.

One is the end of Moore's Law, at least for general purpose processors. Yes, you can add more cores, but outside of cloud servers running a mixed workload, there is little to do with them.

The second is the end of the general purpose processor roadmap. For years, processor performance improved at 40% per year, then 20% per year. Now maybe 2%. We've added every architectural innovation we know of (out-of-order execution, branch prediction, deep pipelines, caches, speculative execution). It is also horribly inefficient, with almost all the energy consumed managing which instructions get executed, and very little spent actually executing them. The result of those two things is that future developments in processors are going to be in specialized architectures. It does no good to add another core that does the same as all the other ones, but adding a core that does something special perhaps a dozen or even a hundred times more efficiently than a general purpose processor is attractive.

Finally, the third thing is the explosion of interest in deep learning, artificial intelligence, and neural networks. In the last five years or so, neural networks have been discovered to "work" for a wide range of applications such as image recognition, voice recognition, and some aspects of ADAS for automotive. On its own, this wouldn't drive a lot of interest in processors, but it turns out that general purpose processors are terrible at running neural networks. The basic assumptions of general purpose processors, that you don't know what is going to happen next so the hardware has to schedule it, and that caches make it all workable, are wrong for neural networks. The order of the instructions, and which ones can be run together, can be determined statically, and caches just consume energy without providing any speedup. The result of this is two-fold. General purpose processors are fighting a rearguard action by adding special instructions for neural network evaluation, since it is easy to get some speedup. But more significantly, new specialized processors are being created. Where do you announce a processor like that? At the Linley Processor Conference. Result: this was the most heavily attended Processor Conference ever, with every seat in the room taken and people standing at the back.

Cadence took advantage of the occasion to announce their latest audio processor, the Tensilica HiFi 5. I wrote a post about it that morning: "Alexa, What Is HiFi 5?"

Linley Gwennap's Keynote

Linley titled his keynote Breaking New Bottlenecks in Processor Design. He opened by making some of the points I made above, before diving down into the details of just why general purpose processors are a poor match for AI, and how specialized systolic architectures achieve better results without sacrificing programmability. If you've heard the word "systolic" outside of this context, you probably know it is something to do with the heart. Indeed, that is where the name comes from. On each "heartbeat", data is streamed from one unit to another, without needing to make a round trip out of the processor and back.
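To make the systolic idea a bit more concrete, here is a toy Python simulation of an output-stationary grid of multiply-accumulate (MAC) cells. It is my own sketch of the general technique, not the TPU's or any vendor's actual design, and nothing like the scale of a real array: operands stream in from the edges, every cell does one multiply-accumulate per heartbeat, and nothing makes a round trip through a cache.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy, cycle-by-cycle simulation of an output-stationary systolic array.

    Each cell (i, j) owns one accumulator for C[i, j]. On every "heartbeat" it
    multiplies the operand arriving from its left neighbour by the one arriving
    from above, adds the product to its accumulator, and passes both operands
    on. There is no cache and no instruction scheduling: the data movement
    is the schedule.
    """
    n = A.shape[0]
    acc = np.zeros((n, n))      # one accumulator per MAC cell
    a_reg = np.zeros((n, n))    # A operands flowing left-to-right
    b_reg = np.zeros((n, n))    # B operands flowing top-to-bottom

    for t in range(3 * n - 2):  # enough heartbeats to drain the array
        # Each cell forwards last cycle's operands to its neighbour.
        a_reg = np.roll(a_reg, 1, axis=1)
        b_reg = np.roll(b_reg, 1, axis=0)
        # Feed the skewed edges: row i of A and column j of B enter one cycle
        # later per row/column so matching operands meet in the right cell.
        for i in range(n):
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
        for j in range(n):
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < n else 0.0
        # All n*n cells do one multiply-accumulate in parallel.
        acc += a_reg * b_reg
    return acc

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

A real array would of course be fixed-function hardware with quantized arithmetic, but the point the simulation makes is the same one Linley made: the sequencing is baked into the structure, so none of the energy goes into figuring out what to do next.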
Evaluating neural networks is largely a matter of doing a very large number of multiply-accumulate (MAC) operations using weight data. A general purpose processor might have 48 execution units; a specialized neural network processor might have thousands. The Google TPU, which was the topic of the second day's keynote and which I'll cover next week, has a systolic 256x256 matrix multiplier array...and that is the v1 version of the TPU, and they are now up to v3.

Security is another area that has become a performance bottleneck. Spectre and Meltdown (and the more recent Foreshadow) use features of speculative execution to steal data in one process from other processes or the operating system. For a take by the experts from this summer, see my post from HOT CHIPS Spectre/Meltdown and What It Means for Future Design. I won't go into the details here, but defending against Spectre, Meltdown, and Foreshadow has a performance impact. You can't "just turn off speculative execution" since that would reduce processor performance to about 5% of what it is currently. Mitigation means turning some aspects of speculation off some of the time. The table on the left shows the impact on several benchmarks, with the impact being as much as 30%. As one of the panelists said at HOT CHIPS, this has come along just at the point that Moore's Law and computer architecture have run out of steam, so losing 30% is a loss we won't get back just by waiting another year.

I've pointed out many times before that inference is moving to the client. In my post Bagels and Brains: SEMI's Artificial Intelligence Breakfast, Arm's Steve Roddy gave three big reasons why:

- Laws of physics: you can't move all the bytes through the internet with the required low latencies.
- Laws of economics: consumer platforms are where the economies of scale are (8B smartphones times 8 cores gives 64B high-performance CPUs).
- Law of the land: privacy laws are getting stronger year by year, and keeping user data on the user's device eases compliance.

In practice, as Linley pointed out, SoC designers typically use specialized AI IP cores from companies like Cadence Tensilica, since SoC designers lack AI expertise. Not to mention all the usual differentiation arguments: if you are going to design the same core as you can license, then why bother? Memory bandwidth for AI processors is important. Indeed, it is not hard to put so many MACs onto an SoC as to max out the memory bandwidth. At the high end, some approaches are using specialized memory like GDDR and HBM.

Automotive

Next, Linley moved on to automotive. There are many partnerships to share the cost, such as Aptiv–Lyft, Audi–Huawei, Baidu–Daimler, Honda–GM/Cruise, Toyota–Uber, Waymo–Fiat/Chrysler. A lot of development is going on using NVIDIA GPUs. Some are developing custom ASICs, which is obviously an opportunity for IP vendors. In Linley's opinion, Intel/Mobileye is now two years behind, with the 12 TOPS EyeQ5 now scheduled for 2021 production. I don't know if this is connected to Intel's 10nm delays. By comparison, NVIDIA's Xavier delivers 30 TOPS (in 30W) and their future 4-chip Pegasus (nice name for a physical verification suite) promises 320 TOPS (in 400W). It goes without saying that functional safety is important in automotive. I'm sure you've heard the magic number 26262 before.

Another big issue is that the industry is undecided on centralized versus distributed systems. There are obviously nuances, but centralized means that you put a powerful processing unit at the center and feed it raw data from all the sensors. Distributed means that each sensor pre-processes its data and forwards the processed data to a less powerful central decision unit. The centralized approach obviously requires much more power in the central processing unit, and higher bandwidth networks, but there are potential advantages to merging all the raw data without having lost any information to pre-processing.
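To get a feel for why the bandwidth requirements of the two approaches differ so much, here is a toy back-of-the-envelope sketch in Python. The frame size, object-record size, and sensor counts are invented purely for illustration; they are not numbers from the conference.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sizes, purely for illustration: one raw camera frame versus a
# small list of detected objects after per-sensor pre-processing.
RAW_FRAME_BYTES = 2 * 1920 * 1080   # ~4 MB per uncompressed frame
OBJECT_RECORD_BYTES = 64            # position, velocity, class, confidence

@dataclass
class Sensor:
    name: str
    frames_per_second: int
    objects_per_frame: int          # after local pre-processing

def centralized_link_bandwidth(sensors: List[Sensor]) -> int:
    """Raw data from every sensor is shipped to one powerful central unit.
    Nothing is lost to pre-processing, but the network and the central
    processor have to handle every byte."""
    return sum(s.frames_per_second * RAW_FRAME_BYTES for s in sensors)

def distributed_link_bandwidth(sensors: List[Sensor]) -> int:
    """Each sensor pre-processes its own data and forwards only compact
    object records to a much smaller central decision unit."""
    return sum(s.frames_per_second * s.objects_per_frame * OBJECT_RECORD_BYTES
               for s in sensors)

cameras = [Sensor(f"cam{i}", frames_per_second=30, objects_per_frame=20)
           for i in range(8)]
print(f"centralized: {centralized_link_bandwidth(cameras) / 1e9:.1f} GB/s")
print(f"distributed: {distributed_link_bandwidth(cameras) / 1e6:.2f} MB/s")
```

With these made-up numbers the centralized approach needs roughly a gigabyte per second of in-vehicle network and a central unit big enough to digest it, while the distributed approach ships well under a megabyte per second but has already thrown away everything the per-sensor pre-processing discarded. That is the trade-off the industry is still arguing about.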
IoT

IoT today is dominated by industrial applications, predominantly smart buildings and smart meters. See the pie chart to the right. However, Linley forecasts that will change as costs drop. The consumer market will grow. Voice (Alexa, Siri, etc.) provides an entry, with Alexa being built into door locks, lighting, and thermostats. By 2023 the market will split 74% consumer and 26% industrial.

Datacenter

Next was datacenter. I talked recently about datacenter networking in my post The World's First Working 7nm 112G Long Reach SerDes Silicon and Linley covered much of the same ground.

RISC-V

Next up was RISC-V. Linley pointed out that many chip startups have announced RISC-V based products but none are shipping. The RISC-V Summit is coming up at the start of December (see my preview post RISC-V Summit Preview: Pascal or Linux?). I will attend, so I'll cover the RISC-V ecosystem next month.

Summary

With that, Linley wrapped up. His summary points:

- AI acceleration is spreading from the data center to phones and IoT
- ADAS and self-driving cars create a big processor opportunity
- AI acceleration is moving to new specialized architectures
- Accelerators save power even if CPUs and GPUs can also do the work
- IP cores for AI acceleration simplify integration into ASICs and SoCs
- Industrial IoT is strong today, but consumer will be the largest market
- Smart NICs enable heterogeneous computing in the data center
- Open-source cores are gaining ground but still lag licensable cores

I'm not going to attempt to cover everything that was presented in the rest of the two days. If you are interested in everything, then you should attend. But I will cover the Google TPU keynote. This is the most significant AI hardware project, if only because it is one of the few that is deployed at scale. There were two talks about memory, from Rambus and Micron, that were also especially interesting in light of the heavy memory demands of AI. Two other presentations I'll cover here were Wave Computing and Arm's automotive solution.

Wave Computing

Two years ago I covered Wave Computing at the same conference in Wave Computing: a Dataflow Processor for Deep Learning. They described their DPU (dataflow processing unit). Fadi Azhari said about this chip:

We've seen silicon, it is in testing, in testing at customers in systems actually. It is in 16nm, runs at 6GHz, and contains 16,000 processing elements.

The other big thing about Wave Computing is that they acquired MIPS. The story is complicated: Imagination acquired MIPS, then lost Apple as a customer and decided to focus on its GPU business, but couldn't find a buyer for MIPS. Imagination was then acquired by private equity firm Canyon Bridge (sounds like an Intel processor), but Canyon Bridge was considered too Chinese, so MIPS was sold to Tallwood, and now to Wave. Got all that? There will be a test later.
They will continue to run the MIPS IP business as an IP business, but they also depend heavily on mixing MIPS processors and their DPU technology to build chips for edge-of-cloud applications and for on-device inference. The diagram above shows the DPU roadmap. For the next generation at 7nm they have partnered with Broadcom to do the physical design. I'm not sure if that is a "know your limitations" decision or whether there is a stronger business arrangement. They also plan a smaller chip for edge/fog applications, with just 2,000 processing elements. Obligatory finger for size, although these days you can fit so many gates on a square millimeter at 7nm that fingers aren't very informative. Tantalizingly, he finished by saying:

Partners are enabled…stay tuned for some announcements in the coming weeks.

Arm Automotive

Arm talked about their already announced (I believe) Cortex-A76AE. Govind Wathan talked about Enabling a Mixed Safety-Criticality Future. The A76AE is a multicore processor. The individual cores are paired up, but each pair can either run independently or be put into lockstep mode. The above diagram shows the basic idea. In lockstep mode, the outputs of the two processors are continuously compared, and if there is any difference, indicating some sort of random failure of one of the processors, an interrupt is generated. What is done with that interrupt remains a little unclear, since with two processors it is usually not clear which is the rogue and which is good. In any case, Arm "does not dictate how recovery of a fault is managed, but typically it could be an automatic reset, plus reporting capabilities."

I am a little skeptical of this whole approach, purely on the basis that complexity is the enemy of safety. Having lots of cores that might be running in lockstep and might not be, and potentially even being switched between those two modes, seems like yet one more area that might fail and require a lot of functional safety analysis. The more common approach, I believe, is to have a big fast multicore processor, and a completely separate safety processor that is small and built very carefully: triply redundant, physically spread across the chip, and so on. But Arm is big, smart, and influential, so this approach might become widespread.
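Purely to illustrate the mechanism (this is a toy model of my own, not Arm's implementation), the lockstep-compare idea amounts to something like the following: both cores of a pair run the same work, a comparator checks the outputs every step, and a mismatch raises a fault without knowing which core is the bad one.

```python
import random
from typing import Callable, Iterable

class LockstepPair:
    """Toy model of a pair of cores in lockstep: both execute the same work,
    their outputs are compared every step, and any difference raises a fault.
    The comparator only knows the outputs differ; it cannot tell which of
    the two cores is the faulty one."""

    def __init__(self, step: Callable[[int], int], on_fault: Callable[[int], None]):
        self.step = step            # the work both cores perform each cycle
        self.on_fault = on_fault    # "interrupt" handler; the recovery policy is up to you

    def run(self, inputs: Iterable[int], flip_bit_at: int = -1) -> None:
        for cycle, value in enumerate(inputs):
            out_a = self.step(value)
            out_b = self.step(value)
            if cycle == flip_bit_at:          # inject a random fault into core B
                out_b ^= 1 << random.randrange(32)
            if out_a != out_b:                # the lockstep comparator fires
                self.on_fault(cycle)

def handle_fault(cycle: int) -> None:
    # One plausible policy: automatic reset of the pair plus error reporting.
    print(f"lockstep mismatch at cycle {cycle}: resetting the pair")

pair = LockstepPair(step=lambda x: x * x + 1, on_fault=handle_fault)
pair.run(range(10), flip_bit_at=7)
```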
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.