In yesterday's post I gave the details of the TPU hardware from Cliff Young's keynote on the second day of the recent Linley Processor Conference. Today, I'll look at what Cliff said about software and co-design, creating both the hardware and the software as a conceptual whole.

Codesign

Cliff started by talking about the instruction set architecture (ISA), which is a contract between the hardware and the software. A classic tradeoff from the 90s was whether scheduling should be done in the compiler (at compile time, obviously) or in the hardware (at runtime). The compiler approach led to VLIW machines like Itanium; the hardware approach led to out-of-order (OoO) execution. The answer really turned out to be both: build OoO hardware but also do careful instruction scheduling. That codesign debate raged for 15-20 years, and that was just a single interface. Today, we have domain-specific architectures. We probably want a compiler. We probably want libraries, since the people who build the machines know best how to use them. We need to codesign from the physics all the way up to the applications, which is hard.

Cliff said that there are three big codesign themes in TPU design:

- Systolic arrays.
- Reduced-precision numerics.
- Pods as tiled architectures at datacenter scale.

Cliff said that systolic arrays are a two-dimensional generalization of a pipeline. They have entirely local communication, "which is great, since we can't even get through 1mm of interconnect within one clock cycle." The basic systolic technology was invented in the 1970s but "they only got to have 8 ALUs, whereas today we have 64K. It just wasn't a good match for the implementation technology of the time." Like a heartbeat, systolic arrays have alternating computation and communication phases (there is a small code sketch of this idea below). Systolic arrays go a long way toward addressing computing's energy problem: an 8-bit add is just 0.03pJ in 45nm, yet an add on a general-purpose CPU takes 70pJ, over 2,000 times as much.

I pulled Cliff's discussion of reduced-precision numerics, in particular bfloat16, into yesterday's post about the hardware implementation in TPU v2. But he emphasized that bfloat16 is a good example of codesign, requiring support all through the system: from the hardware, through the compiler and libraries, and perhaps up into the neural network frameworks (a small sketch of the format appears below).

Pods require TPUs to be designed from the beginning to be networked. The TPUs are tiled within the chips, but those chips are then tiled to build supercomputers. The big challenge is data parallelism, taking the weights and sharing them across multiple TPUs, which can kill you if you get the interconnect wrong (a toy data-parallel sketch follows below). There is an upcoming paper at NIPS 2018 discussing this, with the goal of handling billion- and trillion-parameter models, exploiting the fact that there are 4TB of HBM per pod. The same single-program-multiple-data (SPMD) code then runs on each core. The approach apparently shows an accuracy improvement.

Performance

There is a challenge in measuring and comparing performance; benchmarking is not where it needs to be. As Cliff put it, people tend to say something like: "Hey, I got a ResNet-50, it might not be the same code as yours, but we'll call it ResNet-50 anyway, and here are our numbers." Machine learning performance needs more than that, and needs reproducible benchmarks via open-source implementations. DAWNBench was proposed by Stanford as both a benchmark and a competition, with new results in April. It covers both training time and the cost of training; a back-of-the-envelope version of that arithmetic appears below, after a few sketches of the codesign themes.
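To make the first codesign theme concrete before getting to the numbers, here is a minimal NumPy sketch (mine, not anything from the talk) of an output-stationary systolic array doing a matrix multiply: each processing element only talks to its immediate neighbours, and every beat alternates a communication step with a multiply-accumulate step. It illustrates the general technique only, not the TPU's actual matrix unit (which is a much larger, weight-stationary design).

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy, cycle-by-cycle model of an output-stationary systolic array.

    Each processing element (PE) at (i, j) only sees values arriving from its
    left and upper neighbours, multiplies them, accumulates locally, and
    passes the inputs along -- purely local communication. Illustrative
    sketch only, not the TPU microarchitecture.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))       # one accumulator held inside every PE
    a_reg = np.zeros((n, m))   # the "A" value currently sitting in each PE
    b_reg = np.zeros((n, m))   # the "B" value currently sitting in each PE

    for t in range(k + n + m - 2):   # enough beats to drain the skewed streams
        # Communication phase: A values shift one PE right, B values one PE down.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Edge PEs pick up the skewed input streams entering the array.
        for i in range(n):
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
        # Computation phase: every PE does one multiply-accumulate in lockstep.
        C += a_reg * b_reg
    return C

A = np.random.randn(3, 4)
B = np.random.randn(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```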
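On the reduced-precision theme, this small sketch shows what the bfloat16 format keeps and throws away: the float32 sign and 8-bit exponent survive (so dynamic range is preserved) but only 7 of the 23 mantissa bits remain. The round-to-nearest-even choice here is mine for illustration, not a description of the TPU hardware.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of each IEEE float32.

    bfloat16 keeps float32's sign bit and 8-bit exponent (same dynamic range)
    but only 7 of the 23 mantissa bits. This sketch rounds to nearest-even on
    the discarded bits and ignores NaN edge cases.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1                       # lowest bit that will be kept
    bits = (bits + lsb + 0x7FFF) & 0xFFFF0000    # round, then zero the low 16 bits
    return bits.view(np.float32)

x = np.array([1.2345678, 3.0e38, 1.0e-3], dtype=np.float32)
print(to_bfloat16(x))   # only about 2-3 significant decimal digits survive
```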
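For the pods theme, here is a toy data-parallel SPMD step, again my own NumPy sketch with a plain linear model standing in for a neural network: every "core" runs the same program on its own shard of the batch, and a simulated all-reduce averages the gradients so that each core's copy of the weights takes an identical update. On a real pod, that all-reduce is exactly the traffic that makes the interconnect so critical.

```python
import numpy as np

def local_grad(w, x_shard, y_shard):
    """The single program that every core runs on its own shard of data."""
    pred = x_shard @ w
    return x_shard.T @ (pred - y_shard) / len(y_shard)

def data_parallel_step(w, x, y, num_cores=8, lr=0.1):
    """One SPMD step: shard the batch, compute local gradients, all-reduce
    (here just a mean), and apply the same update to every weight replica."""
    x_shards = np.array_split(x, num_cores)
    y_shards = np.array_split(y, num_cores)
    local_grads = [local_grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    g = np.mean(local_grads, axis=0)   # stand-in for the pod's all-reduce
    return w - lr * g

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
y = x @ np.array([1.0, -2.0, 3.0, 0.5])
w = np.zeros(4)
# With equal shards, the data-parallel step matches a single-core step exactly.
assert np.allclose(data_parallel_step(w, x, y), w - 0.1 * local_grad(w, x, y))
```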
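Finally, to connect raw throughput to DAWNBench-style results, here is a quick back-of-the-envelope calculation. The dataset size and epoch count are the usual ImageNet-1k/ResNet-50 conventions rather than figures from the talk, and the throughputs are the ones quoted in the numbers that follow; the point is simply that training time is images-to-process divided by images-per-second, and the DAWNBench cost metric is that time multiplied by the public-cloud hourly price.

```python
# Back-of-the-envelope: how throughput turns into ResNet-50 training time.
# Dataset size and epoch count are the usual ImageNet-1k / ResNet-50
# conventions, not figures from the talk; throughputs are those quoted below.
IMAGENET_TRAIN_IMAGES = 1_281_167
EPOCHS = 90
total_images = IMAGENET_TRAIN_IMAGES * EPOCHS

for name, images_per_second in [("single TPU v2", 3_250),
                                ("TPU v2 half-pod", 77_392)]:
    hours = total_images / images_per_second / 3600
    print(f"{name}: ~{hours:.1f} hours (~{hours * 60:.0f} minutes)")
# single TPU v2:   ~9.9 hours (~591 minutes)
# TPU v2 half-pod: ~0.4 hours (~25 minutes)
# The DAWNBench cost metric is then this training time multiplied by the
# accelerator's public-cloud hourly price.
```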
On one benchmark, a Cloud TPU can train in about 8 hours (a job that not uncommonly takes multiple GPU-weeks) at a cost of under $50 by the DAWNBench cost metric. That has improved over the last 6 months and is now down to $35 (partly due to improved software, partly due to lower rental costs). At pre-emptible pricing it is $11; pre-emptible means you might get thrown off the TPU if someone with higher priority comes along and needs it. That's nice, but as Cliff put it: "That the same hardware can get twice as fast in a 6-month period means we haven't quite worked out how to best use our machines."

The pod numbers are very fast. A single TPU v2 can train ResNet-50 at 3,250 images per second, for a full training time of 9 hours and a cost of $59. That is down to just over 8 hours today, for as little as $13 at pre-emptible pricing. A TPU v2 half-pod can process 77,392 images/second, for a training time of just 30 minutes (24 minutes without checkpointing). With some clever tricks that have come along since the DAWNBench competition, they have got ResNet training on a TPU v2 down from $40 to $17, and with pre-emptible pricing down to $5.

MLPerf

MLPerf is a broad ML benchmark suite for measuring the performance of ML frameworks, ML hardware accelerators, and ML cloud platforms. It is driven by researchers from Harvard, Stanford, UC Berkeley, and more, and supported by over 30 companies such as Google, Intel, NVIDIA, Arm, AMD...and, I see from the website, EDA: all three of Cadence, Mentor, and Synopsys are supporting companies. The basic idea is to build SPEC-type benchmarks for machine learning. The philosophy is:

- Agile development, because ML is changing rapidly.
- Serve both the commercial and research communities.
- Enforce replicability to ensure reliable results.
- Use representative workloads, reflecting production use cases.
- Keep benchmarking effort affordable (so all can play).

V0.5 submissions were due "around now", Cliff said. Results "soon".

Takeaways

- Danger: the end of all three of Moore's Law, Dennard scaling, and the standard architectural tricks for CPU performance. The limits of CMOS are in sight, and Intel's 10nm challenges and GF's 7nm exit are signs.
- Opportunity: there is a revolution in machine learning, with economic demand for ML accelerators and room for architectural and codesign experimentation. Perhaps we can use ML to design better ML accelerators.
- Irony: exponential demand for ML computation arrives just at the end of Moore's Law. Efficiency will thus matter a lot, and there are huge opportunities for HW/SW codesign in building TPUs and other domain-specific architectures (DSAs).

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.