At the recent Autosens conference in Detroit, Cadence's Michelle (Xuehong) Mao presented on the challenge of bringing deep neural nets to embedded systems. I covered the first part of her talk yesterday in CactusNet: One Network to Rule Them All. As a reminder, Michelle pointed out that there are four things that can be done to attack the problem:

- Optimize the network architecture
- Optimize the problem definition
- Minimize the number of bits per computation
- Use optimized DNN hardware

Today I look at the last three.

Optimize the Problem Definition

The next arrow in the quiver of techniques for reducing a cloud-based DNN to an embedded one is to reduce the problem size. For example, the KITTI road segmentation dataset is a classification dataset for identifying what is "road" and what is everything else. Obviously, a key part of autonomous driving is identifying the road, rather than driving off into the bushes. Segmenting an image naively requires 466K classification problems, one per pixel (the images are 375x1242, or 465,750 pixels). By exploiting correlations, this can be reduced by 22X, to roughly 21K classifications. Whereas other nets require 100+ GMAC/s, the Cadence approach needs just over 10 GMAC/s.

Minimize the Number of Bits per Computation

In the cloud, where networks are normally trained, almost all the computation is done in 32-bit floating point. It would seem obvious that drastically reducing the precision would have a major impact on recognition rates, but in fact this is not the case. Reducing to 8 bits often has no effect on recognition at all and, perhaps more surprisingly, reducing to 4 or even 2 bits sometimes has only a minimal effect.

There are two approaches to quantization. It can be done post-training: the training is done purely in the floating-point world, and the resulting net is then reworked to reduce precision. The other approach is to quantize during training itself, which typically ends up with slightly different weights. The details are beyond the scope of a blog post like this, but the table below shows some of the results ("FLP" is 32-bit floating point and "8b FXP" is 8-bit fixed-point coefficients and data). The CactusNet numbers are 2X better than ResNet-50's and 10X lower than VGG-19's.
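The post-training flavor is easy to make concrete. Here is a minimal NumPy sketch (my illustration, not Cadence's actual flow) of per-tensor symmetric quantization of float32 weights to 8-bit fixed point, with a round-trip error check; the scaling scheme and the toy weight tensor are assumptions for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Post-training, per-tensor symmetric quantization:
    map float32 weights onto int8 with a single scale factor."""
    scale = np.abs(w).max() / 127.0                  # largest |weight| -> 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for accuracy checks."""
    return q.astype(np.float32) * scale

# Toy example: weights shaped like a small convolutional layer.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(64, 3, 3, 3)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())     # bounded by ~scale/2
```

Quantization during training works differently: the same rounding is applied inside the training loop (with the gradient passed straight through it), so the optimizer converges to weights that already tolerate the precision loss, which is why the two approaches end up with slightly different weights.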
Use Optimized DNN Architecture

The final approach to reducing the power and area of a cloud solution down to embedded levels is to use a specialized architecture optimized for DNNs. For embedded, a good DNN-optimized processor is designed to:

- Minimize pJ/MAC: the multiply-accumulate (MAC) is the fundamental building block of the whole calculation, so each one needs to be energy efficient.
- Minimize data movement: moving data consumes power without advancing the computation, so the less the better.
- Provide a large, scalable number of MACs: just as with multi-core CPUs, the only alternative to having enough MACs is to push the clock rate up, which is very power hungry.
- Ensure high utilization of resources: idle resources don't waste much power, but they are expensive silicon real estate sitting unused.

If the network were fixed, it would be possible to design at the RTL level and end up with a solution as close to optimal as possible, albeit a very expensive one in terms of design cost. But in practice the network is never fixed. The whole field of DNNs is advancing all the time, so even if the application remains unchanged, new techniques will keep appearing in the literature. A programmable solution is the only way to go.

Of course, in principle, you could design your own processor, but why would you? An optimized DNN architecture means something designed for the task, not a general-purpose CPU, general-purpose DSP, or general-purpose GPU (also known by the catchier title GPGPU). They all have application domains where they are a good fit, but this is not one of them. You need a DSP specialized for neural networks, such as...

The Tensilica Vision C5 DSP for Neural Networks

The headline number for this processor is that it can do 1 TMAC/s in under 1mm² (at 16nm; obviously less in 10nm or more in 28nm). It can be configured as 1024 8-bit MACs or 512 16-bit MACs. Under the hood it is a 4-way VLIW architecture with 128-way 8-bit SIMD. It has integrated DMA with a 1024-bit memory interface and dual load/store. It is optimized to run all layers of a DNN, including, but not limited to, the convolutional layers. A high-level block diagram is shown below.
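As a rough sanity check on those headline numbers, here is the back-of-envelope arithmetic (mine, not Cadence's spec sheet; the clock rate and the pJ/MAC figure are assumed placeholder values):

```python
# Back-of-envelope check on the Vision C5 headline throughput.
macs_per_cycle = 1024            # 8-bit MAC configuration, from the post
clock_hz = 1.0e9                 # assumed ~1 GHz clock (not quoted in the post)
throughput = macs_per_cycle * clock_hz
print(f"throughput: {throughput / 1e12:.2f} TMAC/s")      # ~1.02 TMAC/s

# Why pJ/MAC is the first design goal above: power = energy/MAC x MAC rate.
pj_per_mac = 1.0                 # placeholder for illustration only
power_w = throughput * pj_per_mac * 1e-12
print(f"power at {pj_per_mac} pJ/MAC: {power_w:.2f} W")    # ~1 W per pJ/MAC
```

At this rate, every extra picojoule per MAC costs about another watt, which is why minimizing pJ/MAC and keeping the MAC array highly utilized sit at the top of the design-goal list.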
Summary

For embedded purposes, DNNs need to improve in power efficiency by three or four orders of magnitude (a factor of 1,000X or 10,000X... or more). DSP techniques can contribute one or two orders of magnitude (10X to 100X). The best processor is the Tensilica Vision C5 DSP, the industry's first complete, standalone, NN-optimized DSP IP core for the surveillance, automotive, drone, mobile, and wearable markets.

More information on Cadence's Tensilica Vision processors is on the Vision DSP product page.