How to Optimize Your CNN

$
0
0
Convolutional neural nets (CNNs) are not programmed in the traditional sense, but rather they are trained. The challenge in doing this is that you need a lot of good data that is already classified to serve as the training material. The process is not that different from some of the early training your brain got. Your parents probably had picture books with pictures of cats, dogs, cows and so on, and would tell you what they were (and probably what noise they made, but we'll just focus on visual identification here). One dataset that is widely used for training is the German Traffic Sign Recognition Benchmark (GTSRB). This contains over 50,000 European traffic sign images ranging in size from 15x15 to 223x193 pixels. The signs are not perfect examples like the ones in a driver's handbook; they are partially obscured, defaced with graffiti, and otherwise damaged. I assume the dataset was put together by hand, since the classification is relatively easy to do, there being only a few dozen different types of signs. Since this is a standard benchmark with well-defined answers, it is regularly used to "score" different recognition approaches.

This is an oversimplified benchmark, since in a real ADAS (or autonomous vehicle) the signs are moving across the image field (obviously the signs themselves are not really moving), they have to be found within the image, and they change in size as the vehicle approaches. Nonetheless, it is a base case. There is also more information available, since the camera in a vehicle captures full-motion video, meaning that recognition does not need to be done on a single frame but can combine multiple frames to improve accuracy. CNNs are now better than humans at the static version of this task.

In the cloud, developers are typically interested in absolute accuracy and are not especially concerned with how much compute power is required. But in the embedded world, such as in a real vehicle, the cost of the silicon (and other factors like reliability) is part of the tradeoff. The challenge is how to make major reductions in the cost, measured mainly by power and memory requirements, with the smallest possible impact on accuracy. There are lots of parameters that can be altered that affect both efficiency and accuracy:

1. Number of layers
2. Connectivity between layers
3. Dimensions of the convolution kernel per layer (n x n x depth)
4. Data types for weights and data: 32-bit float, 16-bit fixed, 8-bit fixed
5. Number of feature maps per layer
6. Overall efficiency of the underlying processor/silicon

The methodology used is simple in concept and complex in execution. Step 1 is to train the network and optimize it to get the best recognition (or acceptable recognition, depending on requirements) in the usual way, probably in the cloud. Step 2 is to vary parameters 1-5 above (there is probably not much that can be done about 6 in this context), using statistics and linear algebra as a guide and using the validation step as the test for convergence. There is a lot of redundancy in the filter weights, and this approach exploits it.

The diagram above shows a simplified version of the process. The top line shows the starting point, which achieves 99.26% recognition but at a cost of 366 MMACs (millions of multiply-accumulate operations). Each subsequent line shows another datapoint. Surprisingly, the third line achieves an even better recognition percentage than the first, at a cost of only 21 MMACs. We can go further: the last line requires only 5 MMACs, but the recognition rate has fallen unacceptably, below 99%. So given these choices, the third line would be the best.
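To give a feel for where a number like 366 MMACs comes from, here is a minimal sketch that estimates the multiply-accumulate count of a CNN from exactly the parameters listed above (kernel dimensions, number of feature maps per layer, and so on). The layer sizes in it are made-up placeholders for illustration, not the actual network behind the figures in the diagram.

```python
# Rough per-frame MMAC estimate for a stack of convolution layers.
# The layer dimensions below are illustrative placeholders, not the
# network that produced the 366-MMAC starting point quoted above.

def conv_macs(out_h, out_w, kernel, in_maps, out_maps):
    """Each output pixel of each output feature map needs a
    kernel x kernel x in_maps dot product."""
    return out_h * out_w * out_maps * (kernel * kernel * in_maps)

# (output height, output width, kernel size, input maps, output maps)
layers = [
    (28, 28, 5,  3, 32),   # first conv on a small RGB input
    (10, 10, 5, 32, 64),   # second conv after 2x2 pooling
    ( 3,  3, 3, 64, 128),  # third conv after another pooling step
]

total = sum(conv_macs(*layer) for layer in layers)
print(f"Estimated cost: {total / 1e6:.1f} MMACs per frame")
```

Every structural parameter in the list scales this count directly, which is why shrinking kernels, trimming feature maps, or removing layers, and then retraining to check that the validation accuracy holds up, can move a design from hundreds of MMACs down to tens.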
In reality, many more than four choices would be explored, potentially hundreds. The red dots in the diagram above show a more practical exploration. The Y axis shows the recognition rate and the X axis shows the MMAC count, so up and to the left is good (higher recognition, fewer MMACs). If we actually have some requirements, indicated by the black dotted lines, then anything in the upper-left quadrant satisfies them.

There is more than just the number of MMACs to consider. The power and area of an embedded system are also affected by the data precision, which can be reduced from 32 bits to 16 or 8, and sometimes even to 4. Surprisingly, this does not always have a significant effect on the recognition percentage. The precision can also be mixed within the CNN. The secret sauce is how to do the exploration and avoid the "British Museum Problem", whereby you wander around at random and, unless you happen to go to just the right place, miss what you are looking for.

The results are impressive. Optimizing the MMAC count gives a 12X reduction in computation (and thus power). Varying bit widths to optimize coefficient and data storage results in a 13X reduction in memory, saving area (and power). Putting it all together, network dimension optimization gives us a 10X reduction and fixed-point quantization another 10X, for an overall reduction of 100X.
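To make the fixed-point idea concrete, below is a minimal sketch of symmetric uniform quantization of a layer's weights to a chosen bit width. It is a generic illustration of the technique, not the specific quantization scheme behind the numbers above; the per-layer scaling, the symmetric range, and the example weights are all assumptions.

```python
# Minimal sketch of symmetric fixed-point quantization of CNN weights.
# Generic illustration only -- not the specific scheme behind the
# 13X memory reduction quoted above.

def quantize(weights, bits):
    """Map float weights to signed integer codes of the given width,
    and return the values the fixed-point MACs would effectively use."""
    max_abs = max(abs(w) for w in weights) or 1.0
    levels = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    scale = max_abs / levels
    codes = [round(w / scale) for w in weights]   # what gets stored on chip
    approx = [c * scale for c in codes]           # what the arithmetic sees
    return codes, approx

example_weights = [0.31, -0.07, 0.002, -0.85, 0.44]   # made-up values
for bits in (16, 8, 4):
    codes, approx = quantize(example_weights, bits)
    worst = max(abs(w - a) for w, a in zip(example_weights, approx))
    print(f"{bits:>2}-bit codes {codes}  worst-case error {worst:.5f}")
```

In a real flow the bit width becomes one more axis of the exploration, potentially chosen per layer, and every quantized candidate is re-validated on GTSRB so that only the points that stay in the acceptable quadrant are kept.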

Next: Hierarchical Neural Networks
Previous: Power Efficient Recognition Systems for Embedded Applications