Here’s an experiment to try: in a quiet (but crowded) auditorium, drop a stack of plates. Hypothesis: at the first splinters of the crash, all eyes will jerk toward the source of the sound. You probably won’t see many people listening harder to find out what happened. Yes, our hearing gives us a lot of information, too: it can indicate direction, distance, and magnitude, and even give us some idea of what happened (a crash of plates doesn’t sound like a dropped bowling ball). With the information we gather from our senses, we then decide what to do about it: was that crash caused by a fumbling waiter or by an earthquake? The answer makes the difference between a good laugh and a stampede. We live in a visual world; looking first is simply what people do.

Computer “vision” is how we refer to a computer measuring something in its environment. That “something” can come from any number of sensors detecting input from the electromagnetic spectrum (radio waves to gamma rays, including visible light), acoustic information (sound), chemical changes (chemicals and reactions), the flow and pressure of some substrate (air, water, jellybeans), magnetic forces, physical forces (how hard a ball is thrown, how hot the water is)… the list is practically endless. If it can be measured, there is likely some kind of computer sensor that exists to detect it.

And for the system to “learn”, it must be able to “understand” the data it has sensed. In other words, the system should be able to tell the difference between an earthquake and a bowling ball.

“You can program a computer to tell the difference between cats and dogs? My beagle can do that!”
–Jitendra Malik, keynote speaker at the Embedded Vision Summit in Santa Clara, CA, May 2, 2017

Why Use Machine Learning?

In pattern- and image-recognition applications, the best correct detection rates (CDRs) have been achieved using convolutional neural networks (CNNs), as opposed to traditional models of image and pattern recognition. For example, CNNs have achieved a CDR of 99.77% on the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits, a CDR of 97.47% on the NYU Object Recognition Benchmark (NORB) dataset of 3D objects, and a CDR of 97.6% on approximately 5,600 images of more than 10 objects. Not only do CNNs give the best performance among detection algorithms, they have even outperformed humans in many cases, such as classifying objects into fine-grained categories, analyzing skin blemishes for cancer, and identifying critical design flaws in architectural structures.

Neural Networks vs. Convolutional Neural Networks

In the traditional model of pattern/image recognition, a hand-designed feature extractor gathers sensor data and eliminates irrelevant variability. The extractor is followed by a trainable classifier, a standard neural network that classifies feature vectors into classes. In a CNN, the convolution layers play the role of feature extractor. These feature extractors are not designed by engineers or fixed by preset parameters; the convolution filter kernel weights are determined as part of the training process. The greater the number of layers, the “deeper” the learning system becomes. Each feature in a layer receives its inputs from a set of features in the previous layer, called a local receptive field. Layer by layer, these local receptive fields extract elementary visual features such as oriented edges, end-points, and corners, which the higher layers then combine into a classifiable set of data.
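To make the idea of a convolutional feature extractor concrete, here is a minimal sketch in plain Python/NumPy (illustrative only, and not code for any product discussed here). It slides a single 3x3 kernel across an image so that every output value is computed from one local receptive field; the hand-picked kernel responds to vertical edges, whereas in a real CNN these weights would be learned during training and many filters would run in every layer.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a small kernel over the image; each output value is computed
    from a local receptive field the same size as the kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]    # the local receptive field
            out[y, x] = np.sum(patch * kernel)   # one feature response
    return out

# Hand-picked kernel that responds to vertical edges (for illustration only).
# In a CNN, these weights are learned during training, not chosen by hand.
vertical_edge = np.array([[1, 0, -1],
                          [2, 0, -2],
                          [1, 0, -1]], dtype=float)

image = np.random.rand(8, 8)                               # stand-in for a grayscale image
feature_map = np.maximum(conv2d(image, vertical_edge), 0)  # ReLU-style activation
print(feature_map.shape)                                   # (6, 6): one feature map from one filter
```

Stacking such layers, with nonlinearities between them, is what lets the higher layers combine edges, end-points, and corners into progressively more abstract, classifiable features.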
To make a computer classify a set of data, the process begins with a sensor gathering the relevant data (light, sound, flow, force, etc.); then the processing by the neural network begins. These steps (plus the input stage) make up the three Rs of a CNN:

Reorganization. The system must segment the input into usable pieces: pixel by pixel, sound wave by sound wave, byte by byte.

Reconstruction. The system must identify the “edges” and other points of interest, using layer upon layer of filters to extract them.

Recognition. Using statistical probability, the system classifies the image or other sensory data into a category that the end user can understand. This is a more nuanced task than performing a simple linear regression and reporting a positive or negative result. Using the information that the system already “knows” about the input data (that is, its training), the system extrapolates what the image (or other sensed data) means.

Typical Block Diagram of a CNN
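As a concrete illustration of that pipeline, here is a minimal sketch written with PyTorch and sized for MNIST-style 28x28 digit images; all layer sizes and the class count are assumptions chosen for illustration, not a description of any network or product mentioned in this article. The convolution layers act as the trainable feature extractor, and a small fully connected head performs the final classification into categories.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately tiny CNN: learned convolution filters extract features,
    and a fully connected layer classifies the extracted features."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learned 3x3 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)        # feature extraction with learned kernels
        x = x.flatten(start_dim=1)  # reshape feature maps into one vector
        return self.classifier(x)   # one score per class

model = TinyCNN()
batch = torch.randn(4, 1, 28, 28)           # four stand-in grayscale images
probs = torch.softmax(model(batch), dim=1)  # per-class probabilities
print(probs.shape)                          # torch.Size([4, 10])
```

Training adjusts the convolution kernel and classifier weights offline; at deployment, only the forward (inference) pass shown here has to run on the embedded device.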
Current CNN Trends

Current CNN trends highlight three significant challenges to moving forward:

Increased computational needs. We can perform image-recognition tasks and train neural networks in the cloud, where compute capacity is theoretically unlimited. But that value is greatly diminished if you must go back to the cloud for every recognition task. The real question becomes: how do we embed the task of recognition into the application itself?

Fast-evolving networks. Considering the speed at which neural networks are changing and developing, how do we pick a platform today for a product that may ship two years from now? Five years from now? With new neural network architectures appearing constantly, there is no guarantee that what works now will still work in a future system. Manufacturers’ biggest fear is that by the time their product comes to market, the platform they picked will be as useless as a hand-crank starter.

Changing uses for neural networks. When machine learning was first introduced in the 1950s, it was a subject of keen interest, but it was limited by the computational requirements of applying the technology to the real world. Only now that large amounts of computing power are readily available, and machine learning systems have scaled up to meet those compute requirements, have neural network systems become applicable across industries. The exciting part of the embedded vision technology story as it relates to CNNs is that we’re currently in a virtuous circle: new technology spurs customer innovation, which begets new technology, and so on.

These trends only continue to grow. The sensor market alone ($11.5B in 2016) is forecast to grow at a 10.5% CAGR over the 2016-2022 period, according to Yole Développement’s 2017 CIS report.

The Cadence Solution: The Tensilica Vision C5 DSP

Optimized for vision, radar/lidar, and fused-sensor applications, the Cadence Tensilica Vision C5 DSP is the industry’s first DSP dedicated to neural network processing, architected from the ground up for multi-processor clusters. Achieving unprecedented speeds at low power, the Vision C5 DSP meets the requirements of advanced neural network technology. Built on almost twenty years of Tensilica Xtensa multi-processor experience, this solution accelerates all neural network layers, not just the convolution functions, leaving the DSP free to run other applications. With the Vision C5 DSP, these challenges are answered:

Embedded computational requirements. In every market, a vast amount of data must be processed on the fly; a rough back-of-the-envelope estimate of that load appears at the end of this article. While the training of a neural network may take place mostly offline, the applications that use it must be embedded within their own systems, regardless of market. No matter the application, the data must be processed as instantaneously as a car accident happens. Just as we don’t carry datacenters around with us in our cars or on our devices, we also can’t carry large power sources with us wherever we go. The Vision C5 DSP is optimized for neural networks, wasting neither time nor power.

Fast-evolving environments. As neural network processing develops, products that use neural networks and are in development now may need reprogramming by the time they ship. The platform must be able to grow with the industries implementing it. Simply put, the platform must be future-proof.

Changing applications of the technology. Even though the word “vision” is in its name, the Vision C5 DSP is designed for any kind of neural network processing, whether the sensors are gathering data about light or about the flow of jellybeans through a factory. The DSP is architected for multi-processor clusters. Whether your neural network is for mobile, surveillance, automotive, or anything in between, this solution is flexible enough for all applications, from the minute scale to the grand.

Tensilica Vision C5 DSP Block Diagram

The Cadence Tensilica Vision C5 DSP is the industry’s first standalone, fully dedicated neural network DSP, architected for multi-processor clusters. Taking a holistic view of System Design Enablement, Cadence brings the Vision C5 DSP to the table, allowing engineers to produce elegant designs with shorter verification cycles, develop software that works with the hardware, and stand out with new product leadership.
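To put the “vast amount of data” in perspective, here is a small, purely illustrative Python estimate of the multiply-accumulate (MAC) load of just one convolution layer. The layer dimensions and frame rate below are assumed values chosen for illustration, not specifications of the Vision C5 DSP or any other product.

```python
def conv_layer_macs(out_h, out_w, out_channels, in_channels, kernel_h, kernel_w):
    """Multiply-accumulate operations for one forward pass of one convolution layer."""
    return out_h * out_w * out_channels * in_channels * kernel_h * kernel_w

# Assumed example: a 3x3 convolution producing 64 feature maps from 32 input
# channels at 224x224 resolution (a common CNN input size), at 30 frames per second.
macs_per_frame = conv_layer_macs(224, 224, 64, 32, 3, 3)
fps = 30
print(f"{macs_per_frame:,} MACs per frame")                            # ~0.92 billion
print(f"{macs_per_frame * fps / 1e9:.1f} GMAC/s for this one layer at {fps} fps")
```

A full network stacks many such layers, which is why always-on embedded inference pushes designers toward dedicated neural network processors rather than general-purpose cores.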