In recent discussions with leading automotive, AR/VR, and semiconductor providers, I was struck by how quickly artificial intelligence and computer vision had converged to become so core to their business value. It also strengthened my appreciation for the challenges they faced in developing and deploying these nascent technologies.

Over a few short years, artificial intelligence (AI) has moved from fiction and academic research into our personal and professional lives. In machine learning specifically, the diversity of markets and applications that have recently been enabled is truly impressive.

In medicine, a collaboration between Sutter Health and the Georgia Institute of Technology used machine learning to analyze and interpret electronic health records to predict heart failure up to 9 months in advance. Similar approaches are now being broadly applied within cybersecurity to address ever more sophisticated malware threats and within finance for algorithmic trading and fraud detection. This technology is also behind interactive IoT capabilities in products such as the Amazon Echo, the Nest thermostat, and Apple’s Siri, and can even turn you into a modern Van Gogh.

These capabilities are not only technically impressive but are expected to drive an era of incredible growth. Market research firm Tractica forecasts that annual worldwide AI revenue will increase from $643.7 million in 2016 to $38.8 billion by 2025.

Figure 1. Artificial Intelligence Revenue, World Markets: 2016-2025

The Rise of Machine Learning
So what triggered the transition of machine learning from decades of algorithmic research to commercial deployment? I see three main factors:

  1. Processing performance. Machine learning algorithms – especially training deep neural networks – require significant processing performance. Processor and FPGA semiconductor companies have pushed the silicon state of the art; NVIDIA’s latest Pascal GPU can deliver over 21 TeraFLOPs of 16-bit floating point performance, cutting training times in the datacenter from weeks to days. Inference, or deployment of trained deep neural network algorithms, can now effectively run on embedded or IoT node devices for image recognition or natural language processing and not only in the cloud.
  2. Inexpensive sensors. The smartphone industry has been the catalyst for driving the availability of high quality but inexpensive imaging, motion, and environmental sensors. This has enabled practical and affordable integration of sensors into IoT devices for wearables, connected cities and homes, surveillance, AR/VR, and drones – to name a few – that will generate an enormous amount of data.
  3. Enormous amounts of data. According to Cisco, by the end of 2016 annual global IP traffic will pass 1 zettabyte (10^21 bytes) and by 2020 global IP traffic will reach 2.3 ZB per year, with video being 82% of all traffic. To put it into perspective, global IP traffic will reach 511 terabits per second (Tbps) in 2020, the equivalent of 142 million people streaming Internet high-definition (HD) video simultaneously, all day, every day. This, combined with similar growth in data from internet search, consumer transactions, electronic medical records, and other sources, provides an incredibly rich data set for training machine learning systems.

Adding to this is the emergence of open source machine learning frameworks such as TensorFlow and Caffe, which will accelerate the adoption and deployment of machine learning in many more applications not even thought of today.

Machines that Perceive
Computer vision applies digital processing and intelligent algorithms on images or video to optimize imaging quality and interpret actionable meaning, and it has been a sweet spot for machine learning techniques, especially deep learning. In this hot area, few use cases have captured the imagination and investment dollars more than self-driving vehicles.

The promise of autonomous vehicles is at once thrilling for its ability to save lives, reduce costs, and save time, deflating to those who want to drive, and daunting due to regulatory complexity. Still, given the sheer amount of investment in this space, there is no question it will come. In 2016 alone, Intel stated it would invest $250M, Uber and Volvo said they would contribute $300M, and Uber invested $680M in Otto, to name a few of the many examples. But we’re still years away from fully self-driving cars being commonplace, with many industry experts and analysts seeing the tipping point being reached no sooner than the early 2020s.

The path to SAE Level 5 full autonomous driving is being built on the shoulders of the Advanced Driver Assistance Systems (ADAS) being developed and broadly deployed into cars and trucks today, providing Level 1 (driver assist, e.g. lane departure warning) and Level 2 (partial automation, e.g. lane keeping) capabilities. These systems employ advanced AI algorithms processing data from many sensors such as cameras, short and long range radar, and ultrasound. Newer sensing technologies, such as LIDAR, are on the horizon and will be integrated once costs come down, dovetailing nicely into the sensor suite necessary for partial and full autonomy.

Today, a significant portion of the ADAS heavy lifting is being done by cameras. Earlier ADAS systems typically deployed one front-facing monocular or stereo camera for driver warning tasks and a “dumb” rear camera. Today, 4-camera systems are being deployed to add surround view and intelligent backup features. Tesla recently announced that it will move to providing an 8-camera system for all new cars and the forthcoming Model 3; these models also include driver-facing cameras for drowsy or inattentive driver detection and future gesture control. The breadth of ADAS tasks expected of the camera system has grown dramatically due to the recent improvements in the optics, sensors, imaging quality, and computer vision processing.

Figure 2. Sensors for ADAS Systems and Autonomous Vehicles (Source: Yole Développement)

These sensors provide a massive amount of data that needs to be processed to perceive and properly respond to the dynamic environment of a high-speed highway or a pedestrian-crowded intersection. Multiple classical and machine learning based approaches to image and signal processing are employed to convert the digital bits representing a video frame or radar signal into a model of the world surrounding the car that can then be intelligently acted on, such as applying the brakes in an emergency stop from 60 mph when a deer is in the road in front of you. A human with good reaction time can stop an average car at that speed in about 270 ft., assuming a fast 1 second reaction time and about 4.5 seconds of braking time. We expect the car to do better.

The ADAS computer vision system typically needs to process 7 frames at 2MP coming in at 30 fps before it can detect that there is an obstacle in the car’s path. That takes about a quarter of a second, saving about ¾ of a second in reaction time and stopping the car in about 200 ft. Having “situational awareness” requires the system to do that across all the cameras and other sensors at once. This requires a massive amount of compute and is why you are essentially seeing cars become supercomputers.
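The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above (60 mph, a 1 second human reaction time, a 270 ft. human stopping distance, 7 frames at 30 fps, 2MP frames) and assumes the car travels at constant speed during the reaction interval. The function names are mine, chosen for illustration.

```python
MPH_TO_FTPS = 5280 / 3600  # feet per second for each mile per hour


def detection_latency_s(frames: int, fps: float) -> float:
    """Time needed to capture the given number of frames."""
    return frames / fps


def distance_ft(seconds: float, speed_mph: float) -> float:
    """Distance covered at constant speed (assumed: no braking yet)."""
    return seconds * speed_mph * MPH_TO_FTPS


def pixel_rate(megapixels: float, fps: float, cameras: int = 1) -> float:
    """Raw pixels per second entering the vision pipeline."""
    return megapixels * 1e6 * fps * cameras


speed_mph = 60
human_reaction_s = 1.0
human_stop_ft = 270.0  # figure quoted in the text

machine_latency_s = detection_latency_s(frames=7, fps=30)
saved_s = human_reaction_s - machine_latency_s
saved_ft = distance_ft(saved_s, speed_mph)
machine_stop_ft = human_stop_ft - saved_ft

print(f"detection latency: {machine_latency_s:.2f} s")
print(f"reaction time saved: {saved_s:.2f} s ({saved_ft:.0f} ft at {speed_mph} mph)")
print(f"estimated stopping distance: {machine_stop_ft:.0f} ft")
print(f"pixels per second, one camera: {pixel_rate(2, 30):.0f}")
```

The saved reaction distance is what takes the 270 ft. human figure down to roughly 200 ft., and the pixel rate is per camera, before multiplying across a multi-camera system.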

The Computer Vision Accuracy Challenge
The ongoing goal for developers and integrators of these systems is to enable the vehicle to see more clearly, improve accuracy, and react more quickly to avoid accidents. To see more clearly and farther, developers are increasing image resolution, improving image quality through more capable and tuned image processing, and taking advantage of the latest camera components. They are implementing ever more sophisticated computer vision algorithms based on deep neural networks to improve accuracy of detection and classification. To react more quickly, more powerful processors and performance-optimized code are required. Fortunately, while innovations in machine learning and computer vision algorithms have been advancing rapidly, the semiconductor industry has been delivering the performance needed to run those algorithms.

But a key technical and business challenge for the developers of vision systems is how image quality and computer vision tuning is done today. A typical vision system consists of:

  • a set of optics
  • the sensor
  • an Image Signal Processor (ISP) pipeline, typically implemented in hardware on the system SoC or on a separate SoC, DSP, or FPGA
  • a processor or GPU executing the classical and deep neural network computer vision algorithms
Figure 3. Vision System Components

The ADAS team evaluates the different imaging and processing component options against the requirements to determine the right system architecture. They balance function, form factor, performance, power, thermal, cost, etc. attributes and add some headroom to determine the best configuration within tolerable risk for the success of the project. This is initially done by a combination of reviewing component specs, modeling with Matlab or other environments, and additional software simulation and tuning to achieve a first-pass approximation of whether the system will meet the requirements. They then move to a prototyping phase where the components are put together in a lab and both test and real-world images are passed through the system to determine actual image quality and vision accuracy.

It is during this phase that the optimization challenges become more apparent. A typical ISP alone may have hundreds or thousands of tuning parameters to achieve the best sharpness, noise, and balance needed for the imaging or vision application for that camera. For example, the front-facing camera needs to be tuned for pedestrian/object detection, so edge sharpness would need to be prioritized over visual quality. But the rear camera needs to balance image quality with vision accuracy for the driver. The multi-camera surround view capability must look good, balancing the image stitching and blending with the functions of parking assist and blind-spot detection. In these latter cases, separate ISPs for vision and for image quality may be used in parallel.

To achieve the best results, the actual process involves expert imaging engineers manually setting the parameters, feeding test images in, evaluating the image quality manually with their expert eyes and with tools (e.g. from Imatest or DxO), readjusting the parameters, and iterating hundreds of times. This process can take the team many weeks or months to complete for each camera configuration! If the team wants to evaluate a new sensor, optics, or ISP, the process starts all over. If they don’t have the expertise in house, they need to go to a specialized services company to get it done. This costs most companies hundreds of thousands of dollars per camera if they are trying to achieve commercially acceptable results. Cell phone providers, for example, live and die by the image quality produced by their cameras and therefore spend millions on this task per camera system.

Deep learning based vision algorithms need both training and optimization. Training of the model parameters involves feeding a large quantity of annotated images for supervised learning into the algorithm for it to understand the difference between a sign and a tree, or safe free space for driving vs. obstructed road space. To minimize bias, the training dataset is scrubbed of frames that carry redundant information, which prevents the network from overweighting those attributes, and artificial changes (positional shifts, rotations, etc.) are introduced to help teach the network to recover from deviations from the expected result. The frames are shuffled into random order and fed into the network multiple times to remove order bias.
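The three preparation steps described above (dropping redundant frames, introducing artificial shifts, and re-shuffling on every pass) can be sketched in a few lines. This is a toy illustration, not any team's actual pipeline: frames are tiny nested lists, redundancy is judged by mean absolute pixel difference against the last kept frame, and the threshold, the one-pixel shift, and all function names are assumptions of mine.

```python
import random

Frame = list[list[float]]   # grayscale image as rows of pixel values
Sample = tuple[Frame, str]  # (image, annotation label)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel difference between two same-sized frames."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def drop_redundant(samples: list[Sample], threshold: float) -> list[Sample]:
    """Keep a frame only if it differs enough from the last kept frame."""
    kept: list[Sample] = []
    for frame, label in samples:
        if not kept or mean_abs_diff(kept[-1][0], frame) > threshold:
            kept.append((frame, label))
    return kept


def shift_right(frame: Frame, pixels: int) -> Frame:
    """Positional shift: move content right, padding the left with zeros."""
    if pixels == 0:
        return [row[:] for row in frame]
    return [[0.0] * pixels + row[:-pixels] for row in frame]


def epochs(samples: list[Sample], n_epochs: int, seed: int = 0):
    """Yield the dataset once per epoch, reshuffled and randomly shifted."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = samples[:]
        rng.shuffle(order)
        yield [(shift_right(frame, rng.randint(0, 1)), label) for frame, label in order]


a = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
b = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.1]]  # near-duplicate of a
c = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
samples = [(a, "sign"), (b, "sign"), (c, "tree")]

unique = drop_redundant(samples, threshold=0.05)
batches = list(epochs(unique, n_epochs=3))
print(len(samples), "frames in,", len(unique), "kept,", len(batches), "epochs")
```

A production pipeline would use real image tensors and a richer set of augmentations such as rotations, but the structure of the loop is the same.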

Neural network architectures are also defined by a large set of parameters such as the number of network layers, sequence of layers, convolutional filter size, stride, accuracy, etc. These design parameters critically affect the network’s performance in addition to the trained model parameters. Much like tuning the ISP for image quality, such parameters are selected by deep learning domain experts manually tuning and iterating over the massive set of choices in the design space for weeks or months. This makes developing state-of-the-art, highly accurate vision systems very challenging and, as with the imaging team, the process is manually intensive, costly, and time consuming.
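To give a feel for how quickly that design space grows, here is a small counting sketch. The parameter names and candidate values are illustrative assumptions, not taken from any particular network; the point is only that the number of configurations is the product of the choices, and that it explodes once each layer can be configured independently.

```python
from itertools import product
from math import prod

# Hypothetical network-wide design choices (values are illustrative only).
design_space = {
    "num_layers": [8, 16, 32, 50],
    "filter_size": [1, 3, 5, 7],
    "stride": [1, 2],
    "precision_bits": [8, 16, 32],
}

# One choice per parameter for the whole network: a simple product.
total = prod(len(values) for values in design_space.values())
configs = list(product(*design_space.values()))


def per_layer_count(num_layers: int) -> int:
    """Configurations when every layer picks its own filter size and stride."""
    choices_per_layer = len(design_space["filter_size"]) * len(design_space["stride"])
    return choices_per_layer ** num_layers


print("network-wide configurations:", total)
print("per-layer configurations for 16 layers:", per_layer_count(16))
```

Even this small example cannot be enumerated exhaustively once per-layer choices are allowed, since every candidate has to be trained before it can be scored.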

Figure 4. Deep Neural Network Training (Source: RSIP Vision)

Furthermore, there is limited collaboration in many cases between these two teams to “cross-tune” for best system results due to the enormous parameter complexity and the different domains of expertise. One may create new ISPs tailored for the specific vision application and algorithmic approach but it would be even more time consuming and costly, depending on the application. And frankly, due to demand by the industry heavyweights, the imaging and computer vision expertise required is already in very short supply so broadly scaling this expertise isn’t likely.

This vision optimization problem, which directly impacts differentiation, cost, and time to market, is an area ripe for improvement. A better paradigm involving intelligent and automated optimization is needed.

A Smart Approach to Better Vision
To achieve the best system results, this massive heterogeneous problem needs to be approached holistically to overcome the fundamental limitations of tuning each domain in isolation. For example, tuning the ISP independently for perceived image quality will remove noise in the image but doing so will remove feature activations from the first few layers of the neural network, impacting object detection. Teams of hard-to-hire PhDs can understand what needs to be done but can’t solve this manually. It must be addressed by intelligent optimization leveraging available compute resources to abstract the complexity and accelerate the process.
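The denoising example can be made concrete with a one-dimensional toy. Here a simple box blur stands in for ISP noise reduction and a neighboring-pixel difference stands in for a first-layer edge filter. Both are simplifying assumptions of mine; real ISPs and learned filters are far more sophisticated.

```python
def box_blur(signal: list[float], radius: int = 1) -> list[float]:
    """Moving-average smoothing; the window is clipped at the borders."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def edge_response(signal: list[float]) -> float:
    """Strongest response of a [-1, +1] difference filter over the row."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


# One image row containing a single hard edge (dark region, then bright).
row = [0.0] * 8 + [1.0] * 8

sharp = edge_response(row)
smoothed = edge_response(box_blur(row))

print(f"edge response before smoothing: {sharp:.3f}")
print(f"edge response after smoothing:  {smoothed:.3f}")
```

The smoothed row may look cleaner to a human viewer, but the strongest edge response falls to a third of its original value, which is the kind of signal a detector's early layers depend on.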

In order to achieve order of magnitude improvements, a pragmatic optimization platform should have the following attributes:

  • Goals-based methodology that enables the team to set specific optimization goals holistically for the system, rather than trial and error
  • Advanced machine learning solver technology that learns how the parameter combinations interact and affect progress toward the goals, so it can quickly converge on the optimal solution
  • Integrated analysis capability to determine progress against the optimization goals
  • Management cockpit for the optimization process and goals, test images, regressions, configurations, and manual parameter controls for expert fine-tuning
  • Output of final tuning parameters, custom reporting, and optimized images with computer vision overlays for visual inspection
  • Out of the box support for ISPs, machine learning frameworks, and industry standard image quality goals
  • Ability to deploy on premise or via the cloud for flexibility and scalability
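As a rough sketch of what a goals-based loop looks like, the example below sets weighted goals, scores each candidate parameter set against them, and keeps the best. Everything in it is hypothetical: the two ISP parameters, the toy metric formulas, and the goal weights are invented for illustration, and plain random search stands in for the learning solver described above.

```python
import random

# Hypothetical system-level goals and their relative weights.
GOALS = {"image_quality": 0.4, "detection_accuracy": 0.6}


def toy_metrics(params: dict[str, float]) -> dict[str, float]:
    """Stand-in for running test images through an ISP and a detector.

    The formulas are made up: each metric peaks at a different setting,
    so tuning for one goal in isolation hurts the other.
    """
    denoise, sharpen = params["denoise"], params["sharpen"]
    return {
        "image_quality": 1.0 - (denoise - 0.7) ** 2 - (sharpen - 0.3) ** 2,
        "detection_accuracy": 1.0 - (denoise - 0.2) ** 2 - (sharpen - 0.6) ** 2,
    }


def score(metrics: dict[str, float], goals: dict[str, float]) -> float:
    """Weighted sum of the metrics named in the goals."""
    return sum(weight * metrics[name] for name, weight in goals.items())


def optimize(goals: dict[str, float], trials: int = 500, seed: int = 0):
    """Random search over the parameter space, keeping the best candidate."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {"denoise": rng.random(), "sharpen": rng.random()}
        candidate = score(toy_metrics(params), goals)
        if candidate > best_score:
            best_params, best_score = params, candidate
    return best_params, best_score


best_params, best_score = optimize(GOALS)
print("best parameters:", {k: round(v, 2) for k, v in best_params.items()})
print("combined score:", round(best_score, 3))
```

With these made-up formulas the best combined setting sits between the two single-goal optima, which is the behavior holistic tuning is meant to find. A real platform would replace `toy_metrics` with actual image quality and vision accuracy measurements, and random search with a solver that learns from earlier trials.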

Taking this approach allows you to abstract and accelerate the process dramatically, thereby giving you the option of getting to market faster or evaluating many more combinations of components to reduce BOM cost. It allows you to better leverage your staff by applying your novice engineers to the tuning effort while your in-house experts can focus on the more challenging scientific and engineering tasks.

Even more exciting is when you consider additional system attributes that could be evaluated and holistically optimized, such as system power or performance.

This new approach will have tremendous impact. Industries such as autonomous vehicles, AR/VR systems, smart cities, factories, and homes, security and surveillance, and medical imaging will continue to push state of the art requirements for vision systems. This ongoing innovation will continue to drive increasing complexity, resulting in a tipping point where the current tuning methods can no longer keep up. The only way forward is with a paradigm shift to intelligent and automated optimization.