The trajectory of electronics technology over the past half century is clear: Moore’s Law scaling, the explosion of software and ubiquitous networks have combined to change all communication, all industrial operations and all media completely and irrevocably. We casually label all these changes with the phrase “intelligent systems”, attributing human-like characteristics to them, but the label is misleading. Our phones, our laptops, our Wi-Fi connections and our social media sites may be built by, and channel, human intelligence, but their behavior cannot truly mimic human behaviors or rival human intellectual strengths. The phrase “ambient intelligence” has been tossed around in recent years too, typically meaning, according to Wikipedia, some loose combination of embedded, context-aware, personalized, adaptive and anticipatory systems. But the industry is getting more ambitious and focused in creating ambient intelligence, leveraging the on-going revolution in cognitive computing and neural networks.
After simmering on research back burners for a couple of decades, a new wave of cognitive computing methods has emerged. These methods, built on “deep learning” and neural networks, come much closer to mirroring human recognition and even reasoning abilities. Face recognition, classification of objects, description of complex scenes, and continuous speech recognition and translation have all advanced dramatically in the last five years. In some cases, in computer vision for example, neural-network-based object classification accuracy now exceeds that of humans. Figure 1 shows the winning image recognition error rate on the ImageNet Visual Recognition Benchmark, which tests classification of objects into 1000 different categories, for each year since 2010. After years of slow progress in recognition rate, the introduction of neural networks, marked in green, triggered an unexpected acceleration in recognition rates, to exceed those of human classifiers (who show an error rate of about 5% on this test).
It seems likely that human-level recognition will progress fairly rapidly in two key dimensions. First, the recognition rates for still images, video sequences and natural language will improve still more, making machine recognition competitive with human recognition in ever more sophisticated and seemingly ambiguous scenarios. Second, these recognition algorithms will evolve into potent tools for reasoning about subtle context, likely intent, and preferred strategies for response in complex situations. At the same time, these powerful recognition and reasoning mechanisms may be combined with data networks that provide seamless access to global, real-time information on the environment, surrounding human activity, social interactions and business transactions. Together, the sharp improvement in recognition and human-like reasoning, plus the data for context, have a good chance of enabling true intelligence, reminiscent of human cognitive capacity, as a useful add-on to our interactions with the world. Furthermore, that intelligence will not just respond to explicit user demands, but will be implicit in the alertness of the electronics systems, constantly monitoring the changing environment across many sensors, and across many data streams. This gives us prospects for real ambient intelligence.
But where did these developments come from, and what do they mean for systems, especially for semiconductor-based electronics?
Three forces have triggered this explosion of new technology. First, the Internet has aggregated previously unimaginable reservoirs of raw data, capturing a vivid, comprehensive, but incoherent picture of the real world and human activity. This becomes the foundation from which we can train models of reality, unprejudiced by oversimplified synopses.
Second, progress in computing and storage has made it practicable to implement large-scale model training processes, and to deploy useful inference-based applications using those trained models. Amid hand-wringing over the so-called “death of Moore’s Law” we actually find that a combination of increasingly efficient engines and massive parallelism is giving us sustained scaling of compute capability for neural networks. Today, GPUs and FPGAs are leading hardware platforms for training and deployment, but we can safely bet that new platform architectures, built from direct experience with neural network algorithms, are just around the corner.
Third, we have seen rapid expansion of understanding of the essential mechanisms and applications of neural networks for cognition. Universities, technology companies and end-users have quickly developed enthusiasm for the proposed benefits, even if the depth of knowledge is weak. This excitement translates into funding, exploratory developments and pioneering product developments.
These three triggers – massive data availability, seriously parallel computing hardware, and wide enthusiasm – set the scene for the real work of bringing neural networks into the mainstream. Already, we see a range of practical deployments, in voice processing, automated translation, facial recognition and automated driving, but the real acceleration is still ahead of us.
One of the most intriguing and powerful aspects of deep learning is the special role of example data. The intelligence of these systems does not come from their programming, but from their training – the automated derivation of a highly accurate model of desired response based on thousands to millions of example input stimuli (images, video sequences, audio) along with the expected responses. The training process effectively deduces the common features of the stimulus uniquely associated with each expected response, so that the trained system can respond the same way to other occurrences of the same semantic input, even when it occurs in a very different context or background. The ability to recognize complex, obscure patterns relies upon building deep layers of computational “neurons” with response weights tuned during the training process. Once the network is trained, the same network structure and weights can be deployed on the same hardware, or on a quite different hardware platform, to do recognition tasks – performing inference on unlabeled data such as images. In this case, the purpose of the inference is not to compute errors, but to identify the most likely label or labels for the input data. The relationship between training and inference for image recognition is sketched in Figure 2.
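The train-then-deploy split can be sketched in a few lines of NumPy. In this toy example (all dimensions, data and learning-rate choices are illustrative, not drawn from any production system), training tunes the weights of a tiny two-layer network on labeled points, and inference then reuses the same structure with frozen weights on new data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled examples: 2-D points, labeled 1 if inside the unit
# circle, 0 otherwise (a stand-in for images with known labels).
X = rng.uniform(-2, 2, size=(512, 2))
y = (np.sum(X**2, axis=1) < 1.0).astype(float)

# A tiny two-layer network; the weights are what training tunes.
W1 = rng.normal(0, 0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

def forward(X, W1, b1, W2, b2):
    h = np.tanh(X @ W1 + b1)              # hidden-layer "neurons"
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))  # predicted probability of label 1
    return h, p.ravel()

# Training: repeatedly adjust the weights to reduce the error between
# predicted and expected responses (plain full-batch gradient descent).
lr = 1.0
for step in range(3000):
    h, p = forward(X, W1, b1, W2, b2)
    g = (p - y)[:, None] / len(X)         # gradient at the output logit
    g_h = (g @ W2.T) * (1 - h**2)         # back-propagated to hidden layer
    W2 -= lr * (h.T @ g);   b2 -= lr * g.sum(axis=0)
    W1 -= lr * (X.T @ g_h); b1 -= lr * g_h.sum(axis=0)

# Inference: the same structure and frozen weights, applied to data.
_, p_out = forward(X, W1, b1, W2, b2)
accuracy = np.mean((p_out > 0.5) == (y > 0.5))
print(f"accuracy after training: {accuracy:.2f}")
```

The trained weight arrays `W1, b1, W2, b2` are just numbers; they could be copied unchanged to a very different hardware platform and produce the same answers.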
Convolutional neural networks (CNNs) are a particularly important class of neural networks, in which the same weights are applied uniformly across different sections of the input, and the response function is a sum-of-products of inputs and weights, in the form of a dot product or convolution operation. Typical CNNs use 8 to 150 distinct layers of pattern recognition, building up a deep stack of recognition layers that respond to increasingly complex patterns, and patterns of patterns.
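The weight sharing and sum-of-products structure can be shown directly. The sketch below (the kernel values and image are illustrative; real CNNs learn their kernels during training) slides one shared 3x3 weight set across an image, computing a dot product at every position – the operation deep-learning frameworks call convolution (strictly speaking, cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide one shared weight kernel over the
    image, computing a sum of products at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every position
    return out

# A hand-written vertical-edge detector: the identical 3x3 weights
# respond wherever a dark-to-bright transition occurs in the image.
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])
image = np.zeros((5, 8))
image[:, 4:] = 1.0          # dark left half, bright right half
response = conv2d(image, kernel)
print(response)             # nonzero only at the edge columns
```

Because one small set of weights covers the whole input, the weight storage is tiny relative to the amount of computation – a property with direct hardware consequences.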
These characteristics drive the architecture of CNN inference hardware, especially in embedded systems where compute rates and memory bandwidth have directly quantifiable impact on throughput, power dissipation and silicon cost. We might expect to see the following hardware capabilities dominating the CNN inference hardware choices:
- Multiply-accumulate (MAC) rate, and efficiency measured in MACs per watt, MACs per mm² and MACs per dollar.
- Memory capacity, especially local or on-chip memory capacity, to hold weights, input data and intermediate neuron result data.
- Memory and bus bandwidth to bring data, and especially coefficients, into the compute engines.
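A back-of-envelope sizing of a single convolutional layer shows why these three metrics dominate. The dimensions below are illustrative (a mid-size 3x3 layer, not taken from any specific network):

```python
# Back-of-envelope sizing for one convolutional layer (illustrative numbers).
H, W = 224, 224        # feature-map height and width
C_in, C_out = 64, 128  # input and output channels
K = 3                  # 3x3 kernel

macs = H * W * C_in * C_out * K * K   # multiply-accumulates per frame
weights = C_in * C_out * K * K        # model weights for this layer
activations = H * W * (C_in + C_out)  # input + output neuron values

fps = 30
print(f"MACs per frame:   {macs / 1e9:.2f} G")
print(f"MACs/s at 30 fps: {macs * fps / 1e9:.0f} G")
print(f"weights:          {weights / 1e3:.1f} K values")
print(f"activations:      {activations / 1e6:.2f} M values")
```

Even this one layer demands billions of MACs per frame, while its weights fit comfortably in on-chip memory – exactly the balance the metrics above capture.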
Even as ambient intelligence systems are evolving rapidly, we can make some useful forecasts of the implications for semiconductor platforms:
- Different system-on-chip devices will be built for a diverse range of end applications – mobile phones, cloud computing, automated driving, industrial IoT – but will increasingly include major deep neural network subsystems. These subsystems will likely become more widely used and more centrally important than other popular computing subsystems for graphics, audio or wireless communications modems. While some infrastructure platforms may target a combination of neural network training and inference, most high volume platforms will focus on maximizing performance and efficiency for inference.
- Convolutional neural networks tuned for very different tasks share one important and unusual technical characteristic – on average they perform many tens or hundreds of compute operations for each loaded input operand or model weight value. This means that systems with even modest external memory and storage bandwidth can usefully exploit massively parallel computation structures. Specialized neural network processing engines, with hundreds or thousands of parallel multipliers, can deliver the equivalent of teraflops (10^12 multiply-add ops per second) of computing power in a small silicon area and power footprint.
- Ambient intelligence depends on alertness – constant vigilance analyzing video, audio and other sensor streams. The sensing systems will often be deployed in massive numbers, and fit small cost budgets and power profiles. A high degree of specialization of the silicon to the specific sensor data stream will be common. In addition, the aggregate bandwidth of raw data, especially video data, from massive deployment will often make cloud-based cognitive processing unfeasible, especially for wireless-connected sensors. This means that much of the cognitive processing must be done locally, based on neural networks downloaded periodically from the cloud. The sensors become very intelligent cognition nodes that pass only high-level recognition events of interest up to the cloud, and that locally generate detailed actions – video, audio and mechanical activation.
- System intelligence based on software programming is transitioning to system intelligence based on trained neural networks. Neural networks are remarkably portable, so that the exact same network structure and trained network weights can be deployed on a CPU, a GPU, an FPGA or a DSP without loss of accuracy, with power consumption and performance being the main variables. This raises the level of abstraction and reuse for system developers and makes hardware platform development more flexible. Specific processor architectures and their software ecosystems will have less of a lock on the market.
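The compute-reuse claim above – many operations per loaded operand – follows from simple arithmetic. Using illustrative dimensions for a mid-size 3x3 convolutional layer:

```python
# Operand reuse in a 3x3 convolutional layer (illustrative dimensions).
H, W = 224, 224         # spatial size of the feature map
C_in, C_out = 64, 128   # input and output channels
K = 3                   # kernel size

macs = H * W * C_in * C_out * K * K   # total multiply-accumulates
weights = C_in * C_out * K * K        # distinct weight values loaded
inputs = H * W * C_in                 # distinct input values loaded

print(macs // weights)  # each weight is reused at every spatial position (H*W)
print(macs // inputs)   # each input feeds every output channel and tap (C_out*K*K)
```

With tens of thousands of MACs per loaded weight and over a thousand per loaded input, memory bandwidth stops being the bottleneck, and thousands of parallel multipliers can be kept busy.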
In some important ways, the era of ambient intelligence is already visible in real-world deployment. We see it in self-driving cars from Google and others. We see it in Facebook software that silently recognizes all the individuals in pictures passing through the service. We see it in voice-recognition systems like Siri on Apple products, the Assistant on Google Home and Alexa on Amazon Echo. All of these qualify as examples of sophisticated ambient intelligence, and most, if not all, leverage neural networks at the core of that intelligence.
The rise of cognitive computing and ambient intelligence will not always be smooth. It is safe to predict two types of hurdles. On one hand, the technology will sometimes fail to deliver on promises, and some once-hot cognitive computing technique will be abandoned in favor of new network structures, training methods, deployment platforms and application frameworks. We may even think sometimes that the ambient intelligence trend has failed. On the other hand, there will be days when the technology appears so powerful as to be a threat to our established patterns of work and life. It will sometimes appear to achieve a level of intelligence, independence and mastery that frightens people. We will ask, sometimes justifiably, if we want to put decision making on key issues of morality, liberty, privacy and empathy into the hands of artificial intelligences.
Nevertheless, I remain an optimist, on the speed of progress and depth of impact, as well as on our ability and willingness to shape this technology to fully serve the ends of mankind.