By Purna Mohanty, VP Engineering and Operations USA, Tessolve
With the latest progress in cloud-based computing, emerging data science, and real-time applications for AI, deep learning, and machine learning, System-on-Chip (SoC) development is no longer what it was in the late ’90s. Architectural exploration has become a key ingredient of any SoC that requires the highest performance with the lowest power consumption. We need engineers who think differently, even at the implementation level, who have the ability to think beyond their normal call of duty – to think at a system level and application level.
Developmental Stages of an HPC System:
High performance is achieved through high compute power combined with massive parallelism. Reliability, Availability, Serviceability (RAS) is a critical feature to consider during architectural exploration, since it governs the processing of many complex computational tasks. Dynamic provisioning of tasks may also need to happen on the fly, via an intelligent task-provisioning engine implemented in hardware. To achieve these end goals, you need high-speed connectivity or I/Os, along with low-latency, high-throughput memory. How do we manage such complex interaction amongst compute engines, I/Os, and memories? It is accomplished by a highly complex block called the network-on-chip (NoC), which connects all these peripherals.
The development of an HPC chip goes through the following key stages:
Figure 1: SoC based HPC System Design Flow
Even in high-performance systems, the key parameters – Power, Performance, Area (PPA) – play a critical role. Whether the chip is deployed in a data center or in a mobile environment, it is important that the chip stays green by reducing power consumption. Along with performance, power plays a key role during the architecture and micro-architecture phases of the design, where power and performance trade-offs are made. The network-on-chip (NoC) plays a central role in a scalable high-performance system.
Properties of a Network-on-chip (NoC) in a High-Performance System:
The network-on-chip (NoC) is the artery and vein of a high-performance system, and it embodies the key properties of such a system. It is therefore worth discussing the ideal architectural properties of a NoC for an HPC system.
Cache Coherency:
A cache coherency engine is required to keep processors, I/Os, accelerators, and memories in sync. Designing and verifying such an engine is highly sophisticated, and formal methods are used to sign off the verification of such engines.
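To make concrete what a coherency engine must track, here is a minimal sketch of a classic MESI state table for a single cache line. This is an illustration of the general technique, not the CHI protocol's actual state machine; the state and event names are assumptions for illustration.

```python
# Illustrative MESI transition table for one cache line, of the kind a
# coherency engine must maintain for every line shared across agents.
MESI = {
    # (current_state, event) -> next_state
    ("I", "local_read"):   "S",   # read miss; another copy may exist
    ("I", "local_write"):  "M",   # write miss; gain exclusive ownership
    ("S", "local_write"):  "M",   # upgrade: other sharers are invalidated
    ("S", "remote_write"): "I",   # another agent wrote; our copy is stale
    ("E", "local_write"):  "M",   # silent upgrade, no interconnect traffic
    ("E", "remote_read"):  "S",   # share our clean copy
    ("M", "remote_read"):  "S",   # write back dirty data, then share
    ("M", "remote_write"): "I",   # write back, then invalidate
}

def next_state(state: str, event: str) -> str:
    """Return the next MESI state; events with no entry leave state unchanged."""
    return MESI.get((state, event), state)
```

Formal sign-off of a real engine amounts to proving that no sequence of such events can ever leave two agents both believing they hold a modified copy.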
Low Latency Access:
Low-latency techniques involve dynamic allocation of the datapath that data traverses from source to destination. Different data-routing implementations are possible depending on the topology, but the key is to take the shortest route and avoid hotspots. Latency is usually measured in number of clock cycles. Another way of reducing memory-access latency is to place a cache in front of the main memory. This is sometimes called an L3 cache, the processing engine itself holding the L1/L2 caches.
Dynamic Power Control:
Dynamic power control is a key feature of any computing system. Depending on the load on the system, the NoC should be able to dynamically scale its operating frequency up or down. It should also be able to switch off portions of itself to save power. If more sophisticated control is needed, the internal datapath width can be controlled too.
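The control loop described above can be sketched as a table of operating points indexed by load: pick a frequency and an active datapath fraction from the measured utilization. The operating points and thresholds here are illustrative assumptions, not real silicon values.

```python
OPERATING_POINTS = [
    # (max_utilization, frequency_mhz, active_datapath_fraction)
    (0.25, 400, 0.25),   # light load: low clock, most slices power-gated
    (0.50, 800, 0.50),
    (0.75, 1200, 0.75),
    (1.00, 1600, 1.00),  # full load: full clock, full datapath width
]

def select_operating_point(utilization: float):
    """Return (frequency_mhz, datapath_fraction) for a NoC load in [0, 1]."""
    u = min(max(utilization, 0.0), 1.0)
    for threshold, freq, width in OPERATING_POINTS:
        if u <= threshold:
            return freq, width
    return OPERATING_POINTS[-1][1:]
```

A real controller would also add hysteresis so the NoC does not oscillate between operating points on a bursty workload.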
Reliability, Availability, Scalability, Serviceability:
The reliability aspects play a key role in protecting against defects and malfunction. Innovative methods of Design for Excellence (DFX) are required. Beyond the traditional implementation areas of DFT and DFM, there is a need to detect, report, and possibly correct errors in datapaths, memories, hard macros, analog components, etc. while the chip is running.
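The simplest form of the in-field error detection described above is adding redundancy to datapath words. The sketch below uses even parity, which detects any single-bit flip; real designs typically use stronger codes such as SECDED ECC on memories and CRC on links. The 32-bit word width is an assumption.

```python
def parity_bit(word: int, width: int = 32) -> int:
    """Even-parity bit over a `width`-bit word (1 if the count of 1s is odd)."""
    return bin(word & ((1 << width) - 1)).count("1") & 1

def check(word: int, stored_parity: int, width: int = 32) -> bool:
    """True if no odd-count bit error is detected against the stored parity."""
    return parity_bit(word, width) == stored_parity
```

On detection, the hardware would report the error to a RAS log and, with a correcting code, repair it transparently while the chip keeps running.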
Availability and scalability are critical attributes that define how a chip can scale up a system. For example, say the chip is developed to connect 128 processors, 2 PCIe ports, and 2 DDR interfaces. We can build a system using this one chip on the platform. However, if the system requires scalability and 128 processors are not sufficient, multiple chips can be deployed on the platform. If the platform has 4 chips, the system will appear to have 512 processors, 8 PCIe ports, and 8 DDR interfaces. In such a case, there needs to be inter-chip communication in place. During low-load periods, the system should shut itself off partially in order to save power, but wake this part up when the workload rises. This means that several processors can go to sleep, yet become available when required.
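The scaling arithmetic in the example above can be written down directly: each chip contributes a fixed set of resources, so the platform's visible resources grow linearly with chip count (ignoring inter-chip link overhead, which a real system cannot).

```python
# Per-chip resources from the example in the text.
PER_CHIP = {"processors": 128, "pcie_ports": 2, "ddr_interfaces": 2}

def platform_resources(num_chips: int) -> dict:
    """Total resources visible on a platform built from `num_chips` chips."""
    return {name: count * num_chips for name, count in PER_CHIP.items()}
```

With 4 chips this yields 512 processors, 8 PCIe ports, and 8 DDR interfaces, matching the figures in the text.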
Serviceability becomes a critical feature when you perform hardware updates to the platform while the system is running – hot plugging and hot swapping. For example, the chip should have the provision of gracefully shutting down interfaces as and when required. This is mainly used to remove and add external devices while the chip remains functional.
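A graceful interface shutdown of this kind is typically staged: stop accepting new requests, drain in-flight transactions, then power the port down. A minimal sketch, with stage names that are assumptions for illustration:

```python
# Ordered quiesce stages for hot-unplugging an interface.
SHUTDOWN_SEQUENCE = ["active", "draining", "quiesced", "powered_off"]

def step(state: str) -> str:
    """Advance one stage toward powered_off; powered_off is terminal."""
    i = SHUTDOWN_SEQUENCE.index(state)
    return SHUTDOWN_SEQUENCE[min(i + 1, len(SHUTDOWN_SEQUENCE) - 1)]
```

Bringing a device back follows the reverse path, which is what makes hot swapping safe while the rest of the chip keeps running.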
High Throughput:
The NoC should never be a bandwidth bottleneck for the memories, I/Os, and compute engines it communicates with. The ideal role of the NoC is to make sure that its ports run at the highest throughput theoretically possible. For example, a typical 48 cores @ 1.5GHz gives a theoretical max bandwidth of 2301Gbps – both write and read. The interconnect should be designed to sustain such a bandwidth with random traffic running amongst the cores and the memories.
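The bandwidth figure above can be sanity-checked with simple arithmetic. The per-core transfer width is not stated in the text, so the 32 bits per cycle used here is an illustrative assumption:

```python
def aggregate_bandwidth_gbps(cores: int, freq_ghz: float, bits_per_cycle: int) -> float:
    """Theoretical aggregate bandwidth if every core moves bits_per_cycle each clock."""
    return cores * freq_ghz * bits_per_cycle
```

Under that assumption, 48 cores at 1.5 GHz yield 48 × 1.5 × 32 = 2304 Gbps, close to the ~2301 Gbps cited.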
Quality of Service (QoS):
Priority, latency, and bandwidth allocation should be made depending on the traffic class, and any priority amongst the ports should be maintained. This controllability is important: it should be driven both by the traffic class of the task at hand and, as and when required, manually by the user.
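The two control paths described above, class-driven and user-driven, can be sketched as a priority arbiter with a run-time override. The class names and priority values are illustrative assumptions.

```python
# Lower value = higher priority; defaults come from the traffic class.
DEFAULT_PRIORITY = {"realtime": 0, "latency_sensitive": 1, "bulk": 2}

def arbitrate(pending, override=None):
    """Pick which traffic class to serve next from the `pending` class names.

    `override` is an optional dict letting the user remap priorities manually.
    """
    priority = {**DEFAULT_PRIORITY, **(override or {})}
    return min(pending, key=lambda cls: priority[cls])
```

A production QoS scheme would layer bandwidth regulation on top of this so that a starving low-priority class still makes forward progress.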
An Example Topology of a High-Performance Cache Coherent System:
An interface protocol that works effectively for cache coherency applications is AMBA 5 CHI, in combination with CCIX. “CHI” stands for “Coherent Hub Interface”, and “CCIX” stands for “Cache Coherent Interconnect for Accelerators”. CCIX sits on top of the data link layer of the standard and ubiquitous PCI Express, and facilitates cache coherency so that I/Os and accelerator engines can work alongside the ARM processors.
Figure 2: System Platform for a Cache Coherent HPC System
In the example block diagram 2, we have 3 chips: the main chip communicates with multiple processors, I/Os, and memories through a complex cache coherent interconnect. If it needs to communicate with external I/Os that are required to be coherent, it uses the CCIX/PCIe layer. The “Host CPU” works as an accelerator or offload engine for the processors (PROC), taking on data-crunching tasks. The processors (PROC) follow the CHI protocol for communication. The I/O port, Ethernet, communicates using standard AMBA AXI. Finally, memories like DDR communicate using either CHI or AXI, and have the option of implementing an L3 cache to reduce access latency.
A more complex topology is given below in the example block diagram 3. CCIX is a powerful protocol not only for maintaining coherency amongst processors and accelerators/offload engines, but also for communicating across multiple platforms and chips – each platform owning its own address map.
Figure 3: Cache Coherency across two platforms and chips
The future of High Performance Computing looks both evolutionary and revolutionary. CCIX is an evolutionary protocol based on the AMBA CHI protocol, born of the dire need to make I/Os and off-chip accelerators cache coherent and to reduce the latency of communication between processors and accelerators. The revolutionary changes arise from emerging algorithm implementations for AI, deep learning, and machine learning. For example, applications around TensorFlow require not only a different level of architectural and implementation thinking, but also an altogether different mindset for verifying such complex implementations.