Over the past 30 years processor development has been on a tremendous path. Transistor densities rose according to Moore’s Law, and the clock frequencies of processors and their cores are now in the 4 GHz range. Due to the vast number of transistors available on a reasonably-sized die, even processors intended for consumer devices have at least 4 cores, and usually Level 1, Level 2 and oftentimes even Level 3 Caches. Unfortunately, the development of memory, particularly DRAM, has focused more on density than on performance, forcing the use of ever more levels of ever-larger caches to hide the disparity in performance between memory and processor cores.
While Synchronous Dynamic Random Access Memory (SDRAM) went from Single Data Rate to Double Data Rate, now available in version 4 (DDR4), the DRAM cells and their core clock frequencies have not risen at the same rate as processor clock frequencies. The DRAM core clock frequencies have increased from 100 MHz to still below 500 MHz today, while the interfaces gained some performance, currently at a 3200 MHz symbol rate. However, two limiting factors remain: the interface is still SSTL-2, and although the interfaces have a higher bandwidth and a marginally lower latency than 30 years ago, the number of wait states is still tremendously high because the internal memory bandwidth is very limited. For all practical purposes, a DDR3 or DDR4 SSTL-2 interface delivers 17 to 21 GB/s of bandwidth in burst mode. That interface is half-duplex, and it does not support a request queue. In other words, if reordering should (or must) occur, the DRAM Controller has to execute that function. However, the DRAM Controller is on the CPU die (part of the “uncore”).
DDR3 and DDR4 are fairly wide buses and require 208 pins or balls each, and that limits total processor I/O. Four of those interfaces take up 832 pins or balls, and six of them require a total of 1248 pins or balls. While dice can be made to support several thousand bumps, packages are for cost reasons limited to around 2500 balls or pins. As a result, six DDR3 or DDR4 interfaces already take up about half of all pins or balls available on a package, and the total I/O achievable is around 100 – 120 GB/s.
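The pin and bandwidth budget above can be summarized in a short back-of-the-envelope calculation; the per-channel pin count, the roughly 2500-ball package budget and the 17 – 21 GB/s usable bandwidth are the figures quoted above, not measured values.

    # Back-of-the-envelope DDR3/4 pin and bandwidth budget (figures as quoted above).
    PINS_PER_DDR_CHANNEL = 208     # pins or balls per DDR3/DDR4 interface
    PACKAGE_BALL_BUDGET = 2500     # practical package limit
    USABLE_BW_PER_CHANNEL = 20.0   # GB/s, midpoint of the 17 - 21 GB/s range

    for channels in (4, 6):
        pins = channels * PINS_PER_DDR_CHANNEL
        share = pins / PACKAGE_BALL_BUDGET
        bandwidth = channels * USABLE_BW_PER_CHANNEL
        print(f"{channels} DDR channels: {pins} balls "
              f"({share:.0%} of the package), ~{bandwidth:.0f} GB/s")
    # 4 DDR channels:  832 balls (33% of the package), ~80 GB/s
    # 6 DDR channels: 1248 balls (50% of the package), ~120 GB/s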
If an even higher level of I/O is required, then at this point in time on existing processors that I/O is limited to communication between two or more ASICs on a single substrate. There is currently no processor in existence that provides an I/O bandwidth exceeding 200 GB/s off-chip. On-substrate bandwidth today is limited to about 1 TB/s, and it is restricted to package-internal devices such as a GPU and an HBM or HBM2 DRAM ASIC. While that of course increases the processor’s (or GPU’s) total I/O bandwidth, the HBM or HBM2 module essentially becomes an L4 Cache.
As a result, we now have a situation in which processors have vastly outgrown the ability of the main (i.e. random access) memory or DRAM to deliver the required data. This would be bad enough if the computational requirements had remained the same, but unfortunately we observe a variety of trends indicating that our appetite for CPU cycles increases at an exponential rate. Big Data, “The Cloud”, Machine and Deep Learning as well as Artificial Intelligence, on top of the ever-growing needs of traditional HPC, require new approaches.
It is apparent that Flash on the DRAM bus will need to understand and execute DRAM commands. Alternatively, there will have to be a converter from DDR3/4 DRAM to (ONFI) Flash, but in either case there is going to be additional latency. The problem of limited pin count is not solved either. Density is also still an issue, and as a result, there are going to be PCIe-attached Flash and conventional SSDs in those servers.
Hyperconverged Servers – History and Architecture
Initially, supercomputers were very different and distinct from any regular computer, and they were based on special (mostly GaAs) processors. They differed from minicomputers, mainframes and from PCs, and they did not deploy any processor used in any of the other computer categories. At that time, internal interconnects were vastly faster (both in bandwidth and in latency) than anything a network could offer, so compute and storage were in the same computer. A supercomputer would contain not only the compute subsystem, but also the storage subsystem. The same was true for minicomputers and PCs and of course PC-based servers.
It turned out that by not building special supercomputers and special processors for those supercomputers, and instead using networked industry-standard PC servers, more MIPS and FLOPS could be bought for the same price. These new supercomputers were PC-based servers connected via the fastest interconnects that were available. That worked out reasonably well, but the storage tasks interfered with the compute tasks in a negative way. The discrepancy between the performance of the compute tasks and the available disk I/O kept growing. The System Administrators of this type of supercomputer profiled typical loads and tasks, and they found that the compute tasks were held back and in general waited for disk I/O a good portion of the time. Hard disks were very slow, tapes were even slower, and data caching and management was needed. Disk caching and storage management was put as software onto clusters of industry-standard servers to take over all storage management tasks. Those tasks were computationally only moderately intensive and merely required large memory on processors that were cheaper than the compute cluster processors.
This allowed supercomputers to be logically split into two different partitions, namely the compute clusters and the storage clusters. Compute clusters focused on compute, i.e. they contained lots of CPUs and accelerators and memory. They worked on the initial problem set that the supercomputer was designed for, and with the partitioning they were relieved of all storage functions.
The storage tasks included primarily disk caching and other storage and storage management tasks, including snapshots, de-duplication, tape operations, backup and restore as well as archiving and journaling. These tasks became more and more sophisticated and ultimately migrated into a set of industry-standard servers managing a very large number of hard disks, and that combination evolved into what is known as the storage appliance. Doing so cut down on cost while improving performance. This architecture worked well for as long as the network was faster than cached disk I/O. The advent of Flash in the form of SAS- or SATA-attached SSDs started to change this. PCIe-attached storage provided levels of performance that network-attached storage simply could not match any more.
Oftentimes special network infrastructure was needed for the Storage Nodes, such as InfiniBand or FibreChannel. While those networks provided lower latencies, they forced the supercomputer (or data center) operator to set up and maintain two separate network infrastructures.
With Flash memory now being mainstream and offering tremendous performance per Dollar spent, dedicated storage appliances are oftentimes unnecessary, and thus the old is new again, except for the new name: the “hyperconverged server”. Flash is used in servers to improve density and performance. While density is ever-increasing, performance is limited by the infrastructure that connects Flash to the CPU. Current processors use DDR3 and DDR4 buses to connect to DRAM and Flash, and PCIe-attached Flash for larger arrays. However, DDR3 and DDR4 are limited to about 17 – 21 GB/s of realistically usable bandwidth, and queuing and reordering must be done in the CPU-internal DDR3/4 Memory Controller if Flash is on the DDR3/4 memory bus, or within the NVMe module if Flash is on a PCIe port.
A very simplified form of a hyperconverged server is depicted in the following schematic.
The following schematic shows the proposed solution for a processor that would be used in one of these hyperconverged servers.
This does not seem to solve the problem as the number of bumps on the die again grows, the power consumption grows and is even more concentrated in one device, and I/O has not really grown other than the added L4 Cache on the same substrate (in a multi-chip module).
Certainly, with more cores available per processor and with more and larger caches, the performance of the processor itself and of the computer will increase, and many more storage-related tasks can be executed on this processor, but the limitation of I/O, including scale-out, remains.
Limitations
The limitations in today’s computers are memory and peripheral I/O bandwidth and latency, namely DDR3/4 and PCIe. Any attempt to scale out the performance of a server is limited by these. With 6 DDR3/4 DRAM ports and up to 64 lanes of PCIe Gen3, a processor quickly approaches the limits of what is economically feasible to provide in a package. The 64 PCIe lanes only support four 16-lane devices, and that means that with a GPU and a 100 Gbit/s NIC being mandatory, the total peripheral I/O to and from other servers is limited to under 12 GB/s through the NIC, which leaves only two ports for NVMe with a total of 2 * 15.75 GB/s of PCIe-attached Flash memory interface bandwidth. In other words, no matter how the problem is approached, scale-out is limited to the network bandwidth of around 12 GB/s (100 Gbit/s) or to 15.75 GB/s per PCIe port. There simply is no infrastructure in current servers that supports any more scale-out. Even trading off NVMe Flash versus NICs does not solve the problem.
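A minimal sketch of that lane budget, assuming the figures quoted above (64 Gen3 lanes, 15.75 GB/s per 16-lane port, one x16 GPU and one x16 slot for a 100 Gbit/s NIC):

    # PCIe Gen3 lane budget as described above (assumed figures).
    TOTAL_LANES = 64
    LANES_PER_PORT = 16
    PORT_BW = 15.75                              # GB/s per Gen3 x16 port

    ports = TOTAL_LANES // LANES_PER_PORT        # 4 x16 ports
    gpu_ports, nic_ports = 1, 1                  # mandatory devices
    nvme_ports = ports - gpu_ports - nic_ports   # 2 ports left for NVMe

    nic_bw = 100 / 8                             # 100 Gbit/s NIC, under 12 GB/s usable
    nvme_bw = nvme_ports * PORT_BW               # 2 * 15.75 = 31.5 GB/s

    print(f"{ports} x16 ports at {PORT_BW} GB/s each")
    print(f"scale-out via NIC: <{nic_bw:.1f} GB/s; NVMe total: {nvme_bw} GB/s")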
Both NICs and NVMe cards share the limits of today’s peripheral I/O infrastructure within the server. PCIe is a switched infrastructure that provides around 15.75 GB/s per port for Gen3 in a 16-lane configuration. As a result, it is blocking-free and shows no contention. However, each port requires pins or balls on the CPU or an equivalent NorthBridge. This brings us back to the problem we saw in the CPU schematics. The dilemma is that no matter how we try to scale out, we are limited by the I/O and the infrastructure. That is true for storage, for accelerators and for clustering through NICs.
As an example of why this worked fine for hard disks and tape, but starts to fail for SSDs and definitely does not provide benefits for PCIe-attached NVMe, let’s consider a hard disk with a peak I/O bandwidth of 300 MB/s. Let’s use 8 of them in an array and assume that their I/O bandwidth is cumulative. Then we’d deal with 2.4 GB/s per server. That is easily dealt with by the server CPU and won’t even start to saturate a single PCIe Gen3 8-lane port. An SSD can easily maintain 450 MB/s and starts to saturate a SATA channel. 8 of them would peak at 3.6 GB/s, but since even on random accesses their latency is much lower than that of a hard disk, the sustainable data rate would be vastly higher than that of the hard disk array. Nevertheless, even that array would not saturate a single Gen3 16-lane PCIe port. As a result, the total performance would be higher if we used a server to cache all disk accesses and connected that server to the compute node via 100 Gbit/s Ethernet. The nominal data rate on that interface is at or around 10 – 12 GB/s, and that would start to saturate a PCIe Gen3 16-lane port. If, however, we chose to deploy a PCIe card with 8 M.2 NVMe cards built into it, we will easily saturate the PCIe port, as each of the M.2 cards peaks at above 2 GB/s, and therefore 8 of them will saturate a Gen3 16-lane PCIe port.
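The same comparison reduced to a few multiplications, using the per-device rates assumed in the example above:

    # Aggregate bandwidth of the three storage arrays from the example above.
    PCIE_GEN3_X16 = 15.75                        # GB/s per 16-lane Gen3 port

    arrays = {
        "8 x HDD (0.3 GB/s each)":       8 * 0.3,    # 2.4 GB/s
        "8 x SATA SSD (0.45 GB/s each)": 8 * 0.45,   # 3.6 GB/s
        "8 x M.2 NVMe (2 GB/s each)":    8 * 2.0,    # 16 GB/s
    }

    for name, gb_s in arrays.items():
        verdict = "saturates" if gb_s >= PCIE_GEN3_X16 else "does not saturate"
        print(f"{name}: {gb_s:.1f} GB/s -> {verdict} a Gen3 x16 port")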
In hyperconverged servers latency-sensitive requests to mass storage do not have to traverse the network any more, and that will save some latency, but at the same time disk caching is going to be limited or eliminated. Disk caching is limited to the performance advantage that DRAM brings over PCIe-attached NVMe or arrays of SSDs; as we have seen, we can saturate a PCIe Gen3 16-lane port with 8 parallel M.2 modules, and that performance is very close to the performance of a single DDR3/4 DRAM channel. With increasing performance of M.2 the value of disk caching will be severely diminished.
PCIe Gen3 in the 16-lane variant peaks at less than 16 GB/s, and that is the per-port limit that can be achieved using PCIe-attached Flash or network-attached storage. As a result, high-performance storage was again built into the compute nodes, and only the nearline and offline data storage on SATA disks and tape as a backup is now left on storage appliances. In essence, what used to be a performance-enhancing technology now has become simply a bulk storage and backup/restore technology, possibly with features for de-duplication, archival and journaling. This has caused the convergence of compute and performance-oriented storage, creating the hyperconverged server. Bulk storage and other non-latency-sensitive tasks such as de-duplication and backup/restore will still be the domain of storage appliances.
We are presenting an infrastructure that is based on a standard interface and does not have these limits. We are working with a general-purpose processor provider who is implementing 8 of our interfaces into their CPU, allowing 8 of our memory ASICs to be attached with a single hop, for a total of 4 TB of memory at a bandwidth of 480 GB/s. For systems with two or three hops the total memory size can be increased to 8 or 12 TB, again at 480 GB/s bandwidth.
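A minimal sketch of that single-hop configuration, assuming the per-link and per-ASIC figures detailed later in this document (60 GB/s per UHP link, 512 GB per vlcRAM ASIC):

    # Single-hop configuration: 8 UHP links, one 512 GB memory ASIC per link.
    UHP_BW_GB_S = 60          # GB/s per UHP link, full duplex
    VLCRAM_SIZE_GB = 512      # GB per memory ASIC
    LINKS = 8

    print(f"capacity : {LINKS * VLCRAM_SIZE_GB / 1024:.0f} TB")   # 4 TB
    print(f"bandwidth: {LINKS * UHP_BW_GB_S} GB/s")               # 480 GB/s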
SSRLabs’ UHP versus HBM and Gen-Z
SSRLabs has developed a Universal Host Port (UHP) that is capable of connecting processors to memory, to coprocessors and to accelerators. It supports a coherency domain of any desired size. It is a truly universal and unified port as it can be used as a source or target and as a master as well as a slave interface. It simplifies design and layout as it is capable of polarity and lane reversal internally. Since our ASICs all contain internal switch fabrics, the UHP can be used to scale out, and no external switch fabric is needed. All ports of any of the ASICs we have developed can be connected in any fashion, without needing any external components. As these are interfaces intended to connect to devices outside of the substrate, the bandwidth per pin is comparable to HBM or HBM2, but substantially higher than DDR3/4 SSTL-2 based DRAM. Its latency is higher than HBM or HBM2, but again vastly lower than DDR3/4 SSTL-2 based DRAM. Compared to HBM with its nearly 1100 signal connections needed for 500 GB/s of bandwidth, SSRLabs’ UHP only needs 84 balls (16 lane pairs per direction plus clock, or a total of 68 signals) for 60 GB/s of full-duplex and bidirectional fully queued traffic. 84 connections on a PCB do not pose a problem, but nearly 1100 signal connections for HBM or HBM2 do, and therefore all HBM or HBM2 is good for is essentially an L4 Cache on the same substrate that the CPU or GPU reside on. SSRLabs’ vlcRAM is a 512 GB memory, and more importantly, SSRLabs’ UHPs can be trunked. At 68 signal pins or balls per 60 GB/s, we can trunk 8 of them and provide 480 GB/s of bandwidth between a processor and a coprocessor, and we still only need 544 balls. Processor or coprocessor connectivity to memory is limited to one UH port to and from memory, as the four ports on the memory cannot be trunked; however, since the UH port consists of so few pins or balls, we can use four or eight memories and achieve a total memory bandwidth of 240 or 480 GB/s on 272 or 544 signal pins or balls.
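The signal-count arithmetic behind the trunking claims above, with the HBM figure quoted for comparison:

    # Signal count vs. bandwidth for trunked UHP links (figures as quoted above).
    UHP_SIGNALS = 68    # 16 lane pairs per direction plus clock
    UHP_BW = 60         # GB/s per link, full duplex

    for links in (1, 4, 8):
        print(f"{links} x UHP: {links * UHP_SIGNALS:3d} signals, {links * UHP_BW:3d} GB/s")
    print("HBM/HBM2 : ~1100 signals, ~500 GB/s (on-substrate only)")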
With a Universal Host Port, different types of memory can be used, and the CPU is agnostic to and independent of the memory type. If a more modern memory technology comes around (either denser, faster, lower latency, higher bandwidth, lower power, or a combination of two or more of the above), no processor redesign is necessary. It also allows a combination of memory types to improve density while keeping performance high, and with extremely dense memory directly attached to the CPU, new operating systems and file systems might be possible. If the interface to memory is independent of the memory type, then all it really transmits is a command and data for writes, and a command only for reads. With that simple change the memory can be enabled to reorder commands such that average latency and bandwidth are optimized, and the queue depth is configurable. Cache coherency is fully supported.
SSRLabs’ UHP is based on proven technology and does not require any special processes, nor does it require clock rates in excess of 1 GHz. In fact, working solutions of the UHP have been implemented on a standard 28 nm TSMC process. HBM and HBM2 need a much more advanced process to work.
The difference is even more stark if we compare SSRLabs’ UHPs to Gen-Z. Other than XAUI, Gen-Z does not have a single proven or working PHY, and at this point in time the highest-bandwidth option is a proposed PCIe Gen4-derived PHY that skips FEC. Neither the PHY nor the protocol nor the encoding or scrambling have been proven to work. Even if it can be made to work, the current limitation is 25 Gbit/s per lane pair, and the limit of trunking at this point in time looks like it is set to 256 lane pairs. That would boil down to 256 * 25 Gbit/s divided by 8 bit/B, or 800 GB/s. However, that would need 256 pairs per direction, or 1024 pins or balls, not counting clock, GND or anything else. Again, SSRLabs’ UHP can achieve 960 GB/s (20% more bandwidth than PCIe Gen4-based Gen-Z) on 1088 balls, with the flexibility to subdivide the ports into groups or even provide failover modes, which PCIe Gen4 does not do.
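The comparison restated as a short calculation, using the Gen-Z figures assumed above and 16 trunked UHP links:

    # Peak-bandwidth comparison: proposed PCIe-Gen4-derived Gen-Z vs. trunked UHP.
    genz_lane_pairs = 256                       # assumed trunking limit, per direction
    genz_bw = genz_lane_pairs * 25 / 8          # 25 Gbit/s per pair -> 800 GB/s
    genz_pins = genz_lane_pairs * 2 * 2         # pairs x 2 pins x 2 directions = 1024

    uhp_links = 16
    uhp_bw = uhp_links * 60                     # 960 GB/s
    uhp_balls = uhp_links * 68                  # 1088 balls

    print(f"Gen-Z: {genz_bw:.0f} GB/s on {genz_pins} signal pins")
    print(f"UHP  : {uhp_bw} GB/s on {uhp_balls} balls "
          f"({uhp_bw / genz_bw - 1:.0%} more bandwidth)")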
SSRLabs’ UHP uses IPv6-derived addressing for a total of 128 address bits, which is vastly more than Gen-Z with only 64 local address bits plus an additional 12 bits of global addresses. Gen-Z reintroduces segmented addressing that the industry just got rid of with the introduction of 64-bit processors, so clearly again this is a step in the wrong direction. SSRLabs’ UHP can set any group member as a member of a coherency domain and will include it in a ccNUMA scheme. Coherency domain members will be treated preferentially in the memory for all RMW operations, and all other operations can be locked.
UHP supports a very simple yet flexible command set consisting of read, write, read queued, write queued, RMW or locked atomic cycles (lock is a tag that can be attached to any operation), plus a command word for internal operations. UHP provides full semaphore support. All writes are posted (“fire and forget”), and therefore to the processor there is no difference between our write and queued write, but internally in the memory they are processed differently. The queue depth can be set to any value between 1 and 256 if the application requires it, and for timing-critical operations a read or RMW can be tagged as non-queuable, so that it will be processed in the sequence of arrival.
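A purely illustrative sketch of that command set; the names, fields and encoding below are hypothetical and only restate the options described above:

    # Hypothetical illustration of the UHP command set described above.
    from dataclasses import dataclass
    from enum import Enum, auto

    class UhpCommand(Enum):
        READ = auto()
        WRITE = auto()               # posted ("fire and forget")
        READ_QUEUED = auto()
        WRITE_QUEUED = auto()        # posted; may be reordered inside the memory
        RMW = auto()                 # read-modify-write / atomic cycle
        INTERNAL = auto()            # command word for internal operations

    @dataclass
    class UhpRequest:
        command: UhpCommand
        address: int                 # 128-bit, IPv6-derived addressing
        data: bytes = b""            # payload for writes, empty for reads
        locked: bool = False         # lock tag attachable to any operation
        non_queuable: bool = False   # process strictly in order of arrival

    # Example: a timing-critical atomic operation that must not be reordered.
    request = UhpRequest(UhpCommand.RMW, address=0x2001 << 112,
                         data=b"\x01", locked=True, non_queuable=True)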
SSRLabs’ UHP could replace DDR4 and DDR5 as well as Gen-Z and augment both HBM and HBM2, and as it puts the memory controller onto the memory ASIC, it offers greatly improved flexibility to the CPU. We do understand that HBM and HBM2 are going to be used as fairly large L4 Caches, and that at first seems to make sense, but it requires the HBM/HBM2 die to be on the same substrate as the CPU or GPU, thus increasing the power consumption and heat dissipation of that ASIC. Both power and heat density will increase and become less homogeneous, which makes cooling more difficult. We believe that our approach of spreading out the power consumption and heat dissipation over multiple chips makes more sense.