Over the past 30 years, processor development has been on a tremendous path. Transistor densities rose according to Moore’s Law, and the clock frequencies of processors and their cores are now in the 4 GHz range. Due to the vast number of transistors available on a reasonably-sized die, even processors intended for consumer devices have at least 4 cores, and usually Level 1, Level 2 and oftentimes even Level 3 Caches. Unfortunately, the development of memory, particularly DRAM, has focused more on density than on performance, forcing the use of ever more levels of ever-larger caches to hide the performance disparity between memory and processor cores. While Synchronous Dynamic Random Access Memory (SDRAM) went from Single Data Rate to Double Data Rate, now available in version 4 (DDR4), DRAM cell and core clock frequencies have not risen at the same rate as processor clock frequencies. DRAM core clock frequencies have increased from 100 MHz to still below 500 MHz today, while the interfaces gained some performance and currently run at a 3200 MHz symbol rate. However, two limiting factors remain: the interface is still SSTL-2, and although the interfaces offer higher bandwidth and marginally lower latency than 30 years ago, the number of wait states is still tremendously high because the internal memory bandwidth is very limited. For all practical purposes, a DDR3 or DDR4 SSTL-2 interface delivers 17 to 21 GB/s of bandwidth in burst mode. That interface is half-duplex, and it does not support a request queue. In other words, if reordering should (or must) occur, the DRAM Controller has to execute that function. However, the DRAM Controller is on the CPU die (part of the “uncore”). DDR3 and DDR4 are fairly wide buses and require 208 pins or balls per interface, which limits total processor I/O. Four of those interfaces take up 832 pins or balls, and six of them require a total of 1248 pins or balls. While dice can be made to support several thousand bumps, packages are limited for cost reasons to around 2500 balls or pins. As a result, six DDR3 or DDR4 interfaces already take up about half of all pins or balls available on a package, and the total I/O achievable is around 100 – 120 GB/s.
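The pin and bandwidth budget described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python; the 208-pin, 17 – 21 GB/s and 2500-ball figures are the ones quoted in this section.

```python
# Back-of-the-envelope check of the DDR3/4 pin and bandwidth budget,
# using the figures quoted in the text.
PINS_PER_DDR_INTERFACE = 208       # pins/balls per DDR3/DDR4 channel
PACKAGE_BALL_LIMIT     = 2500      # economically feasible balls per package
USABLE_BW_PER_CHANNEL  = (17, 21)  # GB/s usable in burst mode

for channels in (4, 6):
    pins = channels * PINS_PER_DDR_INTERFACE
    low, high = (channels * bw for bw in USABLE_BW_PER_CHANNEL)
    print(f"{channels} channels: {pins} balls "
          f"({pins / PACKAGE_BALL_LIMIT:.0%} of the package), "
          f"{low}-{high} GB/s aggregate")

# 4 channels:  832 balls (33% of the package),  68-84  GB/s aggregate
# 6 channels: 1248 balls (50% of the package), 102-126 GB/s aggregate
```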


Figure 1: Schematic of current multi-core processor

If an even higher level of I/O is required, then at this point in time on existing processors that I/O is limited to communication between two or more ASICs on a single substrate. There is currently no processor in existence that provides an off-chip I/O bandwidth exceeding 200 GB/s. On-substrate bandwidth today is limited to about 1 TB/s, and it is restricted to package-internal devices such as a GPU and an HBM or HBM2 DRAM ASIC. While that of course increases the processor’s (or GPU’s) total I/O bandwidth, the HBM or HBM2 module essentially becomes an L4 Cache.

As a result, we now have a situation in which processors have vastly outgrown the ability of the main (i.e. random access) memory or DRAM to deliver the required data. This would be bad enough if the computational requirements had remained the same, but unfortunately we observe a variety of trends that indicate that our appetite for CPU cycles increases at an exponential rate. Big Data, “The Cloud”, Machine and Deep Learning as well as Artificial Intelligence on top of the ever-growing needs for traditional HPC require new approaches.


Figure 2: Current server with Flash on the DRAM Bus and with SSD and NVMe

It is apparent that Flash on the DRAM bus will need to understand and execute DRAM commands. Alternatively, there will have to be a converter from DDR3/4 DRAM to (ONFI) Flash, but in either case there is going to be additional latency. The problem of limited pin count is not solved either. Density also remains an issue, and as a result there are still going to be PCIe-attached Flash and conventional SSDs in those servers.

Hyperconverged Servers – History and Architecture

Initially, supercomputers were very different and distinct from any regular computer, and they were based on special (mostly GaAs) processors. They differed from minicomputers, mainframes and PCs, and they did not deploy any processor used in those other computer categories. At that time, internal interconnects were vastly faster (both in bandwidth and in latency) than anything a network could offer, so compute and storage were in the same computer. A supercomputer would contain not only the compute subsystem but also the storage subsystem. The same was true for minicomputers and PCs, and of course for PC-based servers.

It turned out that by not building special supercomputers and special processors for those supercomputers, and instead using networked industry-standard PC servers, more MIPS and FLOPS could be bought for the same price. These new supercomputers were PC-based servers connected via the fastest interconnects available. That worked out reasonably well, but the storage tasks interfered with the compute tasks in a negative way, and the discrepancy between the performance of the compute tasks and the available disk I/O kept growing. The system administrators of this type of supercomputer profiled typical loads and tasks, and they found that the compute tasks were held back and in general waited for disk I/O for a good portion of the time. Hard disks were very slow, tapes were even slower, and data caching and management were needed. Disk caching and storage management were therefore implemented as software on clusters of industry-standard servers that took over all storage management tasks. Those tasks were only moderately compute-intensive and merely required large memory on processors that were cheaper than the compute cluster processors.

This allowed supercomputers to be logically split into two different partitions, namely the compute clusters and the storage clusters. Compute clusters focused on compute, i.e. they contained lots of CPUs and accelerators and memory. They worked on the initial problem set that the supercomputer was designed for, and with the partitioning they were relieved of all storage functions.


Figure 3: Supercomputer with Compute and Storage Clusters

The storage tasks included primarily disk caching and other storage and storage management tasks, including snapshots, de-duplication, tape operations, backup and restore as well as archiving and journaling. These tasks became more and more sophisticated and ultimately migrated into a set of industry-standard servers managing a very large number of hard disks, and that combination evolved into what is known as the storage appliance. Doing so cut down on cost while improving performance. This architecture worked well for as long as the network was faster than cached disk I/O. The advent of Flash in the form of SAS- or SATA-attached SSDs started to change this. PCIe-attached storage provided levels of performance that network-attached storage simply could not match any more.
Oftentimes special network infrastructure was needed for the Storage Nodes, such as InfiniBand or FibreChannel. While those networks provided lower latencies, they forced the supercomputer (or data center) operator to set up and maintain two separate network infrastructures.


Figure 4: Hyperconverged Server with nearline storage

With Flash memory now mainstream and offering tremendous performance per dollar spent, dedicated storage appliances are oftentimes unnecessary, and thus the old is new again, except for the new name: the “hyperconverged server”. Flash is used in servers to improve density and performance. While density is ever-increasing, performance is limited by the infrastructure that connects Flash to the CPU. Current processors use DDR3 and DDR4 buses to connect to DRAM and Flash, and PCIe-attached Flash for larger arrays. However, DDR3 and DDR4 are limited to about 17 – 21 GB/s of realistically usable bandwidth, and queuing and reordering must be done in the CPU-internal DDR3/4 Memory Controller if Flash is on the DDR3/4 memory bus, or within the NVMe module if Flash is on a PCIe port.

A very simplified form of a hyperconverged server is depicted in the following schematic.


Figure 5: A Hyperconverged Server

The following schematic shows the proposed solution for a processor that would be used in one of these hyperconverged servers.


Figure 6: Schematic of proposed industry-standard multi-core processor

This does not seem to solve the problem: the number of bumps on the die grows again, the power consumption grows and is even more concentrated in one device, and I/O has not really grown, other than through the added L4 Cache on the same substrate (in a multi-chip module).

Certainly, with more cores available per processor and with more and larger caches, the performance of the processor itself and of the computer will increase, and many more storage-related tasks can be executed on this processor, but the limitation of I/O, including scale-out, remains.

Limitations

The limiting factors in today’s computers are memory and peripheral I/O bandwidth and latency, namely DDR3/4 and PCIe. Any attempt to scale out the performance of a server is limited by these. With 6 DDR3/4 DRAM ports and up to 64 lanes of PCIe Gen3, a processor quickly approaches the limits of what is economically feasible to provide in a package. The 64 PCIe lanes only support four 16-lane devices, which means that with a GPU and a 100 Gbit/s NIC being mandatory, the total peripheral I/O to and from other servers is limited to under 12 GB/s through the NIC, and that leaves only two ports for NVMe with a total of 2 * 15.75 GB/s of PCIe-attached Flash memory interface bandwidth. In other words, no matter how the problem is approached, scale-out is limited to the network bandwidth of around 12 GB/s (100 Gbit/s) or to 15.75 GB/s per PCIe port. There simply is no infrastructure in current servers that supports any more scale-out. Even trading off NVMe Flash against NICs does not solve the problem.
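The numbers above follow directly from the PCIe Gen3 link parameters. The following is a minimal sketch in Python; the lane allocation of one x16 GPU, one x16 NIC and two x16 NVMe ports is the one assumed in the text.

```python
# Peripheral I/O budget of a 64-lane PCIe Gen3 processor with the lane
# allocation described above: one x16 GPU, one x16 100 Gbit/s NIC and
# two x16 NVMe ports.
PCIE_GEN3_GT_PER_S = 8          # GT/s per lane
ENCODING           = 128 / 130  # 128b/130b line coding
LANES_PER_PORT     = 16

port_gbs = PCIE_GEN3_GT_PER_S * ENCODING * LANES_PER_PORT / 8  # ~15.75 GB/s per x16 port
nic_gbs  = 100 / 8                                             # 100 Gbit/s ~ 12.5 GB/s raw

print(f"PCIe Gen3 x16 port : {port_gbs:.2f} GB/s")
print(f"100 Gbit/s NIC     : {nic_gbs:.2f} GB/s raw (under 12 GB/s after protocol overhead)")
print(f"2 x NVMe x16 ports : {2 * port_gbs:.1f} GB/s of Flash interface bandwidth")
```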

Both NICs and NVMe cards share the limits of today’s peripheral I/O infrastructure within the server. PCIe is a switched infrastructure that provides around 15.75 GB/s per port for Gen3 in a 16-lane configuration. As a result, it is blocking-free and shows no contention. However, each port requires pins or balls on the CPU or an equivalent NorthBridge. This brings us back to the problem we saw in the CPU schematics. The dilemma is that no matter how we try to scale out, we are limited by the I/O and the infrastructure. That is true for storage, for accelerators and for clustering through NICs.

As an example of why this approach worked fine for hard disks and tape, but starts to fail for SSDs and definitely provides no benefit for PCIe-attached NVMe, let’s consider a hard disk with a peak I/O bandwidth of 300 MB/s. Let’s use 8 of them in an array and assume that their I/O bandwidths add up. Then we’d deal with 2.4 GB/s per server. That is easily handled by the server CPU and won’t even begin to saturate a single PCIe Gen3 8-lane port. An SSD can easily sustain 450 MB/s and starts to saturate a SATA channel. 8 of them would peak at 3.6 GB/s, but since their latency, even on random access, is much lower than that of a hard disk, the sustainable data rate would be vastly higher than that of the hard disk array. Nevertheless, even that array would not saturate a single Gen3 16-lane PCIe port. As a result, the total performance would be higher if we used a server to cache all disk accesses and connected that server to the compute node via 100 Gbit/s Ethernet. The nominal data rate on that interface is at or around 10 – 12 GB/s, which would start to saturate a PCIe Gen3 16-lane port. If, however, we chose to deploy a PCIe card with 8 M.2 NVMe cards built into it, we would easily saturate the PCIe port, as each of the M.2 cards peaks at above 2 GB/s, and therefore 8 of them will saturate a Gen3 16-lane PCIe port.
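The same arithmetic in compact form, as a minimal Python sketch; the per-device figures are the ones used in the example above.

```python
# Aggregate bandwidth of the example arrays above, compared against one
# PCIe Gen3 x16 port (~15.75 GB/s) and a 100 Gbit/s link (~12.5 GB/s raw).
PCIE_X16_GBS = 15.75
NIC_GBS      = 100 / 8

arrays = {
    "8 x HDD @ 0.30 GB/s": 8 * 0.30,  # 2.4 GB/s
    "8 x SSD @ 0.45 GB/s": 8 * 0.45,  # 3.6 GB/s
    "8 x M.2 @ 2.00 GB/s": 8 * 2.00,  # 16 GB/s and up
}

for name, gbs in arrays.items():
    pcie = "saturates" if gbs >= PCIE_X16_GBS else "fits within"
    nic  = "exceeds"   if gbs >= NIC_GBS      else "fits within"
    print(f"{name}: {gbs:4.1f} GB/s -> {pcie} a Gen3 x16 port, {nic} a 100 GbE link")
```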

In hyperconverged servers, latency-sensitive requests to mass storage no longer have to traverse the network, and that saves some latency, but at the same time disk caching is going to be limited or eliminated. The value of disk caching is limited to the performance advantage that DRAM brings over PCIe-attached NVMe or arrays of SSDs; as we have seen, we can saturate a PCIe Gen3 16-lane port with 8 parallel M.2 modules, and that performance is very close to the performance of a single DDR3/4 DRAM channel. With the increasing performance of M.2, the value of disk caching will be severely diminished.
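To put a number on “very close”: a minimal Python sketch comparing the usable DDR3/4 channel bandwidth quoted earlier against a port-limited 8 x M.2 array.

```python
# How much bandwidth headroom DRAM-based disk caching has left over a
# port-limited NVMe array, using the figures quoted in this document.
DDR_CHANNEL_GBS = (17, 21)  # usable GB/s of one DDR3/4 channel
NVME_ARRAY_GBS  = 15.75     # 8 x M.2 behind one saturated Gen3 x16 port

for ddr in DDR_CHANNEL_GBS:
    print(f"DRAM cache at {ddr} GB/s vs NVMe array: {ddr / NVME_ARRAY_GBS:.2f}x")

# ~1.08x to ~1.33x: the bandwidth advantage that disk caching can still
# exploit is small, which is the point made above.
```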

PCIe Gen3 in the 16-lane variant peaks at less than 16 GB/s, and that is the per-port limit that can be achieved using PCIe-attached Flash or network-attached storage. As a result, high-performance storage was again built into the compute nodes, and only the nearline and offline data storage on SATA disks and tape as a backup is now left on storage appliances. In essence, what used to be a performance-enhancing technology now has become simply a bulk storage and backup/restore technology, possibly with features for de-duplication, archival and journaling. This has caused the convergence of compute and performance-oriented storage, creating the hyperconverged server. Bulk storage and other non-latency-sensitive tasks such as de-duplication and backup/restore will still be the domain of storage appliances.

We are presenting an infrastructure that is based on a standard interface and does not have these limits. We are working with a general-purpose processor provider who is implementing 8 of our interfaces into their CPU, allowing 8 of our memory ASICs to be attached with a single hop, for a total of 4 TB of memory at a bandwidth of 480 GB/s. For systems with two or three hops, the total memory size can be increased to 8 or 12 TB, again at 480 GB/s of bandwidth.
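The capacity and bandwidth figures follow from the per-port numbers detailed in the next section (60 GB/s per interface, 512 GB per memory ASIC). The following minimal Python sketch assumes one additional tier of 8 memory ASICs per extra hop, which matches the 8 and 12 TB figures above.

```python
# Capacity and bandwidth scaling of the proposed configuration: 8 UHP
# interfaces on the CPU at 60 GB/s each, 512 GB per memory ASIC (the
# vlcRAM density given below), and 8 additional ASICs per extra hop
# (an assumption consistent with the 8 and 12 TB figures in the text).
PORTS        = 8
GBS_PER_PORT = 60     # GB/s per UHP interface
GB_PER_ASIC  = 512    # GB per memory ASIC

for hops in (1, 2, 3):
    capacity_tb = PORTS * GB_PER_ASIC * hops / 1024  # 4, 8, 12 TB
    print(f"{hops} hop(s): {capacity_tb:.0f} TB at {PORTS * GBS_PER_PORT} GB/s "
          f"(bandwidth limited by the CPU-side ports)")
```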

SSRLabs’ UHP versus HBM and Gen-Z

SSRLabs has developed a Universal Host Port (UHP) that is capable of connecting processors to memory, to coprocessors and to accelerators. It supports a coherency domain of any desired size. It is a truly universal and unified port, as it can be used as a source or target and as a master as well as a slave interface. It simplifies design and layout, as it is capable of polarity and lane reversal internally. Since our ASICs all contain internal switch fabrics, the UHP can be used to scale out, and no external switch fabric is needed. All ports of any of the ASICs we have developed can be connected in any fashion, without needing any external components. As these interfaces are intended to connect to devices outside of the substrate, the bandwidth per pin is comparable to HBM or HBM2, but substantially higher than DDR3/4 SSTL-2 based DRAM. Its latency is higher than HBM or HBM2, but again vastly lower than DDR3/4 SSTL-2 based DRAM. Compared to HBM with its nearly 1100 signal connections needed for 500 GB/s of bandwidth, SSRLabs’ UHP only needs 84 balls (16 lane pairs per direction plus clock, or a total of 68 signals) for 60 GB/s of full-duplex, bidirectional, fully queued traffic. 84 connections on a PCB do not pose a problem, but nearly 1100 signal connections for HBM or HBM2 do, and therefore all HBM or HBM2 is good for is essentially an L4 Cache on the same substrate that the CPU or GPU reside on. SSRLabs’ vlcRAM is a 512 GB memory, and, more importantly, SSRLabs’ UHPs can be trunked. At 68 signal pins or balls per 60 GB/s, we can trunk 8 of them and provide 480 GB/s of bandwidth between a processor and a coprocessor, and we still only need 544 signal balls. Processor or coprocessor connectivity to memory is limited to one UH port to and from each memory, as the four ports on the memory cannot be trunked; however, since the UH port consists of so few pins or balls, we can use four or eight memories and achieve a total memory bandwidth of 240 or 480 GB/s on 272 or 544 signal pins or balls.
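The pin-efficiency claim can be checked with the numbers just given. The following is a minimal Python sketch; signal counts only, power and ground excluded.

```python
# Bandwidth per signal pin: HBM (~1100 signals for ~500 GB/s) versus one
# UHP (68 signals for 60 GB/s), plus the trunked configurations above.
HBM_SIGNALS, HBM_GBS = 1100, 500
UHP_SIGNALS, UHP_GBS = 68, 60

print(f"HBM : {HBM_GBS / HBM_SIGNALS:.2f} GB/s per signal")  # ~0.45
print(f"UHP : {UHP_GBS / UHP_SIGNALS:.2f} GB/s per signal")  # ~0.88

for ports in (4, 8):
    print(f"{ports} x UHP: {ports * UHP_GBS} GB/s on {ports * UHP_SIGNALS} signal pins")

# 4 x UHP: 240 GB/s on 272 signal pins
# 8 x UHP: 480 GB/s on 544 signal pins
```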


Figure 7: UHP in a server

With a Universal Host Port, different types of memory can be used, and the CPU is agnostic to and independent of the memory type. If a more modern memory technology comes around (denser, faster, lower latency, higher bandwidth, lower power, or a combination of two or more of the above), no processor redesign is necessary. It also allows a combination of memory types to improve density while keeping performance high, and with extremely dense memory directly attached to the CPU, new operating systems and file systems might be possible. If the interface to memory is independent of the memory type, then all it transmits is really a command and data for writes, and a command only for reads. With that simple change, the memory can be enabled to reorder commands such that average latency and bandwidth are optimized, and the queue depth is configurable. Cache coherency is fully supported.
SSRLabs’ UHP is based on proven technology and does not require any special processes, nor does it require clock rates in excess of 1 GHz. In fact, working solutions of the UHP have been implemented on a standard 28 nm TSMC process. HBM and HBM2 need a much more advanced process  to work.

The difference is even more stark if we compare SSRLabs’ UHPs to Gen-Z. Other than XAUI, Gen-Z does not have a single proven or working PHY, and at this point in time the highest-bandwidth option is a proposed PCIe Gen4-derived PHY that skips FEC. Neither the PHY nor the protocol nor the encoding and scrambling have been proven to work. Even if it can be made to work, the current limitations are 25 Gbit/s per lane pair, and the limit of trunking at this point in time looks like it is set to 256 lane pairs. That boils down to 256 * 25 Gbit/s / (8 bit/B), or 800 GB/s. However, that would need 256 pairs per direction, or 1024 pins or balls, not counting clock, GND or anything else. Again, SSRLabs’ UHP can achieve 960 GB/s (20% more bandwidth than PCIe Gen4-based Gen-Z) on 1088 balls, with the flexibility to subdivide the ports into groups or even provide failover modes, and PCIe Gen4 does not do that.
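A minimal Python sketch of this comparison, using the lane and pin counts stated above; the 16-port UHP trunk is inferred from the 960 GB/s and 1088-ball figures.

```python
# Maximally trunked Gen-Z (256 lane pairs per direction at 25 Gbit/s)
# versus 16 trunked UHPs (60 GB/s and 68 signals each).
genz_gbs  = 256 * 25 / 8   # 800 GB/s
genz_pins = 256 * 2 * 2    # 1024 pins: pairs x 2 wires x 2 directions

uhp_gbs  = 16 * 60         # 960 GB/s
uhp_pins = 16 * 68         # 1088 balls

print(f"Gen-Z: {genz_gbs:.0f} GB/s on {genz_pins} pins ({genz_gbs / genz_pins:.2f} GB/s per pin)")
print(f"UHP  : {uhp_gbs:.0f} GB/s on {uhp_pins} pins ({uhp_gbs / uhp_pins:.2f} GB/s per pin), "
      f"{uhp_gbs / genz_gbs - 1:.0%} more bandwidth")
```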

SSRLabs’  UHP  uses  IPv6-derived addressing  for  a  total of  128 address  bits,  which  is  vastly more than Gen-Z with only 64 local address bits plus an additional 12 bits of global addresses. Gen-Z reintroduces segmented addressing that the industry just got rid of with the introduction of 64-bit processors, so clearly again this is a step in the wrong direction. SSRLabs’ UHP can set any group member as a member of a coherency domain and will include it in a ccNUMA scheme. Coherency domain members will be treated preferentially in the memory for all RMW operations, and all other operations can be locked.

UHP supports a very simple yet flexible command set consisting of read, write, read queued, write queued, and RMW or locked atomic cycles (lock is a tag that can be attached to any operation), plus a command word for internal operations. UHP provides full semaphore support. All writes are posted (“fire and forget”), and therefore to the processor there is no difference between our write and queued write, but internally in the memory they are processed differently. The queue depth can be set to any value between 1 and 256 if the application requires it, and for timing-critical operations a read or RMW can be tagged as non-queuable, so that it is processed in the order of arrival.
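A hypothetical sketch of this command set in Python follows; the names, fields and the request structure are illustrative only and are not taken from an SSRLabs specification.

```python
# Hypothetical sketch of the UHP command set as described above.
# Names, encodings and field layout are illustrative only; they are
# not taken from an SSRLabs specification.
from dataclasses import dataclass
from enum import Enum, auto

class UhpCommand(Enum):
    READ         = auto()
    WRITE        = auto()   # posted ("fire and forget")
    READ_QUEUED  = auto()   # memory may reorder
    WRITE_QUEUED = auto()   # posted; reordered inside the memory
    RMW          = auto()   # read-modify-write / atomic cycle
    INTERNAL     = auto()   # command word for internal operations

@dataclass
class UhpRequest:
    command: UhpCommand
    address: int                 # 128-bit, IPv6-derived address
    data: bytes = b""            # payload for writes and RMW only
    locked: bool = False         # lock tag, attachable to any operation
    non_queuable: bool = False   # force in-order processing of reads/RMW

def configure_queue_depth(depth: int) -> int:
    """Clamp-check the configurable queue depth stated in the text (1..256)."""
    if not 1 <= depth <= 256:
        raise ValueError("queue depth must be between 1 and 256")
    return depth
```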

SSRLabs’ UHP could replace DDR4 and DDR5 as well as Gen-Z and augment both HBM and HBM2, and as it puts the memory Controller onto the memory ASIC, it offers greatly improved flexibility to the CPU. We do understand that HBM and HBM2 are going to be used a fairly large L4 Caches, and that at first seems to make sense but it requires the HBM/HBM2 die to be on the same substrate as the CPU or GPU, thus increasing the power consumption and heat dissipation of that ASIC. Both power and heat density will increase and become less homogeneous, which makes cooling more difficult. We believe that our approach of spreading out the power consumption and heat dissipation over multiple chips makes more sense.