Fundamentals of Reliable Flash Storage - Understanding Performance - GSA

1. Abstract

This paper explains the different dimensions that are used to classify performance in flash storage. It will highlight the difference between crucial performance numbers and explain why these numbers should only be considered as a rough point of reference. The reasons for this are the various factors that influence read and write performance separately. These are explained throughout the paper.

2. Introduction

Performance of storage systems such as flash memory cards or Solid State Drives (SSDs) refers to the amount of data that can be transferred in a given amount of time. It is often assumed that performance can be expressed in a single number. Product flyers frequently state a performance figure that is valid only under optimal conditions. Understandably, when comparing two products, this one number is generally used as a reference. However, this is far from sufficient. When examining data transfers more closely, the difference between write and read transfers becomes obvious. Furthermore, the size of the chunks in which data is transferred and the randomness of the addressed allocation units (LBAs) are relevant factors. In addition, and unlike Hard-Disk Drives, SSDs vary in their performance over their lifetime. Consequently, they require dedicated performance measurement techniques along with an analysis of the specific use-cases and environmental conditions, such as temperature, to measure and compare the performance of one drive to another.

3. Write Speed

3.1 Sequential versus random write performance
The most common performance characteristics measured are sequential and random operations. If the performance is indicated with a single number, e.g. 550 MB/s, it usually refers to the maximum interface transfer rate or the sequential read performance, which is often the highest of all performance values. If a data sheet indicates two performance numbers, one for read and one for write, these usually refer to the sequential read and sequential write performance. Sequential operations access locations on the storage device in a contiguous manner and are generally associated with large data transfer sizes (e.g. 128 kB or more), while random operations access locations on the storage device in a non-contiguous manner and are generally associated with small data transfer sizes (e.g. 4 kB). Fig. 1 visualizes the access patterns.

Fig. 1: Sequential vs. random write access

In the case of sequential access, the data throughput is much higher. The reason for this is that a lot of data is moved with very few commands. Therefore, the overhead in the flash memory controller is rather small. In the case of random access, on the other hand, a high number of commands each transfer only a small amount of data. This results in a lot of overhead, i.e. management of the incoming commands, queuing and other tasks, which in turn leads to a lower data throughput.
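
The effect of per-command overhead can be illustrated with a simple model in which every command pays a fixed handling cost before its payload is moved at the raw transfer rate. This is only a sketch: the function name `throughput_mb_s`, the 20 µs overhead and the 550 MB/s raw rate are assumed values chosen for illustration, not measurements of any particular controller.

```python
def throughput_mb_s(transfer_kb: float, overhead_us: float, bandwidth_mb_s: float) -> float:
    """Effective throughput when every command pays a fixed overhead."""
    size_mb = transfer_kb / 1000.0             # kB -> MB (decimal units)
    transfer_s = size_mb / bandwidth_mb_s      # time spent moving the payload
    total_s = overhead_us / 1e6 + transfer_s   # fixed cost plus payload time
    return size_mb / total_s


# Assumed values, for illustration only.
OVERHEAD_US = 20.0
BANDWIDTH_MB_S = 550.0

sequential = throughput_mb_s(128, OVERHEAD_US, BANDWIDTH_MB_S)
random_4k = throughput_mb_s(4, OVERHEAD_US, BANDWIDTH_MB_S)
print(f"128 kB commands: {sequential:.0f} MB/s")
print(f"  4 kB commands: {random_4k:.0f} MB/s")
```

With the same fixed cost per command, the small transfers spend most of their time on overhead, which is why their throughput falls well below that of the large transfers in this model.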

3.2 Fresh out of the box
A first misconception about performance is that it is invariant over time. The assumption is that repeated measurements, even over a longer period of time, lead to the same results as long as the measurement is executed in the same manner. For flash memory, this is not the case. In fact, the performance of SSDs can vary over time due to fragmentation and the need for maintenance operations such as garbage collection or wear-leveling. The drives are sensitive to how much of their capacity is used and to their write history. The so-called “fresh out of the box” scenario describes the state at first use, where all memory cells are unused and in an erased state. Since cells have to be erased before they can be programmed, “fresh out of the box” is the case in which incoming data can be written to the flash immediately. Consequently, the write performance is at its highest for both sequential and random write scenarios. Yet this level of write performance will only be achieved for a small fraction of the lifetime of a flash memory product, usually the first few hours or the first day of use. Therefore, it is a poor indication of the real performance of a flash memory drive. Once a certain amount of data has been written onto the drive, the performance drops, and the drive never regains the initial write performance during its lifetime.

The reason for this behavior lies within the architecture of the flash memory. Memory cells are arranged in strings, which are then grouped into pages, which make up a block. While data is written in units of one or more pages, it can only be erased in units of a whole block. Furthermore, flash memory must be erased before it can be re-programmed. During the vast majority of the lifetime of a drive, when data is re-written (meaning that data has already been written to this logical location before), the original data is marked as “invalid” and the new data is written to a different physical location. After all blocks have been written to once, the drive must read the still-valid data surrounding the invalid data and move it to another location, where it is re-ordered and consolidated. The blocks holding the old, invalid data are then erased and new data is written to the newly freed-up blocks. This process is called “garbage collection”: an ongoing activity of re-organizing the stored data that causes large volumes of traffic on the flash bus and limits the performance of writing new data from the host. This also explains why, unlike for hard disks, the sequential or random nature of the writes to an SSD affects future performance. Sequential writes generally leave a few large areas of free space, which results in less effort for the garbage collection. Random writes generally leave many small areas of free space, which results in a higher effort for the garbage collection.
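
The extra flash traffic caused by garbage collection can be made visible with a toy simulation. The sketch below is a deliberately simplified model, not the algorithm of any real controller: it assumes a page-mapped drive, a "greedy" policy that always reclaims the block with the fewest valid pages, and arbitrary sizes (64 blocks of 32 pages, 8 of them spare). The name `write_amplification` and all parameters are hypothetical. It reports how many pages are programmed in flash per page written by the host.

```python
import random


def write_amplification(pattern, blocks=64, pages=32, spare_blocks=8,
                        host_writes=50_000, seed=1):
    """Toy page-mapped flash model; returns flash page writes per host page write."""
    rng = random.Random(seed)
    logical_pages = (blocks - spare_blocks) * pages
    content = [[None] * pages for _ in range(blocks)]  # logical page held by each physical page
    valid = [0] * blocks                               # valid pages per block
    location = {}                                      # logical page -> (block, page)
    free = list(range(1, blocks))                      # erased blocks; block 0 starts as open block
    state = {"open": 0, "next": 0, "flash_writes": 0}

    def program(lpn):
        if state["next"] == pages:                     # open block is full: take an erased one
            state["open"], state["next"] = free.pop(), 0
        old = location.get(lpn)
        if old is not None:                            # mark the previous copy as invalid
            content[old[0]][old[1]] = None
            valid[old[0]] -= 1
        blk, pg = state["open"], state["next"]
        content[blk][pg] = lpn
        valid[blk] += 1
        location[lpn] = (blk, pg)
        state["next"] += 1
        state["flash_writes"] += 1

    def collect():
        # Greedy choice (an assumption): reclaim the full block with the fewest valid pages.
        full = [b for b in range(blocks) if b != state["open"] and b not in free]
        victim = min(full, key=lambda b: valid[b])
        for lpn in [l for l in content[victim] if l is not None]:
            program(lpn)                               # relocate data that is still valid
        content[victim] = [None] * pages               # erase the whole block
        valid[victim] = 0
        free.append(victim)

    for i in range(host_writes):
        while len(free) < 2:                           # keep one erased block in reserve
            collect()
        if pattern == "sequential":
            lpn = i % logical_pages
        else:
            lpn = rng.randrange(logical_pages)
        program(lpn)
    return state["flash_writes"] / host_writes


seq_wa = write_amplification("sequential")
rand_wa = write_amplification("random")
print(f"sequential writes: {seq_wa:.2f} flash writes per host write")
print(f"random writes:     {rand_wa:.2f} flash writes per host write")
```

In this model, sequential overwrites invalidate whole blocks, so reclaimed blocks contain nothing that has to be moved. Random overwrites scatter invalid pages across many blocks, so every reclaimed block still holds valid data that must be copied first.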

The variation between “fresh out of the box” and steady-state performance reveals that performance-testing for short periods of time will almost certainly not disclose the real performance over the lifetime. The meaningful measurement of the steady state performance requires a certain preconditioning that can be achieved by making sure that each page of the flash has been written at least once before performing the test.
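
Whether preconditioning has brought a drive into steady state can be checked by watching successive throughput measurements until they stop changing. The following is a minimal sketch of such a check. The window of five rounds and the 20% tolerance are assumed here for illustration, the function name `is_steady` is hypothetical, and the throughput values are invented sample data rather than measurements.

```python
def is_steady(samples, window=5, tolerance=0.20):
    """True when the last `window` samples vary by no more than `tolerance` of their mean."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    mean = sum(recent) / window
    return (max(recent) - min(recent)) <= tolerance * mean


# Hypothetical write throughput per test round in MB/s (not measured data).
rounds = [480, 470, 310, 220, 150, 120, 112, 108, 110, 109, 111]

first_steady = next(i for i in range(1, len(rounds) + 1) if is_steady(rounds[:i]))
print(f"steady state reached after round {first_steady}")
```

Only measurements taken after this point would be reported as the steady-state performance of the drive.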

3.3 SLC Cache
NAND flash memory stores data in memory cells, which are made of floating-gate transistors. In the simplest case, each cell has two possible states, so one bit of data is stored in each cell; this is called single-level cell (SLC) flash memory. A multi-level cell (MLC) is a memory element capable of storing more than a single bit of information, usually two. Triple-level cells (TLC) and quad-level cells (QLC) are memory cells that can store 3 and 4 bits per cell, respectively.

With increasing bits per memory cell, write speeds to the cells significantly decrease because the additional bits require more signal processing and error correction during writing (programming). The decrease in write speed to the cells results in a decrease in overall write performance of storage systems using flash memory cells with higher per-cell storage capacity.

To overcome this, a high-performance write buffer (cache) can be created within the flash memory that simulates high-performance SLC. This is done by specifically configuring an area of cells in MLC, TLC or QLC flash to only store one bit of data. This is called pseudo-SLC (pSLC) mode and immensely increases write speed to these cells. During write operations, data from the host system is first written to the high-performance buffer at accelerated speeds and then during the idle periods, the data is relocated from the buffer to the other (slower) memory cells. This mechanism notably improves the perceived speed of a flash memory storage device, since an amount of data up to the size of the buffer can be transferred at a very high speed. A typical buffer size is shown in Table 1 – it varies based on the capacity of the SSD, increasing with larger capacities.

Table 1: Typical Buffer Sizes

Under consecutive write operations with no idle time, the buffer will eventually fill up. The buffer size, therefore, determines the maximum duration of continuous write operations at accelerated speeds. Once the buffer is full, the transfer continues, but the data is written directly to the slower memory cells not operating in pSLC mode. Depending on the technology of the cells and many other factors, the write speed of a whole SSD can plunge from 400 or 500 MB/s down to below 100 MB/s. The drive will continue to write at this speed until the sequential transfer is finished and it has had time to clear the buffer by relocating all data to the other storage cells. This usually takes around 30 to 60 seconds depending on the buffer size. The numbers in Table 2 show that a measurement of 100 MB/s is still not the lowest. Certain consumer SSDs with QLC technology show sustained sequential write performance of only around 80 MB/s.

Table 2: Specification of one of the first consumer SSDs with QLC NAND flash memory

In Fig. 2 a measurement of a sequential write transfer to an SSD can be seen. For the first few seconds, the write speed is around 190 MB/s. This could be a result of shutting down the drive with a full cache. Performance increases to around 480 MB/s after the cache is emptied. After roughly 24 s the pSLC buffer is full again and the performance drops to below 400 MB/s. While this is not poor performance in general, it does highlight the issue that the promoted write speed of over 500 MB/s is neither sustained over a longer period of time nor achieved on average. The performance is also clearly not deterministic.

Fig. 2: Write speed over time for an SSD attached with a SATA interface.

The behavior of a very large solid state drive with a considerably larger pSLC buffer can be seen in Fig. 3. The device sustains a write speed of approximately 480 MB/s for almost 550 seconds before it briefly plunges to 90 MB/s.

Fig. 3: Write Speed over time for an SSD attached with a SATA interface.
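
The relationship between buffer size, accelerated write duration and average write speed is simple arithmetic, sketched below. The model assumes a single uninterrupted transfer in which the buffer is never drained, and only two speed levels. The 480 MB/s and 90 MB/s values are borrowed from the measurements described above, whereas the 12 GB buffer, the 50 GB transfer and the function names are assumptions made for the example.

```python
def implied_buffer_gb(fast_mb_s: float, seconds: float) -> float:
    """Amount of data written while the drive stays at the accelerated speed."""
    return fast_mb_s * seconds / 1000.0


def write_time_s(total_gb: float, buffer_gb: float, fast_mb_s: float, slow_mb_s: float) -> float:
    """Duration of one continuous transfer; the buffer is assumed never to be drained."""
    fast_gb = min(total_gb, buffer_gb)     # part absorbed by the pSLC buffer
    slow_gb = total_gb - fast_gb           # remainder goes to the slower cells
    return fast_gb * 1000.0 / fast_mb_s + slow_gb * 1000.0 / slow_mb_s


def average_mb_s(total_gb: float, buffer_gb: float, fast_mb_s: float, slow_mb_s: float) -> float:
    """Average speed over the whole transfer."""
    return total_gb * 1000.0 / write_time_s(total_gb, buffer_gb, fast_mb_s, slow_mb_s)


# About 24 s at 480 MB/s corresponds to roughly 11.5 GB written at full speed.
print(f"implied buffer: {implied_buffer_gb(480, 24):.2f} GB")

# Assumed example: 50 GB written to a drive with a 12 GB buffer.
print(f"average speed:  {average_mb_s(50, 12, 480, 90):.0f} MB/s")
```

A transfer that fits into the buffer runs at the accelerated speed throughout, while the average of a much larger transfer approaches the slow speed.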

3.4 Thermal throttling
Performance can also depend on temperature. The temperature of the chip inside the package is determined by two factors: the ambient temperature and the heat produced by the chip itself, which is mainly due to switching activity, i.e. the chip being active. Since the chip might be damaged by high temperatures, almost all of the latest flash memory controllers have an internal sensor that measures the temperature. If it rises above a defined threshold, the chip reduces its performance so that it does not heat up further and risk damage. This is called thermal throttling. While almost all recent flash memory controllers have such functionality, the threshold differs from controller to controller. Depending on the power consumption and therefore the heat dissipation of the chip itself, the reduced activity after reaching the temperature threshold results in different performance levels.

In Fig. 4 the behavior of a competitor SSD can be seen. The drive is continuously filled with data in a sustained sequential write transfer. The internal temperature measurement reads 28 °C in the beginning. After a transfer time of a little more than 8 minutes, a temperature of 72 °C is reached, which triggers the thermal throttling. As a result, the write speed frequently drops to as low as 50 MB/s. On average, the write speed is close to 250 MB/s over the whole period in which the drive's internal temperature is 72 °C. While this already indicates a significant drop in performance at this temperature, there are other drives on which the impact of thermal throttling is even worse. In Fig. 5 the measurement of another drive is depicted. The initial write speed is around 420 MB/s. It can be sustained for about 13 minutes, after which the internal temperature reaches 67 °C and the write speed decreases to roughly 105 MB/s.

Fig. 4: Write speed over time of a competitor SSD (SATA) in a scenario of sustained sequential write.

Fig. 5: Write speed over time of a competitor SSD (SATA) in a scenario of sustained sequential write.

When looking at the graphs, it is a valid argument to say that a sequential write transfer of dozens of gigabytes is not a typical workload. Nevertheless, the issue with temperature is still the same. In this test, the drive was connected and then tested immediately. This means that the drive was at room temperature before beginning this test. If it had been running in an industrial system for hours, days or months, the internal temperature would have risen to 40 or 50 °C before starting the test. Considering the industrial temperature range from -40 °C to +85 °C it becomes clear that both drives would only deliver a fraction of their advertised performance in the upper operational temperature range from above 72 °C to 85 °C.
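
How much earlier throttling sets in when the drive is already warm can be estimated with a crude linear extrapolation. The sketch below takes the figures reported for the first drive (28 °C to 72 °C in roughly 8 minutes) and assumes a constant average heating rate, which a real drive will not follow exactly. The function name `minutes_to_throttle` is hypothetical.

```python
def minutes_to_throttle(start_c: float, threshold_c: float, rate_c_per_min: float) -> float:
    """Minutes of sustained writing until the threshold is reached (linear approximation)."""
    if start_c >= threshold_c:
        return 0.0                           # already throttling at the start
    return (threshold_c - start_c) / rate_c_per_min


# Average heating rate derived from 28 °C -> 72 °C in about 8 minutes.
RATE_C_PER_MIN = (72 - 28) / 8.0

for start in (28, 40, 50, 75):
    minutes = minutes_to_throttle(start, 72, RATE_C_PER_MIN)
    print(f"start at {start} °C: throttling after about {minutes:.1f} min")
```

Under this approximation, a drive that starts the same workload at 50 °C reaches the threshold in about half the time, and one operating above the threshold is throttled from the first second.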

4. Read Speed

NAND flash can be read much faster than it can be written to, almost independent of technology. The write performance depends on program times of different flash technologies, caches, garbage collection and basic readiness of the flash to be written to. Read performance on the other hand mainly depends on the ability of the controller or flash translation layer to identify the data location and on the quality level of the data representation in the flash arrays.

4.1 Sequential and random read performance
Somewhat analogous to sequential and random write transfers, random reads require additional effort from the flash translation layer to locate the data. When the host system requests the data at a certain address, the memory controller has to look up in its mapping table where this address is physically stored in the flash memory. In the case of sequential reading, the next address that will be requested is already known and its physical location can be determined in advance. Another factor is the lookup of the physical location in the mapping table itself. The table is organized as a tree, starting at the trunk. Data with similar addresses is located on the same branch. Searching for further locations on the same branch is a lot faster than searching for completely different addresses, for which it is necessary to start at the trunk again. In addition, for sequential operations, techniques like prefetching the data of the next address can be used.
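
The advantage of staying on one branch of the mapping tree can be sketched with a two-level table that remembers the branch it visited last. This is an illustrative model only: the class `MappingTable`, the branch size of 1024 entries and the identity mapping used to fill the table are assumptions, and real flash translation layers are considerably more elaborate.

```python
import random

ENTRIES_PER_LEAF = 1024  # assumed number of addresses covered by one branch


class MappingTable:
    """Two-level logical-to-physical map that caches the most recently used branch."""

    def __init__(self):
        self.root = {}            # trunk: branch index -> branch (dict of entries)
        self.cached_index = None
        self.cached_leaf = None
        self.trunk_walks = 0      # how often a lookup had to start at the trunk

    def _leaf(self, lba):
        index = lba // ENTRIES_PER_LEAF
        if index != self.cached_index:          # different branch: walk from the trunk
            self.trunk_walks += 1
            self.cached_leaf = self.root.setdefault(index, {})
            self.cached_index = index
        return self.cached_leaf

    def map(self, lba, physical):
        self._leaf(lba)[lba % ENTRIES_PER_LEAF] = physical

    def lookup(self, lba):
        return self._leaf(lba).get(lba % ENTRIES_PER_LEAF)


def count_walks(table, lbas):
    """Number of trunk walks needed to look up the given addresses in order."""
    table.trunk_walks, table.cached_index = 0, None
    for lba in lbas:
        table.lookup(lba)
    return table.trunk_walks


table = MappingTable()
addresses = list(range(16 * ENTRIES_PER_LEAF))
for lba in addresses:
    table.map(lba, lba)           # identity mapping, for demonstration only

sequential_walks = count_walks(table, addresses)
shuffled = addresses[:]
random.Random(0).shuffle(shuffled)
random_walks = count_walks(table, shuffled)
print(f"sequential reads: {sequential_walks} trunk walks")
print(f"random reads:     {random_walks} trunk walks")
```

Reading the addresses in order only leaves the current branch once per 1024 lookups, whereas the shuffled order forces a fresh walk from the trunk for almost every request.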

4.2 Sequential and random preconditioning
If data has been written sequentially and is read in the same sequence, the locating effort is minimal. Hence this is the best case for read performance. In both mixed cases, i.e. data written randomly but read sequentially, and data written sequentially but read randomly, the system shows much lower read performance than in the purely sequential case.

4.3 Cross-temperature read performance
When data is read at a significantly different temperature than the one at which it was written, the likelihood of bit errors and the need to calibrate read voltage levels increase. As a result, the read performance deteriorates dramatically.

4.4 Performance over lifetime
Very much like the write performance, read performance changes over the lifetime of the flash memory. The major reason is the errors that occur during the read-out of the flash and have to be corrected by the flash memory controller. Naturally, the number of errors increases as the memory approaches its end of life. In the beginning, when there are few errors, the data can be transferred without any further processing. As time progresses, the data from the memory contains more errors, which are corrected by the error correction module. The necessary calculations for the correction take time and introduce latency. As the number of errors increases, the complexity of the correction increases, as does the time needed.

At some point, the error correction is no longer capable of correcting all errors in the data. In this case, most controllers perform a so-called “read-retry”. As the name implies, the data is read again from the flash. This has a significant impact on performance. In the optimal case, data is read and can be transferred directly without any further processing. In the read-retry case, the data has to be read, processed by the error correction for the maximum processing time, then read again and probably processed for a rather long time once more, since after a read-retry the number of errors is likely to be close to the limit of what can be corrected. In total, the read process for a single block of data takes about four times as long in this case, which in turn reduces the read performance for that block of data to 25%.

In the latest generations of 3D flash, where errors are more frequent than in older generations, the error correction uses a mechanism called soft-decoding for data that cannot be corrected otherwise. Instead of reading data just as zeros and ones, it also reads out the likelihood of each bit being a zero or a one to aid in the correction process. Soft-decoding can take up to ten times as long as a normal read process. Consequently, even a single soft-decoding process heavily impacts the overall read performance.
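
The combined effect of read-retries and soft-decoding on average read performance can be estimated from the share of reads that need them. The sketch below uses the cost factors given in the text (about four times a normal read for a read-retry, up to ten times for soft-decoding). The function name `relative_read_performance` is hypothetical and the example shares of 5% and 1% are assumed values, not measurements.

```python
def relative_read_performance(p_retry: float, p_soft: float,
                              retry_cost: float = 4.0, soft_cost: float = 10.0) -> float:
    """Read throughput relative to error-free reads (1.0 means no slowdown)."""
    p_normal = 1.0 - p_retry - p_soft
    if p_normal < 0 or p_retry < 0 or p_soft < 0:
        raise ValueError("shares must be non-negative and sum to at most 1")
    # Mean time per read, measured in units of one normal read.
    mean_time = p_normal * 1.0 + p_retry * retry_cost + p_soft * soft_cost
    return 1.0 / mean_time


# Every read needs a read-retry: performance drops to 25%.
print(f"all reads retried:      {relative_read_performance(1.0, 0.0):.0%}")

# Assumed example: 5% of reads need a retry, 1% need soft-decoding.
print(f"5% retry, 1% soft-dec.: {relative_read_performance(0.05, 0.01):.0%}")
```

Even small shares of slow reads noticeably lower the average, because each of them occupies the time of several normal reads.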

4.5 Read performance and reliability
Another factor that influences read performance is read disturb management. During the read process of flash memory cells, neighboring cells are “disturbed”. Reading one cell frequently will at some point compromise the data in surrounding cells. To prevent this, read-counters keep track of the number of times each cell is read. When one cell is read continuously over a long period of time, the read-counters have to be written to the flash. As the flash memory cannot be read and written to at the same time, this briefly interrupts the prolonged read and therefore slightly reduces performance. This is not a mandatory process: to reach maximum performance, it is possible to skip it entirely. This, however, highlights that maximum performance can sometimes come at the cost of reduced reliability and safety of the data.
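
The trade-off between persisting read-counters and uninterrupted reading can be sketched as follows. This is a hypothetical model: the class `ReadDisturbTracker`, the per-block granularity of the counters and the flush interval of 1000 reads are assumptions for illustration, and an interruption is only counted rather than actually performed.

```python
from collections import Counter


class ReadDisturbTracker:
    """Counts reads per block and periodically 'persists' the counters."""

    def __init__(self, flush_every: int, persist: bool = True):
        self.counts = Counter()          # reads seen per block
        self.flush_every = flush_every   # reads between two counter write-outs (assumed)
        self.persist = persist           # False models skipping the mechanism entirely
        self.reads_since_flush = 0
        self.interruptions = 0           # how often the read stream had to pause

    def read(self, block: int) -> None:
        self.counts[block] += 1
        self.reads_since_flush += 1
        if self.persist and self.reads_since_flush >= self.flush_every:
            self.interruptions += 1      # stands in for writing the counters to flash
            self.reads_since_flush = 0


safe = ReadDisturbTracker(flush_every=1000)
fast = ReadDisturbTracker(flush_every=1000, persist=False)
for _ in range(10_000):                  # prolonged read of the same block
    safe.read(block=7)
    fast.read(block=7)

print(f"with persisted counters:    {safe.interruptions} interruptions")
print(f"without persisted counters: {fast.interruptions} interruptions")
```

The variant without persisted counters is never interrupted, but it would lose its record of how often each block has been read if power were removed.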

5. Conclusion

Measuring and assessing the performance of SSDs involves a considerable level of complexity. This paper explained the difference between the “fresh out of the box” performance and the steady-state performance, and that only the latter should be considered an indicator of what to expect from an SSD. In addition, it discussed the lower write performance of MLC, TLC and QLC NAND flash and how this can be ameliorated to some extent by using a pSLC cache. The limitations of such an approach in the case of sustained write transfers were outlined. Finally, the potentially strong effects of thermal throttling on performance were examined.

In conclusion, all these aspects contribute to the fact that performance and its measurement are a complex subject, and that it is essential to understand exactly what has been measured under which conditions to be able to compare different drives. Many of the figures given in advertisements and data sheets were obtained under conditions far from reality and for use-cases that are very different from the target application.