Database Fatal Flash Flaws No One Talks About

Flash storage has been touted for several years as the answer to inadequate database performance, and there are some very good reasons why. Flash storage is generally much faster than hard disk drives (HDDs): it has much lower latency and far greater read IOPS, write IOPS, and throughput. Therefore it must fix database performance problems, right? Not so fast. Sometimes it does and sometimes it does not. Several flash issues, such as steady-state performance, write amplification, wear life, and the high bit-error rates caused by electron leakage, can sabotage the performance gains while overwhelming the budget. These issues (and several others) are rarely discussed, except after the fact when performance no longer meets requirements.

This article examines these issues in detail, looking at how they specifically affect databases and the structured applications that depend on them, and discusses the pros and cons of common workarounds.

SQL Database Performance Issues

SQL databases are typically IOPS (input/output operations per second) intensive, meaning they require the compute infrastructure to deliver a high number of IOPS to provide a consistently satisfying user experience. SQL database IOPS requirements are rapidly increasing because of the proliferation of applications dependent on those databases. Those IOPS can be delivered via memory or storage. Memory (DRAM) is a very expensive option, although justifiable in some cases. Storage is much less expensive but delivers far fewer IOPS per device, requiring many more of them to meet modern database requirements. This puts high stress on the storage infrastructure.

How DBAs Frequently Respond

Database administrators (DBAs) have been endeavoring to cope with IOPS performance growth for years by delving into their bag of software performance enhancement tricks. Most will try to fix an IOPS performance problem before adding or changing their storage hardware.

Tuning

One of the first things they attempt is HDD I/O tuning: optimally sizing SQL database files and then placing them on different LUNs and RAID groups. Doing this delivers the maximum hard disk drive subsystem throughput, but it requires the DBA to possess extensive SQL database as well as storage subsystem knowledge. Next up is application tuning, which is another way of saying recoding the SQL. Application tuning is a time-consuming, labor-intensive task. A common DBA tuning trick is to pin certain data into memory or cache, which reduces trips to the disk subsystem and decreases query latencies. Another tool is memory tuning, or “right sizing” the SQL database buffers, although this one tends not to be used as often because of its hit-or-miss nature. Right sizing SQL database buffers is as much art as science; it comes from experience and DBA intuition based on “wait” events, buffer hit ratios, system swapping, and paging.
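
One of the metrics behind that intuition is the buffer cache hit ratio. The sketch below shows the basic arithmetic; the counter names, values, and the roughly 95% rule of thumb are illustrative assumptions rather than any particular database engine's specifics.

```python
# Rough sketch: estimating a buffer cache hit ratio from two counters.
# Counter names, values, and the ~95% threshold are illustrative, not
# specific to any particular database engine.

def buffer_cache_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Fraction of page requests served from the buffer cache rather than disk."""
    if logical_reads == 0:
        return 1.0
    return (logical_reads - physical_reads) / logical_reads

ratio = buffer_cache_hit_ratio(logical_reads=1_000_000, physical_reads=80_000)
print(f"hit ratio: {ratio:.1%}")  # 92.0% -- below a ~95% rule of thumb,
# suggesting the buffers may be undersized or the working set has grown
```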

Each of these tools depends on the DBA’s inherent expertise. All of them take effort, labor, and time, and far too frequently these tuning tricks simply do not achieve the desired performance gains. When tuning fails to address SQL database performance issues, the next step is to spend money: the DBA can scale up the database, scale out the database, or buy higher-IOPS flash storage.

Scale up

Scaling up is the process of moving the SQL database to a bigger, more powerful server with 64, 96, 128, or 256 processors and more memory, I/O bandwidth, PCIe slots, etc. It’s simple but carries a high cost in both capital and operating expenditures. Besides more expensive hardware, there are additional software costs: SQL database licensing escalates rapidly because it’s based on cores. Scaling up also assumes processing is the performance bottleneck, and that’s rarely the case.
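
The licensing escalation is easy to see with a back-of-the-envelope calculation; the per-core price below is a placeholder, not any vendor's actual figure.

```python
# Illustrative only: per-core database licensing cost when scaling up.
# The $7,000-per-core figure is a placeholder, not a real price list.
price_per_core = 7_000
for cores in (16, 64, 128, 256):
    print(f"{cores:>3} cores -> ${cores * price_per_core:,} in database licenses")
```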

Scale out

Scaling out is the next approach and has two distinct options. The first option is based on spreading queries across multiple unique database instances; it requires more instances of the current database, with a twist. There are several variations of this technique, including:

  • Sharding (the process of dividing database data along a specific application boundary among multiple database instances);
  • Read only slaves (writes are dedicated to a master database then replicated or mirrored to the read slaves where all reads are routed);
  • Peer-to-peer replication (only used when database updates are infrequent, maintains multiple database copies where replication is required to update the other database engines when a change has been made);
  • Linked servers and distributed queries (local database queries remote databases as if they were part of the local database);
  • Distributed partitioned views (ideal for update intensive apps, the primary SQL database table data is partitioned among tables in numerous distributed databases based on a partitioning key);
  • Data dependent routing (partitions the data to multiple databases while placing responsibility on the application or middleware to route the queries to the correct database).

Spreading queries across multiple unique database instances is not for amateurs; it requires wide-ranging DBA knowledge, skill, and expertise, as well as continuous adjustments and tuning. The underlying assumption is that the database performance problem is the result of database size or scale. Sometimes that is the case, and when it is, it can often be solved by spreading queries across multiple unique database instances, leveraging the additional compute nodes and database instances.
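
To make the last variation above concrete, the sketch below shows one common way data dependent routing is implemented: hash the partitioning key and route the query to the database instance that owns it. The connection strings, shard count, and modulo-hash scheme are hypothetical; real deployments must also handle resharding, hot keys, and cross-shard queries.

```python
# Minimal sketch of data dependent routing: the application (or middleware)
# picks the database instance that owns a given partitioning key.
# Connection strings and shard count are hypothetical.
import hashlib

SHARDS = [
    "postgresql://db-shard-0.example.internal/orders",
    "postgresql://db-shard-1.example.internal/orders",
    "postgresql://db-shard-2.example.internal/orders",
    "postgresql://db-shard-3.example.internal/orders",
]

def shard_for(partition_key: str) -> str:
    """Return the connection string of the shard that owns this key."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("customer-42"))  # every query for this key routes to the same shard
```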

Option two is the implementation of a clustered or distributed database. Both utilize multiple commodity servers. Clustered SQL databases are a shared-everything architecture that creates a bigger database through clustering. Distributed SQL databases are a shared-nothing architecture built on grid technology (the underpinnings of cloud technology) for elasticity and scalability. They tend to require less DBA expertise and budget than multiple instances of standard SQL databases; however, they require ripping out and replacing those traditional databases. And once again, the underlying assumption is that the performance bottleneck is insufficient compute power. When that’s the case, clustered or distributed SQL databases can resolve the performance issue.

All of the DBA options discussed up to this point are manual and labor-intensive. They’re time consuming, often frustrating, and don’t always solve the problem. This is why so many DBAs have turned to flash storage as the answer to their performance issues.

Why Flash Storage

Flash storage (a.k.a. NAND) drives produce thousands to millions of IOPS in a small package. Flash storage is fast, producing as much as three orders of magnitude more random IOPS than enterprise HDDs. It also requires much less power and cooling (reducing operating expenses) than HDDs because there are no moving parts, does not require an entire drive rebuild when a block fails, and does not fragment as HDDs do, eliminating defragmentation requirements. But it’s those incredible IOPS that make flash storage of all kinds so attractive to the DBA. Just as importantly, flash storage does not preclude the use of any of the other performance-enhancing efforts just discussed; flash storage is in fact SQL database methodology agnostic. It can, and for the most part does, instantly increase SQL database IOPS.

SQL database flash storage is implemented either as host flash storage drives (DDR3 DIMMs, PCIe cards, or SAS/SATA solid state drives (SSDs)) or as a flash storage system connected via SAN block protocols (iSCSI and/or Fibre Channel) or NAS file protocols (NFS or SMB). Flash storage systems come as hybrid storage systems (a mix of SSD and HDD) and all-flash arrays (AFAs). Both flash storage drives and flash storage systems have their pros and cons.

General Host-Based Flash Storage Drives Pros and Cons

The host-based flash storage drive SQL database “pros” include:

  • Lowest SQL database to flash storage latency;
  • Simple implementation and operation;
  • Relatively low $/IOPS.

The host-based flash storage drive SQL database “cons” include:

  • Limited scalability;
  • Limited ECC;
  • Limited DRAM;
  • Variable NAND chip quality;
  • High cost in $/GB of capacity and high TCO;
  • Caching* often limited to write-through (read) caching;
  • Write-back caching that creates shorter wear life and greater write amplification;
  • Difficulty or inability to pin SQL database indexes and/or hot data into the flash storage cache;
  • Flash storage availability, resilience, continuity, and data protection that are nominal at best.

In addition, flash storage in one host is not addressable from another host without additional, costly licensed software that clusters the hosts and their storage.

General Flash Storage Hybrid System Pros and Cons

Flash storage hybrid system SQL database “pros” include:

  • High-performance IOPS for active data combined with lower-cost storage of inactive or cold data (Dragon Slayer Consulting surveys put inactive data at more than 90% of all IT organizational data);
  • Inline deduplication and/or compression that increases both flash and HDD usable capacities with nominal impact on latency and performance, and can even improve performance by keeping more data in cache;
  • Excellent total usable capacities;
  • Exceptional data protection and resilience (zero-space snapshots, clones, replication, and ECC);
  • Relatively low overall $/IOPS;
  • Relatively low cost in $/GB.

Flash storage hybrid system SQL database “cons” include:

  • Inline dedupe may have a noticeable negative impact on SQL database latency and performance;
  • Several hybrids have somewhat constrained usable flash capacities;
  • Hybrids that do flash caching ordinarily use only write-through (read) caching;
  • It may not be possible to pin SQL database indexes and/or hot data into the flash storage cache;
  • Storage tiering forces all writes to start on flash, increasing write amplification and shortening wear life (the same issue as write caching);
  • $/IOPS is lower than HDD systems but higher than AFAs;
  • $/GB is lower than AFAs but higher than all-HDD systems.

General All Flash Array (AFA) System Pros and Cons

AFA system SQL database “pros” include:

  • Large amounts of total storage system IOPS;
  • Inline deduplication and compression that increase usable flash capacity with nominal impact on latency and performance;
  • Fair to quite good total usable flash capacities;
  • Exceptional data protection and resilience (zero-space snapshots, clones, replication, and ECC);
  • Best overall $/IOPS;
  • Acceptable and improving cost in $/GB, especially when dedupe and compression are taken into account and capacity is measured as usable capacity.

AFA system SQL database “cons” include:

  • Inline dedupe may have a noticeable negative impact on SQL database latency and performance;
  • Maximum usable flash capacity/scalability lags behind hybrid and HDD storage systems;
  • Write amplification tends to be higher than in hybrids that use write-through cache, creating shorter wear life (the same issue as write caching or storage tiering);
  • $/GB is currently higher than all other storage systems (though the gap is shrinking).

Flash Storage SQL Database Surprises

DBAs find that while flash storage of all kinds is initially an IOPS performance godsend, that performance frustratingly begins to decline almost immediately, degenerating fairly rapidly until it levels off at a much lower level. Test performance ends up higher than production performance. This is the result of planning against first-out-of-box performance instead of steady-state performance. Anyone with a flash-based laptop experiences this. That issue, along with several other flash storage performance and data corruption issues stemming from electron leakage, read disturb, and cell failures, will reduce SQL database performance as well as stability if not properly handled.

Database Fatal Flash Flaws

Most non-storage experts have a lot of misconceptions about flash storage. How data is programmed (written), how it’s erased, and even the issues caused by excessive reads come as an annoying surprise to most DBAs. Flash caching is yet another area of confusion. Understanding the flash issues that cause DBAs heartburn requires a brief level set on how flash storage actually works.

How Flash Storage Actually Writes and Reads Data

Flash storage is a non-volatile storage medium: data can be written to it (programmed) and erased from it. The two different types are NOR and NAND; for price/performance reasons, the vast majority of commercial flash storage today is NAND. Data storage in general is recorded as binary zeros and ones. For HDDs it’s based on magnetic polarity, plus or minus. Flash storage is different, and non-intuitive, in how it captures and changes data: it’s electrical storage, not magnetic storage. Flash states are based on voltage levels within a cell, measured as bits per cell. One bit per cell is called single level cell (SLC). Two bits per cell is called multi level cell (MLC). Three bits per cell is called triple level cell (TLC). SLC has two unique states per cell (0, 1), MLC has four unique states per cell (00, 01, 10, 11), and TLC has eight unique states per cell (000, 001, 010, 011, 100, 101, 110, 111). The amount of voltage required to read a cell increases with the number of states. This has a huge impact on flash performance (latency), bit error rates (BER), and life expectancy, measured in program/erase (P/E) cycles, also known as write/erase cycles. P/E cycles are the number of times a flash cell can be written and erased before it essentially stops functioning; this is colloquially known as flash “wear life”.
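
The relationship between bits per cell and the states the controller must distinguish is simply two to the power of the bits per cell, which is why latency and error rates climb with cell density. A quick illustration:

```python
# States per cell grow as 2**bits, so the controller must resolve ever finer
# voltage windows -- one reason latency and bit error rates climb from SLC
# to MLC to TLC.
for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    states = 2 ** bits
    print(f"{name}: {bits} bit(s)/cell -> {states} voltage states, "
          f"{states - 1} read threshold(s)")
```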

Unlike HDDs, flash cannot change stored data in place. This is because of the way flash storage programs/writes and erases data. Data is written in pages of 256 bytes, 2KB, or 4KB, plus a few bytes for storing error-correcting-code checksums. Pages are combined into blocks, which are commonly 16KB, 128KB, 256KB, or 512KB. Data is written in pages but can only be erased in whole blocks. In fact, once a block has been written to, it cannot be written to again until it’s erased. That means if only 2KB is written to a 512KB block, the remaining 510KB is inaccessible until the whole block is erased. Each time a block is erased, its wear life is shortened, so that unused capacity burns wear life without ever storing data. Known as write amplification, it’s just one flash idiosyncrasy that has serious database consequences.
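
A worked example makes the amplification concrete; the page and block sizes below come from the ranges above, and the ratio of data physically rewritten to data logically written is one common way of expressing the write amplification factor.

```python
# Back-of-the-envelope write amplification: updating 2 KB of data inside a
# 512 KB block can force the whole block to be relocated and erased.
logical_write_kb = 2          # what the database actually changed
block_kb = 512                # smallest erasable unit on the flash device
physical_write_kb = block_kb  # worst case: the full block is rewritten elsewhere

write_amplification = physical_write_kb / logical_write_kb
print(f"write amplification factor: {write_amplification:.0f}x")  # 256x worst case
```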

Flash Idiosyncrasies, Issues, or Flaws That Are Often Fatal to SQL Databases

There are seven key flash idiosyncrasies, issues, or flaws that can have a meaningful negative effect on SQL database performance: bits-per-cell tradeoffs, P/E cycles (wear life), write amplification, the write cliff, steady-state performance (versus first-out-of-box or FOB performance), read disturb, and high bit error rates.

Bits-per-cell tradeoffs are determined by whether the flash storage is SLC, eMLC, MLC, or TLC. More bits per cell results in lower-cost flash storage as well as lower performance. As bits per cell increase, so does the voltage required to program and read the cell. Increased voltage adds latency, increases bit errors, and reduces flash storage wear life. SLC has the highest performance, lowest latency, longest wear life (measured in P/E cycles), and highest cost per GB, whereas TLC has the slowest performance, highest latency, shortest wear life, and lowest cost per GB. The eMLC and MLC flash storage fall in between, with eMLC tending towards SLC and MLC tending towards TLC.

SQL database performance consequences: SQL database performance will obviously vary by the flash type utilized, but there is a value tradeoff of cost versus performance, measured as cost/IOPS, that must be taken into account. In addition, at some point the SQL database performance bottleneck will shift to the storage controller or the SQL database server resources, making additional flash IOPS impossible to utilize.

P/E cycles (wear life) are determined by flash type (SLC, eMLC, MLC, TLC). SLC is typically rated between 100K and 1M P/E cycles, eMLC from 30K to 40K, MLC from 3K to 10K, and TLC from 500 to 1K. Wear-leveling algorithms in the flash storage drive help mitigate the issue, although best results often require the greater sophistication and processing power found in a storage array system.

SQL database performance consequences: Flash storage wear life can be a major problem for SQL databases when it comes to data reliability. Cells, pages, and blocks that wear out can cause data loss that destabilizes the SQL database.
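
Those P/E ratings translate directly into a drive-lifetime estimate once write rate and write amplification are factored in. The figures in this sketch are illustrative assumptions, not the specification of any real drive.

```python
# Rough drive-lifetime estimate from P/E cycles. All inputs are illustrative
# assumptions, not specifications for any real drive.
capacity_gb = 800          # usable flash capacity
pe_cycles = 10_000         # MLC-class endurance rating
daily_writes_gb = 2_000    # host writes per day from the database
write_amplification = 3.0  # physical writes per logical write

total_endurance_gb = capacity_gb * pe_cycles
physical_writes_per_day = daily_writes_gb * write_amplification
lifetime_days = total_endurance_gb / physical_writes_per_day
print(f"estimated wear life: {lifetime_days / 365:.1f} years")  # ~3.7 years
```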

Write amplification shortens the wear life of the flash’s usable capacity because unused capacity is repeatedly erased without ever being programmed or written to. There is a series of tradeoffs when it comes to write amplification. Smaller flash storage blocks reduce the amount of waste and write amplification; the downside is a much greater number of blocks for the flash drive controller to track for data obsolescence (data no longer valid), marking for erasure (a.k.a. garbage collection), and return to the pool of writeable blocks. That additional processing adds latency, reduces IOPS, and requires additional flash storage controller DRAM. Larger block sizes reduce flash drive controller processing and latency but increase write amplification and wear. The better bet is a flash storage array system that caches writes in DRAM and serializes the data before writing it to flash; doing so very nearly eliminates write amplification. It also enables costly DRAM to be shared across all of the system’s flash, a capability generally found only in flash storage systems.

SQL database performance consequences: Reducing write amplification with smaller blocks is good, but the extra cost and higher latency diminish SQL database performance. Increased write amplification causes faster cell, page, and block failures, leading to data loss and corrupted databases. DRAM caching has the best potential to significantly reduce write amplification while increasing SQL database performance.

The write cliff occurs when flash storage performance drops suddenly and permanently. It is predominantly a flash storage drive issue and not so much a flash storage system issue, although it can become a system issue when the system’s flash capacity consumption exceeds 90% and the garbage collection process becomes more frequent while working harder and slower. The write cliff is far more prevalent with flash storage drives generally and with flash storage systems not architected specifically for the idiosyncrasies of flash.

SQL database performance consequences: The write cliff shows up as a sudden, rapid database performance decline that does not recover.

Steady-state performance vs. first-out-of-box (FOB) performance: steady state is the state of the flash storage once it has experienced enough P/E cycles to stabilize write performance at a predictable, sustained level, which generally happens once most of the flash storage blocks have been written to at least once. FOB performance is the flash storage’s best performance. The gap between FOB and steady-state performance is non-trivial, often approaching an order of magnitude (i.e., steady state is approximately 10% of FOB).

SQL database performance consequences: Failing to plan SQL database performance around flash storage steady state leads to the unpleasant choice of either living with lower SQL database performance than expected or spending more money on additional flash storage than was budgeted.
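
Planning around steady state rather than FOB is mostly arithmetic. Assuming the roughly-10% figure above, a quick sanity check looks like this; the IOPS numbers are illustrative.

```python
# Provisioning sanity check: size flash for steady-state IOPS, not the
# first-out-of-box (FOB) number on the datasheet. The 10% derating follows
# the rough order-of-magnitude gap described above; all figures illustrative.
required_iops = 150_000        # what the SQL database workload needs
fob_iops_per_drive = 90_000    # fresh-drive datasheet number
steady_state_fraction = 0.10   # steady state ~10% of FOB in the worst case

steady_iops_per_drive = fob_iops_per_drive * steady_state_fraction
drives_needed = -(-required_iops // int(steady_iops_per_drive))  # ceiling division
print(f"drives needed at steady state: {drives_needed}")  # 17, not the 2 FOB implies
```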

Read disturb occurs when reads of a NAND cell cause data to change on bordering or nearby NAND cells within the flash block. Because all cells on a flash NAND chip exist on the same die, there is occasionally cross coupling between bordering cells in a block. The energy required to pass through a cell to read its state periodically bleeds enough charge off an adjacent cell to push that cell’s state past its threshold, changing its value. Read disturb is more likely to occur as the bits per cell increase, because more energy is required to pass through the cells. Because the number of reads required to cause a read disturb is measured in nine figures, it is not that frequent; however, when it occurs, data is changed or lost. Flash storage drives and systems attempt to prevent read disturb by setting a block read threshold (counted since the last erase cycle) that triggers copying the block to another erased or unused block and resets the counter. Correcting read-disturb occurrences requires more sensitive error detection and correction codes (ECC), which are often found only in flash storage systems.

SQL database performance consequences: Read disturb is more likely to occur in databases where the same data has an exceedingly high read rate and the flash storage does not have safeguards built in. When it does occur, SQL databases will most likely become corrupted.
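
The block-read-threshold safeguard described above can be sketched as a per-block read counter. The threshold value and the block abstraction here are illustrative; real controllers implement this in firmware.

```python
# Sketch of the read-disturb safeguard: count reads per block since its last
# erase, and relocate the block's data once a threshold is crossed.
# Threshold and Block abstraction are illustrative, not firmware-accurate.
READ_DISTURB_THRESHOLD = 100_000  # reads allowed per block between erases

class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.reads_since_erase = 0

    def read(self, relocate) -> None:
        self.reads_since_erase += 1
        if self.reads_since_erase >= READ_DISTURB_THRESHOLD:
            relocate(self)                # copy valid pages to a fresh block
            self.reads_since_erase = 0    # counter resets with the new block

def relocate(block: Block) -> None:
    print(f"block {block.block_id}: relocating data before read disturb corrupts it")

hot_block = Block(block_id=7)
for _ in range(READ_DISTURB_THRESHOLD):   # a heavily read-hot block
    hot_block.read(relocate)
```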

High bit error rates (BER) are endemic to flash storage, and error correction codes (ECC) are used to monitor and correct errors. The good news is that as NAND flash geometry continues to shrink (now at 19nm, heading to 14nm and 10nm), the price per GB continues to decline. The bad news is that NAND die shrinkage causes BER to increase geometrically, requiring ECC algorithm capabilities, sensitivity, and complexity to increase right along with it. This in turn requires greater amounts of the flash storage controller’s resources to manage that ECC. Therefore, as NAND dies shrink, cost per GB drops, but overhead and IOPS latencies rise. Many flash PCIe cards and SSDs do not adjust their ECC algorithms enough to manage the increased BER, requiring the storage system or server to handle it.

SQL database performance consequences: Uncorrected high BER will corrupt SQL databases. Today, 19nm NAND flash requires 41 ECC corrections per 1KB of data, and future shrinks to smaller die sizes will require even more. Many flash storage controllers at the drive level do not have that capability; some do, and most flash storage systems have more extensive ECC capabilities.
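
One way to see why drive-level ECC runs out of headroom is to compare expected raw bit errors per codeword against the correction capability. The raw bit error rate below is an assumed worst-case figure chosen to land near the 41-correction requirement cited above; real values vary with NAND generation and wear.

```python
# Rough ECC budget check: expected raw bit errors per 1 KB codeword versus
# the correction capability. The raw bit error rate is an assumed worst-case
# figure for illustration only.
raw_ber = 5e-3                    # assumed raw bit error rate for worn small-geometry NAND
bits_per_codeword = 1024 * 8      # 1 KB of data
expected_errors = raw_ber * bits_per_codeword
correctable = 41                  # corrections per 1 KB cited above for 19nm NAND
print(f"expected errors/codeword: {expected_errors:.1f}, correctable: {correctable}")
# ~41 vs 41 -- right at the limit, so further die shrinks quickly exceed drive-level ECC
```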

Flash storage is an outstanding SQL database performance technology if it is implemented correctly and each of these flaws is addressed effectively. Otherwise, it is likely to cause dashed expectations and severe SQL database operational and financial headaches.

Eliminating The Bits-Per-Cell Tradeoffs

Well-designed flash arrays eliminate the SQL database price/performance tradeoffs by making lower-cost MLC flash drives perform and behave similarly to SLC. They speed up MLC flash performance through clever use of DRAM, which is up to 10 times faster than SLC flash. These arrays use the DRAM to coalesce writes and sequentialize them before they are written to flash. That sequentialization enables larger block sizes (more on that in the next paragraph), meaning fewer blocks to track per flash storage drive, which reduces both write and read latencies from the flash storage drives and improves performance. DRAM coalescence also increases MLC wear life. SQL database latencies stay at a consistent 1ms or less (better than most SLC flash storage) at the cost of MLC, thus eliminating the price/performance tradeoffs.
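
The write-coalescing idea can be sketched as buffering incoming random database writes in DRAM until a full flash block can be written sequentially. The block size, buffer structure, and flush callback below are illustrative assumptions, not how any particular array's firmware works.

```python
# Sketch of DRAM write coalescing: buffer small random database writes in
# DRAM and flush them to flash only as full, sequential blocks.
# Block size and flush callback are illustrative assumptions.
BLOCK_BYTES = 512 * 1024  # one erasable flash block

class WriteCoalescer:
    def __init__(self, flush_block):
        self.buffer = bytearray()
        self.flush_block = flush_block

    def write(self, payload: bytes) -> None:
        self.buffer.extend(payload)
        while len(self.buffer) >= BLOCK_BYTES:
            # One large sequential write instead of many small random ones.
            self.flush_block(bytes(self.buffer[:BLOCK_BYTES]))
            del self.buffer[:BLOCK_BYTES]

coalescer = WriteCoalescer(flush_block=lambda blk: print(f"flushed {len(blk)} bytes"))
for _ in range(300):                 # ~300 random 2 KB page writes
    coalescer.write(b"\x00" * 2048)  # flushes once a full block accumulates
```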

Reducing Write Amplification and Increasing Wear Life

DRAM coalescence eliminates the random nature of the writes to flash that are so prevalent in SQL databases (and server virtualization too). This allows the array to utilize larger block sizes with exceptionally high efficiency (i.e., minimal unused pages in blocks), making write amplification a non-issue while delivering very high wear life and endurance.

Eliminating/Mitigating Write-Cliff

A well-designed flash storage system has the built-in software, processing power, memory, and capacity to make the write cliff a non-event.

Steady-state performance

Well-designed flash storage arrays lengthen the ramp to steady-state flash performance through large flash capacities, intelligent use of data reduction techniques such as deduplication and compression, and DRAM sequentialization of writes. That use of DRAM caching also means the steady-state issue does not affect writes.

Inline deduplication and compression have the added benefit of allowing up to 10x more data to reside in the flash read cache. This allows all of the most performance-sensitive portions of the database to be pinned in flash. The result is significantly lower SQL database latencies, increased SQL database cache hits, reduced HDD paging, and much improved overall database performance, which clearly mitigates the impact of steady state.

Eliminating Read-Disturb

Well-designed flash storage arrays utilize the block read threshold (counted since the last erase cycle) that copies the block to another erased or unused block and resets the counter, thus eliminating read-disturb issues.

Correcting High BER

Well-designed flash storage arrays do a good job of detecting, correcting, and overcoming high BER. It starts with highly sensitive and extensive end-to-end error detection and correction codes covering the path from the I/O through the CPU to memory to flash, and even on to HDDs and back. This even eliminates HDD silent data corruption through checksum matching against the blocks.

Additionally, good flash systems include zero-capacity, instantaneous, writeable snapshots, cloning, and replication. That data protection capability enables the array to keep a good copy of the data in another volume or system should it not be able to autocorrect a BER.

In Summary

There are numerous flash storage issues that most DBAs simply do not know about.  Those issues can and do have a nasty impact on SQL database performance.  A well-designed flash storage array can eliminate or at least mitigate many of these issues. Be sure to ask your storage vendor about each of them before you purchase a flash storage array.

* Caching is a mandatory requirement when data must be shared on external storage such as when a hypervisor is involved.