Sunday, 19 July 2015

Flash Memory - SSD disks - research continued

Notes from Rex Walters' (Tintri) VMworld 2013 session: "Flash Storage Deep Dive"

Flash is changing storage, but it is very different from hard disk drives (HDDs).

For a long time tape drives were the primary storage. Tape drives are very efficient at streaming data. Hard drives were revolutionary because they offered random access.

In the future flash will play the role of primary storage, with HDD as secondary storage.

Flash is fast if you use it correctly!

We have 50 years of experience in how to use hard disk drives (HDDs).

NAND Flash & SSD basics

Human beings learn how to use things by making abstractions.

Floating gate -> NAND flash -> tunnel injections

SLC - Single Level Cell - one bit per cell
MLC - Multi Level Cell - measure how much current is flowing (more states, more bits per cell)

Flash is organised into pages of 4KB or 8KB.

Flash is not write-in-place - when you write data to an SSD you don't know where the data is physically located; only the SSD controller knows.

READ and WRITE are organised at page granularity.
ERASE is organised into "erase blocks" - groups of many pages.

Inside every SSD there is a process called the Garbage Collector (GC) which scans all the blocks and tries to find blocks with a lot of dead pages (space to reclaim); the GC moves the "live" pages somewhere else and erases the entire block.
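A minimal sketch of this reclamation loop (my own toy model, not Tintri's code - block size, names and states are invented for illustration):

# Toy model of SSD garbage collection (illustration only).
# A block holds a fixed number of pages; each page is LIVE, DEAD or FREE.

LIVE, DEAD, FREE = "live", "dead", "free"
PAGES_PER_BLOCK = 4

class Block:
    def __init__(self):
        self.pages = [FREE] * PAGES_PER_BLOCK
        self.erase_count = 0          # needed for wear-levelling decisions

    def dead_pages(self):
        return self.pages.count(DEAD)

    def erase(self):
        self.pages = [FREE] * PAGES_PER_BLOCK
        self.erase_count += 1

def garbage_collect(blocks, spare):
    """Pick the block with the most dead pages, copy its live pages
    into the spare block, then erase the victim."""
    victim = max(blocks, key=lambda b: b.dead_pages())
    if victim.dead_pages() == 0:
        return                        # nothing worth reclaiming
    for state in victim.pages:
        if state == LIVE:
            free_slot = spare.pages.index(FREE)
            spare.pages[free_slot] = LIVE   # relocate the live page
    victim.erase()                          # reclaim the whole block

# Example: one block is mostly dead, so GC reclaims it.
blocks = [Block(), Block()]
blocks[0].pages = [DEAD, DEAD, LIVE, DEAD]
blocks[1].pages = [LIVE, LIVE, FREE, FREE]
spare = Block()
garbage_collect(blocks, spare)
print(blocks[0].pages, spare.pages)   # victim fully freed, one live page relocated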

You can only do a finite number of ERASE cycles - hence wear-levelling.

HDD or SSD - both are block storage devices.

The SCSI protocol was developed by Larry Boucher. Before SCSI we accessed HDDs directly through C/H/S (Cylinder/Head/Sector) addressing.

The SCSI protocol assumes that an HDD is a contiguous series of blocks.

SCSI is a complex I/O protocol, but it mainly uses:
READ(offset, count) and WRITE(offset, count)
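A hedged sketch of what that abstraction boils down to: the device is just a flat array of fixed-size blocks addressed by offset (the class and names below are my own illustration, not part of the SCSI spec):

# Toy block device: a flat array of fixed-size blocks addressed by offset (LBA).
BLOCK_SIZE = 512

class BlockDevice:
    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]

    def read(self, offset, count):
        """READ(offset, count): return 'count' blocks starting at 'offset'."""
        return self.blocks[offset:offset + count]

    def write(self, offset, data_blocks):
        """WRITE(offset, count): overwrite blocks starting at 'offset'."""
        for i, blk in enumerate(data_blocks):
            self.blocks[offset + i] = blk

dev = BlockDevice(num_blocks=8)
dev.write(2, [b"A" * BLOCK_SIZE])
print(dev.read(2, 1)[0][:4])   # b'AAAA'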

SCSI is for serial access (tape and HDD) - let's make the HDD look like a tape ;-)

A file is a contiguous sequence of bytes instead of blocks - we get byte granularity of access.

From the VMware perspective we are talking to a file (VMDK).

If we use VMFS the metadata is on the host; if we talk NFS the metadata is on the array.

If you want to keep data safe you want to move it periodically: we store data with a CRC and every time we access the data we check the CRC. What if we have a 1-year-old backup which was never restored? What is the chance that the data is corrupted?
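A minimal sketch of the check-on-access idea, using CRC32 from Python's zlib purely as an example (a real array would use stronger checksums and redundancy to repair, not just detect):

import zlib

def store(data: bytes):
    """Store data together with its CRC32."""
    return data, zlib.crc32(data)

def read_verified(data: bytes, stored_crc: int) -> bytes:
    """Every access recomputes the CRC and compares it to the stored one."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("silent corruption detected - restore from redundancy")
    return data

payload, crc = store(b"backup block")
read_verified(payload, crc)                 # OK, data still matches its CRC
try:
    read_verified(b"backup blOck", crc)     # simulated bit rot
except IOError as e:
    print(e)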

Other SSD vs. HDD differences/implications 

* Asymmetric read/write performance
* wear-levelling, garbage collection and the FTL - data is always in motion
* Parallel I/O channels with RAM buffers
* Major implementation variations with every new version of firmware
* Flash replaces known failure modes with completely new ones
* limited endurance (P/E cycles)
* all SSDs behave differently with different firmware
* read disturbance - if you read the same cell over and over again,
the process can reprogram neighbouring cells

The latency and IOPS for HDDs:

The average is the reality - the best case never happens outside the lab ;-)

For HDDs the IOPS are the bottleneck, but the bigger the blocks, the more throughput.

For SSDs the throughput is the bottleneck - the block size doesn't matter.
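A quick back-of-the-envelope calculation (the 200 IOPS and 500 MB/s figures are my own illustrative assumptions, not numbers from the session) showing why block size drives HDD throughput but barely affects a throughput-limited SSD:

# Throughput = IOPS x block size (illustrative numbers only).
def hdd_throughput_mb_s(block_kb, max_iops=200):
    # HDD: seek-limited, so IOPS stay roughly constant and throughput grows with block size.
    return max_iops * block_kb / 1024

def ssd_iops(block_kb, max_throughput_mb_s=500):
    # SSD: bandwidth-limited, so throughput stays roughly constant and IOPS fall with block size.
    return max_throughput_mb_s * 1024 / block_kb

for kb in (4, 64, 1024):
    print(f"{kb:>5} KB blocks: HDD ~{hdd_throughput_mb_s(kb):7.1f} MB/s, "
          f"SSD ~{ssd_iops(kb):9.0f} IOPS")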

 

Storage Systems must evolve:

* align to SSD pages
* keep meta-data in flash
* break up large WRITES (allow interleaved READs)
* anticipate power-loss at any time (checkpoints)
* checksum data and meta-data references
* verify on read and scrub in the background (most data is 'cold' - finding corruption only after 6 months is too late, that's why we scrub more frequently)
* self-heal with multi-bit redundancy
* paranoid READs with de-dupe - use a strong cryptographic hash such as SHA-1 (see the sketch below)
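A toy sketch of the paranoid-READ idea: each block is addressed by its SHA-1 fingerprint (de-dupe), and every read re-hashes the data to prove it still matches its address (my illustration, not any vendor's code):

import hashlib

store = {}   # fingerprint -> block (de-duplicated content store)

def write_block(data: bytes) -> str:
    fp = hashlib.sha1(data).hexdigest()
    store.setdefault(fp, data)          # identical blocks are stored only once
    return fp                           # the fingerprint is the block's address

def paranoid_read(fp: str) -> bytes:
    data = store[fp]
    # Re-hash on every read: the data must still match its own address.
    if hashlib.sha1(data).hexdigest() != fp:
        raise IOError("block failed its fingerprint check - rebuild from redundancy")
    return data

addr = write_block(b"hello flash")
assert write_block(b"hello flash") == addr   # de-dupe: same content, same fingerprint
print(paranoid_read(addr))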

Conclusions:

* Understand how flash works: it's utterly different from rotating media
* Flash delivers spectacular performance, but it fails in surprising new ways
* Design for its strengths and around its weaknesses - don't use it like an HDD!




 


 

Sunday, 12 July 2015

Flash storage - SSD disks - NV-Memory - RESEARCH & NOTES

I've focused on flash storage recently. Below are my research notes from different sources.

1.) HotChips conference - notes from Jim Hardy's 'Flash Storage' presentation:

By 2020 we will have:

~ 35ZB of data (according to CSC)
~ 50 billion things on the internet (according to Cisco)

* If a CPU access were a heartbeat, then a DRAM access would be like walking a mile, and an HDD access would be like cycling from San Francisco to Miami.

* When you put an SSD into a storage array you still have the network/SAN fabric delay. If you put an SSD into a server you accelerate the server but you cause some problems with coherency.

* Other Flash benefits:
- less power & cooling
- improved reliability: fewer things to fail
- floor space reduction
- lower licensing fees (consolidate servers)
- faster error recovery:
-- RAID rebuild
-- backup restore
-- snapshots

* Samsung went to 3D flash in 2013.

* NAND scaling limits:
- too few electrons per gate
- needs constant shrinking for cost reduction
- 4 bits/cell is hard to make:
-- this may be the maximum possible
* other technologies will scale past NAND:
-- PCM, MRAM, RRAM, FRAM ...
-- not yet clear which one will win...

NAND doesn't do 'byte' writes, it does 'page' writes, but it also doesn't overwrite in place. You have to erase first and then write the new data. When you move data from block to block you have to think about wear.
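A tiny model to make the rule concrete (my own illustration): pages can only be programmed from the erased state, and erase works on whole blocks, so every 'overwrite' is really a write-somewhere-else plus an eventual erase that costs endurance:

class FlashBlock:
    """Pages program only from the erased state; erase wipes the whole block."""
    def __init__(self, pages=4):
        self.pages = [None] * pages     # None == erased
        self.erase_count = 0            # every erase costs endurance (wear)

    def program(self, page_no, data):
        if self.pages[page_no] is not None:
            raise RuntimeError("no overwrite in place - erase the block first")
        self.pages[page_no] = data

    def erase(self):
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

blk = FlashBlock()
blk.program(0, b"v1")
try:
    blk.program(0, b"v2")               # in-place overwrite is not allowed
except RuntimeError as e:
    print(e)
blk.erase()                             # whole-block erase, counts against endurance
blk.program(0, b"v2")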

New NV-memories are better than NAND:

NAND:
* serial read
* erase before write
* block erase/page write
* slow writes
* inherent bit errors
* wear

New NV-memories:
* random read
* overwrite
* byte write
* fast write
* lower bit errors
* low/no wear

Future: Storage Class Memory

NV-memories won't cross below HDD $/GB!
New NV-memories will require new computing architectures.

2.) HotChips conference - Krishna Parat 'NAND Technologies' presentation:

NAND Flash key attributes:
* non-volatile memory
* READ access time of 10's of microseconds
* WRITE/ERASE time of ~ millisecond
* page programming
* block erase
* 10's of thousands of W/E cycles of endurance
* NAND has a simple cell and array structure

Program => Electrons stored on the Floating Gate -> High Vt

Erase => remove electrons from Floating Gate - Low Vt

Read => look for current through the cell at given gate bias
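To make the Vt story concrete, here is a hedged toy model: a read compares the cell's threshold voltage against reference levels - SLC needs one reference, MLC several. The voltage values below are invented for illustration, not device specs:

# Toy read model: the stored value is inferred from which Vt window the cell
# falls into. Reference voltages below are invented for illustration.
SLC_REFS = [2.0]                 # one reference -> 1 bit (erased/programmed)
MLC_REFS = [1.0, 2.5, 4.0]       # three references -> 4 states -> 2 bits

def read_cell(vt, refs):
    """Return the state index: how many reference levels the cell's Vt exceeds."""
    return sum(vt > r for r in refs)

print(read_cell(0.5, SLC_REFS))  # 0 -> erased (low Vt)
print(read_cell(3.1, SLC_REFS))  # 1 -> programmed (high Vt)
print(read_cell(3.1, MLC_REFS))  # 2 -> one of the four MLC states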

* Programming is by tunneling electrons through the tunnel oxide, applying a high gate voltage and grounding the well and source/drain.

* Erase is by tunneling electrons through the tunnel oxide, applying a high well voltage and grounding the gate (the P-well is where the electrons flow).

* Cell area has scaled per Moore's law predictions into the mid-20nm range without much difficulty.

* NAND cell scaling issue at 20nm: 'interference concerns' with wrapped cells.

Planar Floating gate cell with High-K/Metal gate successfully overcomes the scaling hurdle for 20nm and beyond.

* Program/Erase voltages ~20V !!

* Large voltage + small pitch = Reliability Risk
* At ~10nm WL-WL electric field ~ 10MV/inch !

2D NAND -> 3D NAND
* vertical NAND string
* conductive or dielectric storage node
* deposited poly-silicon channel
* the large footprint of the 3D cell requires quite a few layers to be stacked to achieve effective cell-area scaling
* increased demands on process technology: very high aspect-ratio etches & fills

Competing 3D cell structure options:
* vertical string vs. horizontal string
* vertical is more attractive for electrical properties
* horizontal string is more attractive for cell size
* either case will lead to increased block size

Summary:
* NAND flash has a simple array structure which has been highly amenable to scaling
* NAND flash leads the industry in scaling
* Lithography-induced scaling limits were overcome using advanced pitch-reduction techniques.
* Interference issues were contained through the introduction of air-gaps at critical locations.
* Wrap-cell limits were overcome with a planar floating-gate cell using a High-K dielectric/metal gate.
* 2D scaling can continue into the mid-to-low ~10nm range; scaling beyond that can come from transitioning to 3D.

3.) HotChips conference - Amber Huffman 'PCIe Storage' presentation:

* NVM Express:
PCIe SSDs are emerging in datacenters/enterprises, co-existing with SAS/SATA depending on the application.

* PCIe is a great interface for SSDs:
-- 1GB/s per lane (PCIe Gen3 x1)
-- 8GB/s per device (PCIe Gen3 x8) or more
-- low latency - platform + adapter: 10 microseconds down to 3 microseconds
-- lower power
-- lower cost (?)
-- PCIe lanes off the CPU: 40 on Gen3 (80 in dual socket)

* NVM Express is an interface architected from the ground up for NAND today and for next-generation NV-memories.

NVM Express technical basics:
* all parameters for a 4KB command in a single 64B command
* supports deep queues (64K commands per queue, up to 64K queues)
* supports MSI-X and interrupt steering
* streamlined & simple command set optimized for NVM (13 required commands)
* optional features to address the target segment of the product, client or enterprise:
* enterprise: end-to-end data protection, reservations etc.
* client: autonomous power state transitions etc.
* designed to scale for next-generation NVM, agnostic to the NVM type used (it's all about queues!! - see the sketch below)
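A rough sketch of the 'all about queues' point: the host puts 64-byte commands on submission queues and the device posts completions, with many deep queues working in parallel. This is a conceptual toy model, not the real NVMe register/doorbell interface:

from collections import deque

class NvmeQueuePair:
    """Toy submission/completion queue pair - conceptual only."""
    def __init__(self, depth=64 * 1024):          # the spec allows up to 64K entries
        self.depth = depth
        self.sq = deque()                          # submission queue
        self.cq = deque()                          # completion queue

    def submit(self, opcode, lba, nblocks):
        if len(self.sq) >= self.depth:
            raise RuntimeError("queue full")
        self.sq.append({"opcode": opcode, "lba": lba, "nblocks": nblocks})

    def device_process(self):
        # A real SSD drains many such queues in parallel across its flash channels.
        while self.sq:
            cmd = self.sq.popleft()
            self.cq.append({"cmd": cmd, "status": "success"})

# One queue pair per CPU core avoids locking between cores.
qp = NvmeQueuePair()
qp.submit("read", lba=0, nblocks=8)
qp.submit("write", lba=128, nblocks=8)
qp.device_process()
print([c["status"] for c in qp.cq])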

"Memory-like" attributes are possible with next-generation NVM. New programming models are needed to take full advantage.

Recall:
* A transformation was needed to get the full benefit of multi-core CPUs. Application and OS level changes were required.
* To date, SSDs have used the legacy interface of HDDs, based on slow rotating platters...
* SSDs are inherently parallel and next-generation NVM approaches DRAM-like latencies.

4.) HotChips conference - Rado Danilak - Skyera

* using a legacy interface (SAS/SATA) limits the bandwidth/throughput possible from an SSD
* PCIe storage is internal to the server (a drawback)

* 100x life amplification of the latest generations of flash
* minimize writes to flash (compression & de-duplication)
* new RAID algorithms (classic RAID6 wears flash ~3x quicker - see the arithmetic below)
* new DSP/ECC
* adaptive reads & writes
* device physics manipulations
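Back-of-the-envelope arithmetic (mine, not Skyera's) for the RAID6 point: a small random write under classic RAID6 rewrites the data block plus two parity blocks, so the flash sees roughly three device writes per host write:

# Small random write under classic RAID6: data + P parity + Q parity get rewritten.
host_writes = 1                       # one 4KB block written by the host
device_writes = 1 + 2                 # the data block plus two parity blocks
write_amplification = device_writes / host_writes
print(write_amplification)            # 3.0 -> roughly 3x faster endurance burn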

Lower voltages on SSDs generate new problems - artifacts.

Optimize the storage stack:
* System: compression, de-duplication, encryption in hardware to minimize WRITEs to the flash
* RAID: must achieve better than RAID6 reliability with far fewer writes to flash
* Flash Controller: develop more sophisticated DSP and ECC algorithms
* FTL: flash physics manipulations to optimize for system-wide wear-levelling
* Adaptive RD & WR based on usage patterns
* filesystems are not optimized for flash

Why do we need a flash-aware filesystem?
We have plenty of filesystems to choose from, but they were written with HDDs in mind.

ext3/4 - no inline de-duplication, no compression, no snapshots, no clones etc.

ZFS - a feature-rich filesystem (compression, de-dupe, snapshots, clones etc.). ZFS writes data in a copy-on-write, log-like way. When the pool becomes full, garbage collection must run, and GC results in WRITE amplification. Don't penalize the flash.

Skyera developed own filesystem for All-Flash-Array.

Why not use commodity hardware and differentiate in software?

To reach 0.5 PB of capacity with x86 systems takes more than 1 rack!! If you build non-commodity hardware you can fit it in 1U!!! Much less hardware is needed to build non-commodity storage, which means lower cost in comparison to commodity x86 - that sounds counter-intuitive.

Doesn't using commodity hardware waste a lot of the value of the flash? If you put that much flash in there, don't you lose the balance with the networking bandwidth?

The biggest problem is how to deliver this flash performance to the servers. The network is a bottleneck; conventional iSCSI and FC are too slow.
PCIe people love latency and bandwidth. Skyera introduced a 1U solution which has 96 PCIe lanes and connects 24 to 96 servers sharing the device. 80% of the market is shared storage (not PCIe cards or local HDDs). They have to close the networking gap - storage and networking are completely merging.



5.) HotChips conference - Kevin Rovett - Violin Memory

Design choices & tradeoffs

Flash devices are:

* READ 70 microsec, WRITE 1-2 millisec
* and then there is block ERASE: 5-10 millisec
* block failures
* die failures
* read bit errors, program errors, read disturbance
* parameters change with each new process update
* many different devices to choose from.

Summary :
- flash is not "slow DRAM"
- lots of insight is needed

What is important for customer? Latency? Bandwidth? Retention?

Flash controller considerations:

* schedule I/O
-- lay out blocks on devices based on expected usage and device access characteristics
-- device blocks are usually 8K or 16K; decide how to spread WRITEs across devices
-- avoid the "WRITE cliff" effect by managing I/O types and GC (garbage collection) across multiple blocks and devices
-- traditionally all of this is done in the so-called FTL (Flash Translation Layer) - see the sketch below
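A minimal sketch of the FTL idea (illustrative only): a logical-to-physical page map where every write goes to a fresh physical page and the old page is merely marked invalid for later GC:

class ToyFTL:
    """Logical block address -> physical page map with out-of-place writes."""
    def __init__(self, num_pages):
        self.l2p = {}                        # logical -> physical
        self.free = list(range(num_pages))   # erased physical pages
        self.invalid = set()                 # dead pages waiting for GC

    def write(self, lba, data, flash):
        old = self.l2p.get(lba)
        if old is not None:
            self.invalid.add(old)            # never overwrite in place
        phys = self.free.pop(0)              # always program a fresh page
        flash[phys] = data
        self.l2p[lba] = phys

    def read(self, lba, flash):
        return flash[self.l2p[lba]]

flash = [None] * 8
ftl = ToyFTL(num_pages=8)
ftl.write(0, b"v1", flash)
ftl.write(0, b"v2", flash)                   # same LBA, new physical page
print(ftl.read(0, flash), sorted(ftl.invalid))   # b'v2' [0]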

Flash controller design: error handling
* Error correction
-- every READ has many bit errors, so some form of forward error correction must be used: BCH and/or LDPC (Low-Density Parity-Check) codes (a toy FEC sketch follows after this list)
-- BCH is die-area intensive and consumes the flash spare area
-- might not provide enough correction past 19nm
-- lowest latency
-- LDPC in conjunction with BCH provides a much better error rate
-- requires special access to the device
-- increases latency by as much as 4 times
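The real BCH/LDPC codes are far beyond a blog note, but a toy Hamming(7,4) encoder/decoder shows the basic forward-error-correction idea: add parity bits so that a single flipped bit can be located and corrected. This is not BCH or LDPC, just the simplest possible FEC:

def hamming74_encode(d):
    """d = [d1, d2, d3, d4] -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct a single bit error and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]       # checks codeword positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]       # checks codeword positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]       # checks codeword positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3      # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1             # flip it back
    return [c[2], c[4], c[5], c[6]]

word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1                             # simulate a read bit error
print(hamming74_decode(word))            # [1, 0, 1, 1] - error corrected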

Violin performance expectations:

* ultra-low latency for RD and WR - 20-40 microsec (25x faster than HDD)
* almost unlimited IOPS - 500,000 to 1,000,000 in 3V (1000x improvement over HDD)
* high throughput - 3-5 GB/s in 3V (20x more than HDD)
* zero contention under load - no tuning - no workload separation

What about the large number of SSDs and the ambient temperature effect?
Flash doesn't mind if you raise the temperature as long as you keep it within the spec! Actually SSDs work better if you raise the temperature. You really have to design the rest of the system to support 45-55C - very difficult to do.
This is more FPGA than anything else.

In Violin the air flow through the flash devices is hotter and it improves the retention of the SSDs - it sounds counter-intuitive.

6.) HotChips conference - Neil Vachharajami - Pure Storage

Why not just put SSDs into today's disk arrays?

* current software systems are optimized for disks (HDD)
* flash and disks are very different
* need storage arrays designed to leverage flash

7.) HotChips conference - David Flynn - Fusion-IO - Primary Data

Cloud => $/GB [durability]
Fusion-IO => $/IOPS [mutuality]

Where will we see growth?
FOBS - File and Object Based Storage, software-defined and scale-out.

Scale-up storage will fade.

Call to action:
* encryption (security)
* compression (capacity efficiency)
* erasure coding (capacity efficiency - see the sketch after this list)
* hash calculation (capacity & data integrity)
These things need to go into the datapath!
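As a tiny illustration of the erasure-coding item: single-parity XOR, the simplest possible scheme (real systems use Reed-Solomon or similar codes that survive more than one loss):

# Single-parity erasure code: parity = XOR of all data blocks.
# Any ONE lost block can be rebuilt by XOR-ing the survivors with the parity.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = b"\x00" * 4
for blk in data:
    parity = xor_blocks(parity, blk)

lost_index = 1                              # pretend block 1 disappeared
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
rebuilt = parity
for blk in survivors:
    rebuilt = xor_blocks(rebuilt, blk)
print(rebuilt)                              # b'BBBB' - the lost block, recovered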

* People are stuck with the current block abstraction, which forces the FTL mapping into the SSD.

* New memory-speed NVM to replace flash
* CPUs that deal with a wide range of memory access delays

to be continued...