Sunday 19 July 2015

Flash Memory - SSDs - research continued

Notes from Rex Walters' (Tintri) VMworld 2013 session: "Flash Storage Deep Dive"

Flash is changing storage, but it is very different from hard disk drives (HDDs).

For a long time tape drives were primary storage. Tape drives are very efficient at streaming data. Hard drives were revolutionary because of random access.

In the future flash will play the role of primary storage, with HDDs as secondary storage.

Flash is fast if you use it correctly!

We have 50 years of experience in how to use hard disk drives (HDDs).

NAND Flash & SSDs basics

Humans learn how to use things by making abstractions.

Floating gate -> NAND flash -> tunnel injections

SLC - Single-Level Cell - one bit per cell
MLC - Multi-Level Cell - measures how much current is flowing to tell more states apart (more bits per cell)
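
To make "more states" concrete: every extra bit stored in a cell doubles the number of charge levels the controller has to tell apart. A minimal sketch; the function name is made up for illustration.

```python
def charge_states(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell needs to store the given bits."""
    if bits_per_cell < 1:
        raise ValueError("a cell stores at least one bit")
    return 2 ** bits_per_cell

# SLC stores 1 bit per cell, MLC typically stores 2 bits per cell.
for name, bits in [("SLC", 1), ("MLC", 2)]:
    print(f"{name}: {bits} bit(s) per cell -> {charge_states(bits)} states")
```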

Flash is organised into pages of 4KB or 8KB.

Flash is not write-in-place: when you write data to an SSD you don't know where it physically lands, only the SSD controller knows.

READs and WRITEs happen at page granularity.
ERASEs happen at the granularity of "erasure blocks", each of which contains many pages.
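
A quick way to see what page granularity means for a host write: work out which pages a byte range touches and whether it is page-aligned. This is only a sketch; the 4KB page size and the helper names are assumptions, not something a real SSD exposes.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (the notes mention 4KB or 8KB)

def pages_touched(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Return the indices of the pages covered by `length` bytes at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)

def is_page_aligned(offset: int, length: int, page_size: int = PAGE_SIZE) -> bool:
    """True when the request starts and ends exactly on page boundaries."""
    return offset % page_size == 0 and length % page_size == 0

# A 2-byte write that straddles a page boundary still touches two pages.
print(list(pages_touched(4095, 2)), is_page_aligned(4095, 2))
```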

Inside every SSD there is a process called the Garbage Collector (GC). It scans the blocks looking for ones with a lot of dead pages (space to reclaim), moves the "live" pages somewhere else, and then erases the entire block.
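
The interplay of out-of-place writes and garbage collection can be sketched with a toy model. Everything here (the class, the 3 blocks x 4 pages geometry, the "greedy" victim choice) is an assumption for illustration; real controllers are far more elaborate and vendor-specific.

```python
FREE, DEAD = "free", "dead"

class ToyFTL:
    """Toy flash translation layer: out-of-place writes plus greedy GC."""

    def __init__(self, num_blocks=3, pages_per_block=4):
        self.blocks = [[FREE] * pages_per_block for _ in range(num_blocks)]
        self.where = {}                      # logical page -> (block, slot)
        self.erase_counts = [0] * num_blocks

    def _free_slot(self, skip_block=None):
        for b, block in enumerate(self.blocks):
            if b == skip_block:
                continue
            for s, state in enumerate(block):
                if state == FREE:
                    return b, s
        raise RuntimeError("no free page; run collect() first")

    def write(self, lpn, skip_block=None):
        """Write logical page `lpn` to a fresh physical page (never in place)."""
        b, s = self._free_slot(skip_block)
        if lpn in self.where:                # the old copy becomes a dead page
            old_b, old_s = self.where[lpn]
            self.blocks[old_b][old_s] = DEAD
        self.blocks[b][s] = lpn
        self.where[lpn] = (b, s)

    def collect(self):
        """Pick the block with the most dead pages, move its live pages, erase it."""
        victim = max(range(len(self.blocks)),
                     key=lambda b: self.blocks[b].count(DEAD))
        for state in list(self.blocks[victim]):
            if state not in (FREE, DEAD):    # a live logical page
                self.write(state, skip_block=victim)
        self.blocks[victim] = [FREE] * len(self.blocks[victim])
        self.erase_counts[victim] += 1
        return victim

ftl = ToyFTL()
for lpn in range(4):
    ftl.write(lpn)        # fills block 0
for lpn in range(3):
    ftl.write(lpn)        # overwrites land in block 1, leaving dead pages in block 0
victim = ftl.collect()    # the one live page in block 0 is moved, then block 0 is erased
print(victim, ftl.where, ftl.erase_counts)
```

Note that the host only ever talks about logical page numbers; the mapping in `where` is the part that "only the controller knows".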

You can only do a finite number of ERASEs per block, hence wear-levelling: the controller spreads erases evenly across blocks.
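
One simple wear-levelling policy is to always reuse the free block that has been erased the least. The sketch below assumes the controller keeps a per-block erase counter; the function names and the 3000-cycle limit are illustrative values, not figures from the talk.

```python
def pick_block_to_write(erase_counts, free_blocks):
    """Return the free block with the fewest erases so wear spreads evenly."""
    if not free_blocks:
        raise ValueError("no free blocks available")
    return min(free_blocks, key=lambda b: erase_counts[b])

def is_worn_out(erase_count, max_pe_cycles=3000):
    """True once a block has used up its (assumed) program/erase budget."""
    return erase_count >= max_pe_cycles

erase_counts = [5, 2, 9, 2]   # erases so far for blocks 0..3
free_blocks = [0, 2, 3]       # block 1 still holds live data
print(pick_block_to_write(erase_counts, free_blocks))
```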

HDD or SSD - both are block storage devices.

The SCSI protocol was developed by Larry Boucher. Before SCSI, HDDs were accessed directly by H/C/S (Head/Cylinder/Sector) addresses.

The SCSI protocol assumes that an HDD is a contiguous series of blocks.

SCSI is a complex I/O protocol, but it mainly comes down to:
READ(offset, count) and WRITE(offset, count)
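
The "contiguous series of blocks" abstraction can be sketched as a tiny in-memory device. This is not the real SCSI command set; the class, the 512-byte block size and the method signatures are assumptions that only mirror the READ/WRITE idea (here WRITE takes the payload instead of a count).

```python
BLOCK_SIZE = 512  # assumed logical block size in bytes

class ToyBlockDevice:
    """A device seen as a flat, contiguous array of fixed-size blocks."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def _check(self, lba, count):
        if lba < 0 or count < 0 or lba + count > self.num_blocks:
            raise ValueError("request outside the device")

    def read(self, lba, count):
        """READ(offset, count): return `count` blocks starting at block `lba`."""
        self._check(lba, count)
        return bytes(self.data[lba * BLOCK_SIZE:(lba + count) * BLOCK_SIZE])

    def write(self, lba, payload):
        """WRITE: store whole blocks starting at block `lba`."""
        if len(payload) % BLOCK_SIZE:
            raise ValueError("payload must be a whole number of blocks")
        count = len(payload) // BLOCK_SIZE
        self._check(lba, count)
        self.data[lba * BLOCK_SIZE:(lba + count) * BLOCK_SIZE] = payload

dev = ToyBlockDevice(num_blocks=8)
dev.write(2, b"\x01" * BLOCK_SIZE)
print(dev.read(2, 1)[:4], dev.read(0, 1)[:4])
```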

SCSI is for serial access (tape and HDD) - let's make an HDD look like a tape ;-)

A file is a contiguous sequence of bytes instead of blocks, so we have byte granularity of access.
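
A file layer gets its byte granularity by reading whole blocks underneath and trimming the result. A minimal sketch, assuming a 512-byte block size and a hypothetical `read_blocks(lba, count)` callback standing in for the block device:

```python
def read_bytes(read_blocks, offset, length, block_size=512):
    """Serve a byte-granular read on top of a block-granular read function."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    raw = read_blocks(first, last - first + 1)   # always whole blocks
    start = offset - first * block_size          # trim to the requested bytes
    return raw[start:start + length]

# Stand-in "disk": 2048 bytes, i.e. 4 blocks of 512 bytes.
disk = bytes(range(256)) * 8

def read_blocks(lba, count, block_size=512):
    return disk[lba * block_size:(lba + count) * block_size]

# 4 bytes that straddle the boundary between block 0 and block 1.
print(read_bytes(read_blocks, 510, 4))
```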

From the VMware perspective we are talking to a file (vmdk).

If we use VMFS the meta-data is on the host; if we talk to NFS the meta-data is on the array.

If you want to keep data safe you need to move it periodically. We store data with a CRC, and every time we access the data we check the CRC. What if we have a 1-year-old backup which was never restored? What is the chance that the data is corrupted?
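
The "store with a CRC, check on every access" idea looks roughly like this. A sketch using the CRC-32 from Python's standard `zlib` module; the `store`/`load` helpers are hypothetical, and which checksum a real array uses is not stated in the notes.

```python
import zlib

def store(payload: bytes):
    """Return the payload together with the CRC-32 computed at write time."""
    return payload, zlib.crc32(payload)

def load(payload: bytes, stored_crc: int) -> bytes:
    """Recompute the CRC on every access and refuse to return bad data."""
    if zlib.crc32(payload) != stored_crc:
        raise IOError("checksum mismatch: data is corrupted")
    return payload

data, crc = store(b"vmdk contents")
print(load(data, crc))
```

Data that is never read is never checked, which is exactly the problem with the old backup that was never restored.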

Other SSD vs. HDD differences/implications 

* asymmetric read/write performance
* wear-levelling, garbage collection and FTL - data is always in motion
* parallel I/O channels with RAM buffers
* major implementation variations with every new version of firmware
* flash replaces known failure modes with completely new ones
* limited endurance (P/E cycles)
* every SSD behaves differently with different firmware
* read disturbance - if you read the same cell over and over again, the process can reprogram neighbouring cells

Latency and IOPS for HDDs:

Average is the reality - best case never happens outside the lab ;-)


For HDDs the IOPS are the bottleneck, but the bigger the blocks, the higher the throughput.



For SSDs the throughput is the bottleneck, so the block size doesn't matter.
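
Both statements follow from throughput = IOPS x block size. The sketch below is a deliberately simplified model: the 150 IOPS disk and the 500 MB/s SSD are made-up example figures, and it ignores the sequential transfer limit of a real HDD.

```python
def hdd_throughput_mb_s(iops: float, block_kb: float) -> float:
    """IOPS-bound device: throughput grows linearly with block size."""
    return iops * block_kb / 1024

def ssd_iops(throughput_mb_s: float, block_kb: float) -> float:
    """Throughput-bound device: achievable IOPS shrink as blocks grow."""
    return throughput_mb_s * 1024 / block_kb

for block_kb in (4, 64):
    print(f"{block_kb}KB blocks: "
          f"HDD {hdd_throughput_mb_s(150, block_kb):.2f} MB/s, "
          f"SSD {ssd_iops(500, block_kb):.0f} IOPS at a constant 500 MB/s")
```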

 

Storage Systems must evolve:

* align to SSD pages
* keep meta-data in flash
* break up large WRITES (allow interleaved READs)
* anticipate power-loss at any time (checkpoints)
* checksum data and meta-data references
* verify on read and scrub (most data is 'cold'; finding corruption only after 6 months is bad, which is why we scrub more frequently)
* self-heal with multi-bit redundancy
* paranoid READs with de-dupe - use a strong cryptographic hash (SHA-1)
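
The last two ideas in the list (checksummed references and paranoid reads with de-dupe) can be sketched as a small content-addressed store. This is my reading of the bullet, not the design presented in the session; the class and method names are hypothetical, and SHA-1 comes from Python's standard `hashlib`.

```python
import hashlib

class DedupeStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.chunks = {}   # SHA-1 hex digest -> chunk bytes

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha1(chunk).hexdigest()
        existing = self.chunks.get(key)
        if existing is None:
            self.chunks[key] = chunk
        elif existing != chunk:
            # Paranoid: never trust a fingerprint match without comparing bytes.
            raise RuntimeError("hash collision or corrupted stored chunk")
        return key

    def get(self, key: str) -> bytes:
        chunk = self.chunks[key]
        if hashlib.sha1(chunk).hexdigest() != key:   # verify on every read
            raise IOError("stored chunk failed verification")
        return chunk

store = DedupeStore()
k1 = store.put(b"a" * 4096)
k2 = store.put(b"a" * 4096)   # duplicate chunk, stored only once
print(k1 == k2, len(store.chunks))
```

With de-dupe a single corrupted chunk can damage every file that references it, which is why the read path verifies instead of trusting the index.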

Conclusions:

* Understand how flash works: it's utterly different from rotating disks
* Flash offers spectacular performance, but it fails in surprising new ways
* Design for its strengths and around its weaknesses - don't use it like an HDD!




 


 
