Monday 28 December 2015

NEC HYDRAstor - globally de-duplicated storage - research

The father of HYDRAstor is the Polish computer scientist Cezary Dubnicki, the CEO of 9livesdata.com.

HYDRAstor is the fastest and most scalable backup system in the world.

To learn more about the system I highly recommend watching the Tech Field Day session:

http://techfieldday.com/appearance/nec-storage-presents-at-storage-field-day-6/ 

You can read a solid summary of this event on Chin-Fah Heoh's blog:

http://storagegaga.com/hail-hydra/

However, I don't agree with this conclusion:
 
"deduplication solutions such as HydraStor, EMC Data Domain, and HP StoreOnce, are being superceded by Copy Data Management technology, touted by Actifio."

I believe the HYDRAstor approach is very unique, and Actifio's 'Copy Data Management' seems to be similar - I wouldn't be surprised if Actifio also uses a cryptographic hash table for their VDP (Virtual Data Pipeline), with some 'magic' sauce on top. BTW, I am a huge supporter of Actifio too, but I couldn't find any deep-dive materials.
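Just to make the core idea concrete - a toy sketch of hash-based de-duplication on a Linux box (my own illustration, nothing to do with the real HYDRAstor internals, which use content-defined chunking and a distributed hash table rather than fixed 4 KB chunks on a single host):

# split -b 4096 backup.img chunk.
# sha1sum chunk.* | awk '{print $1}' | sort | uniq -cd

The second command prints only the fingerprints that occur more than once - every repeated SHA-1 is a chunk you would store exactly once and reference many times.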

What I really like about HYDRAstor and 9livesdata is that they really share real knowledge without any marketing yadda..yadda.. (I know, NEC's HYDRAstor marketing is quite ancient, like their GUI :)

But those white papers speak for themselves:


"Reducing fragmentation impact with forward knowledge in backup systems with deduplication" – SYSTOR'15, Haifa, Izrael
"Fuzzy adaptive control for heterogeneous tasks in high-performance storage systems" – SYSTOR'13, Haifa, Izrael
"Concurrent Deletion in a Distributed Content - Addressable Storage System with Global Deduplication" – FAST'13, San Jose, USA
"Reducing Impact of Data Fragmentation Caused By In-Line Deduplication" – SYSTOR'12, Haifa, Izrael
"Anchor-driven subchunk deduplication" – SYSTOR'11, Haifa, Izrael"Bimodal Content Defined Chunking for Backup Streams" – FAST'10, San Jose, USA
"HydraFS: A High-Throughput File System for the HYDRAstor Content- Addressable Storage System" – FAST'10, San Jose, USA 

"HYDRAstor: a Scalable Secondary Storage" – FAST'09, San Francisco, USA
"FPN: A Distributed Hash Table for Commercial Applications" – HPDC'04, Honolulu, USA


the end.




Thursday 24 September 2015

My favorite Linux/VMware ESXi console keyboard shortcuts.

I work a lot with the text console and love keyboard shortcuts that improve my efficiency. Most of them also work on the VMware ESXi text console - those are marked ((VMware)) below, the rest ((Linux only)).

Ctrl + U     Clears the line before the cursor position. If you are at the end of the line, clears the entire line. ((VMware))
Ctrl + H     Same as Backspace. ((VMware))
Ctrl + R     Lets you search through previously used commands. ((VMware))
Ctrl + C     Kills whatever you are running. ((VMware))
Ctrl + D     Exits the current shell. ((VMware))
Ctrl + Z     Puts whatever you are running into a suspended background process; fg restores it. ((VMware))
Ctrl + W     Deletes the word before the cursor. ((VMware))
Ctrl + K     Clears the line after the cursor. ((VMware))
Ctrl + T     Swaps the last two characters before the cursor. ((Linux only))
Ctrl + Y     Pastes the content of the buffer after Ctrl + K or Ctrl + U usage. ((Linux only))
Ctrl + A     Goes to the beginning of the line you are currently typing on. ((VMware))
Ctrl + E     Goes to the end of the line you are currently typing on. ((VMware))
Ctrl + L     Clears the screen, similar to the clear command. ((VMware))
Esc + T      Swaps the last two words before the cursor. ((Linux only))
Alt + F      Moves the cursor forward one word on the current line. ((VMware))
Alt + B      Moves the cursor backward one word on the current line. ((VMware))
Alt + .      Pastes the last argument of the last command. ((Linux only))
Tab          Auto-completes file and folder names. ((VMware))

The End.

Saturday 15 August 2015

vSphere ESXi Shell console terminated - ALT+F1 doesn't work?

We encountered an interesting 'issue' recently. We couldn't log in to the ESXi Shell console - there was no login prompt even though the ESXi Shell was enabled.


We pressed ALT+F1 and got:



Mistakenly, I thought that the greyed-out timeouts meant that the timeout was disabled, but that's not true!

We checked the timeouts using ssh:

# esxcli system settings advanced list -o /UserVars/ESXiShellTimeOut

 
   Path: /UserVars/ESXiShellTimeOut
   Type: integer
   Int Value: 1
   Default Int Value: 0
   Min Value: 0
   Max Value: 86400
   String Value:
   Default String Value:
   Valid Characters:
   Description: Time before automatically disabling local and remote shell access (in seconds, 0 disables).  Takes effect after the services are restarted.
 


The timeout value was incorrectly set to 1, which means 1 second - our ESXi Shell was enabled but timed out after one second :)

It seems that while customizing the ESXi host (e.g. with PowerCLI), the value '1' was set instead of '0', which is the value that disables the timeout.
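If this happens to you, the fix is a one-liner over SSH - setting the value back to 0 disables the timeout (as the description above says, it takes effect after the shell services are restarted):

# esxcli system settings advanced set -o /UserVars/ESXiShellTimeOut -i 0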

So why is the timeout greyed out? I found the explanation in this VMware KB:

http://kb.vmware.com/kb/2004746

Long story short, this is default ESXi Shell behavior - the timeouts are greyed out as long as the ESXi Shell is enabled, SSH is enabled, or both!

Disabling both the ESXi Shell and SSH enables modification of the timeouts:




the end.

Sunday 19 July 2015

Flash Memory - SSD disks - research continued

Notes from Rex Walters' (Tintri) VMworld 2013 session: "Flash Storage Deep Dive"

Flash is changing storage, but it is very different from hard disk drives (HDD).

For a long time tape drives were the primary storage. Tape drives are very efficient at streaming data. Hard drives were revolutionary because of random access.

In the future flash will play the role of primary storage, with HDD as secondary storage.

Flash is fast if you use it correctly !

We have 50 years of experience in how to use hard disk drives (HDD).

NAND Flash & SSDs basics

Human beings learn how to use things by making abstractions.

Floating gate -> NAND flash -> tunnel injections

SLC - Single-Level Cell
MLC - Multi-Level Cell - measure how much current is flowing (more states per cell)

Flash is organised into pages of 4 KB or 8 KB.

Flash is not write-in-place - if you write data to an SSD you don't know where that data is physically located; only the SSD controller knows.

READ and WRITE are organised at page granularity.
ERASE is organised into "erasure blocks".

Inside every SSD we have a process called the Garbage Collector (GC), which scans all the blocks and tries to find blocks with a lot of dead pages (space to reclaim); the GC moves the 'live' pages somewhere else and erases the entire block.
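A quick back-of-the-envelope example (my own numbers, not from the session): if the GC has to relocate 3 still-live pages for every 1 page of new host data before it can erase a block, the flash absorbs 4 physical page writes per host write - a write amplification factor of (1 + 3) / 1 = 4 - which eats directly into the limited erase budget mentioned below.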

You can only do a finite number of ERASE cycles - hence wear-levelling.
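On Linux you can check how much of that erase budget a drive has already burned with smartmontools - a rough sketch, assuming /dev/sda is your SSD (the attribute name varies by vendor: Wear_Leveling_Count, Media_Wearout_Indicator, Percent_Lifetime_Remain etc.):

# smartctl -A /dev/sda | grep -i -E 'wear|lifetime'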

HDD or SSD - both are block storage devices.

The SCSI protocol was developed by Larry Boucher - before SCSI we accessed HDDs directly through C/H/S (Cylinder/Head/Sector) addressing.

The SCSI protocol assumes that an HDD is a contiguous series of blocks.

SCSI is a complex I/O protocol mainly using:
READ(offset, count) and WRITE(offset,count)

SCSI is for serial access (tape and HDD) - let's make an HDD look like a tape ;-)

A file is a contiguous sequence of bytes instead of blocks - we have byte granularity of access.

From the VMware perspective we are talking to a file (vmdk).

If we use VMFS, the metadata is on the host; if we talk to NFS, the metadata is on the array.

If you want to keep data safe you want to move it periodically; we store data with a CRC and every time we access the data we check the CRC. But what about a 1-year-old backup which was never restored? What is the chance that the data is corrupted?
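The same trick works by hand for cold backup files - a minimal sketch with made-up file names, which is essentially what array-side scrubbing automates (the first command records the fingerprint at backup time, the second re-verifies it whenever you like):

# sha1sum backup-2014.tar > backup-2014.tar.sha1
# sha1sum -c backup-2014.tar.sha1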

Other SSD vs. HDD differences/implications 

* Asymmetric read/write performance
* wear-levelling, garbage collection and the FTL - data is always in motion
* Parallel I/O channels with RAM buffers
* Major implementation variations with every new version of firmware
* Flash replaces known failure modes with completely new ones
* limited endurance (P/E cycles)
* all SSDs behave differently with different firmware
* read disturbance - if you read the same cell over and over again, the process can reprogram neighbouring cells

The Latency and IOPS for HDDs:





Average is reality - Best never happens outside the lab ;-)


For HDDs, IOPS are the bottleneck, but the bigger the blocks, the more throughput.



For SSDs, throughput is the bottleneck; the block size doesn't matter.

 

Storage Systems must evolve:

* align to SSD pages
* keep meta-data in flash
* break up large WRITES (allow interleaved READs)
* anticipate power-loss at any time (checkpoints)
* checksum data and meta-data references
* verify on read and scrub in the background (most data is 'cold' - if you only find corruption after 6 months it's too late, that's why we scrub more frequently)
* self-heal with multi-bit redundancy
* paranoid READs with de-dupe - use a strong cryptographic hash (SHA-1)

Conclusions:

* Understand how flash works: it's utterly different from rotating media
* Flash delivers spectacular performance in its place, but it fails in surprising new ways
* Design for its strengths and around its weaknesses - don't use it like an HDD!


Sunday 12 July 2015

Flash storage - SSD disks - NV-Memory - RESEARCH & NOTES

I've been focusing on flash storage recently. Below are my research notes from different sources.

1.) HotChips conference - notes from Jim Hardy's 'Flash Storage' presentations:

By 2020 we will have:

~ 35 ZB of data (according to CSC)
~ 50 billion things on the internet (according to Cisco)

* If CPU access were a heartbeat, DRAM access would be like walking a mile, and HDD access would be like cycling from San Francisco to Miami.

* When you put SSDs into a storage array you still have network/SAN fabric delay. If you put an SSD into a server you accelerate that server, but you cause some problems with coherency.

* Other Flash benefits:
- less power & cooling
- improved reliability: fewer things to fail
- floor space reduction
- lower licensing fees (consolidate servers)
- faster error recovery:
-- RAID rebuild
-- backup restore
-- snapshots

* Samsung went with 3D flash in 2013.

* NAND's scaling limits:
- too few electrons per gate
- needs constant shrinking for cost reduction
- 4 bits/cell is hard to make:
-- this may be the maximum possible
* other technologies will scale past NAND:
-- PCM, MRAM, RRAM, FRAM ...
-- not yet clear which one will win...

NAND doesn't do 'byte' writes, it does 'page' writes, and it doesn't overwrite in place. You have to erase first and then write the new data. When you move data from block to block you have to think about wear.

New NV-memories are better than NAND:

NAND:
* serial read
* erase before write
* block erase/page write
* slow writes
* inherent bit errors
* wear

New NV-memories:
* random read
* overwrite
* byte write
* fast write
* lower bit errors
* low/no wear

Future: Storage Class Memory

NV-memories won't cross below HDD $/GB!!!!
New NV-memories will require a new computing architecture.

2.) HotChips conference - Krishna Parat 'NAND Technologies' presentation:

NAND Flash key attributes:
* non-volatile memory
* READ access time of tens of microseconds
* WRITE/ERASE time of ~ milliseconds
* page programming
* block erase
* tens of thousands of W/E cycle endurance
* NAND has a simple cell and array structure

Program => Electrons stored on the Floating Gate -> High Vt

Erase => remove electrons from Floating Gate - Low Vt

Read => look for current through the cell at given gate bias

* Programming is by tunneling electrons through the Tunnel-ox by applying a high gate voltage and grounding well and source/drain.

* Erase is by tunneling electrons through the Tunnel-ox applying a high Well voltage and grounding the gate (P-Well - where electrons flow)

* Cell area has scaled per Moore's law predictions into the mid-20nm range without much difficulty.

* NAND cell scaling issues at 20nm: 'interference concerns' with wrapped cells.

Planar Floating gate cell with High-K/Metal gate successfully overcomes the scaling hurdle for 20nm and beyond.

* Program/Erase voltages ~20V !!

* Large voltage + small pitch = Reliability Risk
* At ~10nm WL-WL electric field ~ 10MV/inch !

2D NAND -> 3D NAND
* vertical NAND string
* conductive or dielectric storage node
* deposited poly-silicon channel
* the large footprint of the 3D cell requires quite a few layers to be stacked to achieve effective cell area scaling
* increased demands on process technology: very high aspect ratio etches & fills

Competing 3D cell structure options:
* vertical string vs. horizontal string
* vertical is more attractive for electrical properties
* horizontal string is more attractive for cell size
* either case will lead to increased block size

Summary:
* NAND flash has a simple array structure which has been highly amenable to scaling
* NAND flash leads the industry in scaling
* Lithography-induced scaling limits were overcome using advanced pitch reduction techniques.
* Interference issues were contained through the introduction of air-gaps at critical locations.
* Wrap cell limits were overcome with a planar Floating Gate cell using a High-K dielectric/Metal gate.
* 2D scaling can continue into the mid-to-low 10nm range; scaling beyond that can come from transitioning to 3D.

3.) HotChips conference - Amber Huffman 'PCIe Storage' presentation:

* NVM express:
PCIe SSDs are emerging in datacenters/enterprises, co-existing with SAS/SATA depending on the application.

* PCIe great interface for SSDs:
-- 1GB/s per lane (PCIe G3 x1)
-- 8GB/s per device (PCIe G3 x8) or more
-- low latency - platform adapter: 10microsec down to 3microsec
-- lower power
-- lower cost (?)
-- 40 Gen3 PCIe lanes off the CPU (80 in dual-socket systems)

* NVM Express is the interface architected from the ground up for NAND today and next-generation NV-memories.

NVM Express technical basics:
* all parameters for 4KB command in single 64B command
* support deep queues (64K commands per queue, up to 64K queues)
* support MSI-X and interrupt steering
* streamlined & simple command set optimized for NVM (13 required commands)
* optional features to address target segment of product in client or enterprise.
* enterprise: end-to-end data protection, reservation etc.
* client: autonomous power state transitions etc.
* designed to scale for next generation NVM agnostic to NVM type used. (all about queues !!)
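On a recent Linux box you can inspect those queues and capabilities yourself with the nvme-cli package - a sketch assuming an NVMe device shows up as /dev/nvme0:

# nvme list
# nvme id-ctrl /dev/nvme0
# nvme smart-log /dev/nvme0

The first command enumerates controllers and namespaces, id-ctrl dumps the controller's identify data (capabilities and limits), and smart-log shows wear, temperature and media errors.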

"Memory like" attributes possible with next generation NVM.New programming models are needed to take full advantage.

Recall:
* Transformation was needed for full benefits of multi-core CPU. Application and OS level changes required.
* To date, SSDs have used the legacy interface of HDDs, based on single slow rotating platters...
* SSDs are inherently parallel, and next-generation NVM approaches DRAM-like latencies.

4.) HotChips conference - Rado Danilak - Skyera

* using legacy interface (SAS/SATA) limits bandwidth/throughput possible from SSD
* PCIe storage is internal (drawback)

* 100x life amplification of latest generations of flash
* minimize writes to flash (compression & de-duplication)
* new RAID algorithms (classic RAID6 wears flash out ~3x quicker)
* new DSP/ECC
* adaptive reads & writes
* device physics manipulations

Lower voltage on SSD generates new problems - artifacts

Optimize the storage stack :
* System : compression, de-duplication, encryption in hardware to minimize WRITES to the flash
* RAID : must achieve better than RAID6 reliability with much fewer WR to flash
* Flash Controller : develop more sophisticated DSP and ECC algorithms
* FTL : flash physics manipulations to optimize for system-wide wear-levelling
* Adaptive RD & WR based on usage patterns
* filesystems are not optimized for flash

Why do we need a flash-aware filesystem?
We have plenty of filesystems to choose from, but they were written with HDDs in mind.

ext3/4 - no inline de-duplication, no compression, no snapshots, no clones etc.

ZFS - a feature-rich filesystem (compression, de-dupe, snapshots, clones etc.). ZFS writes data in a copy-on-write, log-like way; when the system becomes full, garbage collection must be run, and GC results in WRITE amplification. Don't penalize flash.
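For reference, the ZFS features mentioned above are just per-dataset properties - a quick sketch with a hypothetical pool/dataset name (inline dedup in ZFS is famously RAM-hungry, so use it with care):

# zfs set compression=lz4 tank/backups
# zfs set dedup=on tank/backups
# zfs snapshot tank/backups@friday
# zfs clone tank/backups@friday tank/backups-test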

Skyera developed its own filesystem for its All-Flash Array.

Why not use commodity hardware and differentiate in software?
 
To get 0.5 PB of capacity with an x86 system you need more than 1 rack!! If you build non-commodity hardware you can fit it in 1U!!! Much less hardware is needed to build non-commodity storage, which means lower cost in comparison to commodity x86 - that sounds counter-intuitive.

Doesn't using commodity hardware waste a lot of the value of the flash? If you put that much flash in there, don't you lose the balance with the networking bandwidth?

The biggest problem is how to deliver this flash performance to the servers. The network is the bottleneck; conventional iSCSI and FC are too slow.
PCIe people love the latency and bandwidth. Skyera introduced a 1U solution which has 96 PCIe lanes and can connect 24 to 96 servers to share the device. 80% of the market is shared storage (not PCIe cards or local HDDs). They have to close the networking gap - storage and networking become completely merged.



5.) HotChips conference - Kevin Rovett - Violin Memory

Design choices & tradeoffs

Flash devices are:

* READ 70 microsec, WRITE 1-2 millisec
* and then there is block ERASE: 5-10 millisec
* block failures
* die failures
* read bit errors, program errors, read disturbance
* parameters change with each new process update
* many different devices to choose from.

Summary :
- flash is not "slow DRAM"
- lots of insight is needed

What is important for the customer? Latency? Bandwidth? Retention?

Flash controller considerations:

* schedule I/O
-- layout blocks in devices based on expected usage and device access characteristics
-- device blocks are 8K or 16K usually not to spread WRITES across more devices.
-- avoid "WRITE cliff" effect managing I/O types and GC (garbage collection) accross multiple blocks and devices
-- traditionally all done in the so called "FTL : Flash Translation Layer"

Flash controller design: error handling
* Error correction
-- every READ has many bit errors, so you must use some form of forward error correction: BCH and/or LDPC (Low-Density Parity-Check codes)
-- BCH is die-area intensive and consumes Flash spare area
-- might not provide enough correction past 19nm
-- lowest latency
-- LDPC in conjunction with BCH provides a much better error rate
-- requires special access to the device
-- increases latency by as much as 4 times

Violin performance expectations:

* ultra low latency for RD and WR - 20-40 microsec (25x faster than HDD)
* almost unlimited IOPS - 500,000 to 1,000,000 in 3V (1000x improvement over HDD)
* high throughput - 3-5 GB/s in 3V (20x more than HDD)
* zero contention under load - no tuning - no workload separation

What about a large number of SSDs and the ambient temperature effect?
Flash doesn't mind if you raise the temperature as long as you keep it within spec! Actually, SSDs work better if you raise the temperature, but you really have to design the rest of the system to support 45-55C - very difficult to do.
This is more an FPGA problem than anything else.

In Violin, the airflow through the flash devices runs hotter and improves the retention of the SSDs - it sounds counter-intuitive.

6.) HotChips conference - Neil Vachharajami - Pure Storage

Why not just put SSDs into todays disk arrays?


* current software systems are optimized for disks (HDD)
* flash and disks are very different
* need storage arrays designed to leverage flash

7.) HotChips conference - David Flynn - Fusion-IO - Primary Data

Cloud => $/GB [durability]
Fusion-IO => $/IOPS [mutuality]

Where will we see growth?
FOBS - File and Object-Based Storage, software-defined and scale-out

Scale-up storage will fade.

Call to action:
* encryption (security)
* compression (capacity efficiency)
* erasure coding (capacity efficiency)
* hash calculation (capacity & data integrity)
These things need to go into the datapath!

* People are stuck with the current block abstraction, which forces the FTL mapping into the SSD.

* New memory-speed NVM to replace flash
* CPUs that deal with wide range of memory access delay

to be continued...
 

Friday 15 May 2015

Cisco UCS C-Series servers CIMC SNMP configuration and features.

SNMP was designed as a request/response protocol. A network-management system that wants to inquire about the condition of a managed device issues a GET command to retrieve an object from the agent on the managed device. If the object is one of several objects in a list or table, the network-management system can use the GETNEXT command to retrieve the next object.

The network-management system can also use SNMP to control the managed device by using the SET command to change the value of an object [WARNING! The SET command is NOT supported on C-Series servers]. If the managed device needs to notify the network-management station of some event, it can send a TRAP to pass the message to the NMS (Network Management Station).

The Cisco C-Series CIMC SNMP implementation supports SNMP versions 2 and 3. Using this protocol we can monitor hardware, e.g. CPU temperature.

SNMP is not enabled or configured on the C-Series CIMC by default. We have to enable and configure it:

* Login to CIMC using IP address or FQDN name:

Choose -> Admin TAB -> Communications Services -> SNMP TAB -> Check BOX 'SNMP Enabled' and enter appropriate values into the fields.



* Configure the IP address of your NMS (Network Management Station) which will receive TRAPs from the C-Series server CIMC:

Choose -> Admin TAB -> Communications Services -> SNMP TAB -> Trap Destinations TAB (Right Side of the pane) -> click Add button and enter appropriate values.

 
* We can also log in to the CIMC via SSH to check or set the SNMP configuration - see the sketch below.
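A minimal sketch of such an SSH session (the community string is a made-up example - check the exact option names with 'help' at the /snmp scope, as they differ slightly between CIMC releases):

Server# scope snmp
Server /snmp # set enabled yes
Server /snmp *# set community-str public
Server /snmp *# commit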

Under the hood of the CIMC there is a Linux OS running an snmpd daemon. This daemon can crash. Unfortunately, to get access to that Linux we have to involve Cisco TAC, which can enable the Linux shell using the so-called debug-plugin. I hope that Cisco will eventually open access to that shell, following the good example of Arista Networks.

We can still troubleshoot SNMP using the net-snmp-utils package on a Linux box. (I am using Fedora on my laptop.)

# yum -y install net-snmp\*

The net-snmp-utils package includes a number of useful applications for querying managed devices from the command line. The snmpwalk command uses SNMP GETNEXT commands to recursively browse the MIB tree below a given table. For example, the following command will return a list of all objects subordinate to 'cucsProcessorUnitTable':

# snmpwalk -v2c -t 15 -c public -mALL -M /home/user/Cisco_MIBs <CIMC IP ADDRESS> <OID or TEXT NAME OF OID>

-v2c : SNMP version 2c
-t 15 : timeout of 15 seconds
-c public : community string
-m ALL : load all MIB modules
-M <dir> : directory to search for MIB files
 

To download the Cisco UCS C-Series MIBs please go to:

ftp://ftp.cisco.com/pub/mibs/supportlists/ucs/ucs-C-supportlist.html 
 
How do we translate an OID to a text name? We can use the snmptranslate command.

If you can't remember the branch of the MIB tree that contains a particular object, use the command below:

#snmptranslate -IR -mALL -M /home/user/CISCO_C-Series_MIBs/UCS_Cseries2.0/ cucsProcessorUnitTable
CISCO-UNIFIED-COMPUTING-PROCESSOR-MIB::cucsProcessorUnitTable


To obtain the OID for a text name, use the command below:

#snmptranslate -On -mALL -M /home/user/CISCO_C-Series_MIBs/UCS_Cseries2.0/ CISCO-UNIFIED-COMPUTING-PROCESSOR-MIB::cucsProcessorUnitTable
.1.3.6.1.4.1.9.9.719.1.41.9 


To get the text name for a known OID, use this command:

#snmptranslate -mALL -M /home/user/CISCO_C-Series_MIBs/UCS_Cseries2.0/  .1.3.6.1.4.1.9.9.719.1.9.6
CISCO-UNIFIED-COMPUTING-COMPUTE-MIB::cucsComputeBoardTable

 
To print the branch of the tree for a particular object ID:

#snmptranslate -Tp -mALL -M /home/kb/Documents/CISCO_C-Series_MIBs/UCS_Cseries2.0/  .1.3.6.1.4.1.9.9.719.1.9.6
+--cucsComputeBoardTable(6)
   |
   +--cucsComputeBoardEntry(1)

...snip...snip...

Using the commands above we can find the values of a given object:

 

If you prefer a GUI, you can use a MIB browser, e.g.:

http://ireasoning.com/mibbrowser.shtml

or

https://www.manageengine.com/products/mibbrowser-free-tool/


To test your configuration you can send a test trap using the 'Send SNMP Test Trap' button in the CIMC web console. Or, if you prefer the hard and unsupported way, and Cisco TAC has opened the Linux shell via the debug-plugin, you can use ipmitool to change sensor thresholds (again, this is unsupported!!! but much more fun ;-)), e.g. for a FAN, and monitor the traps on your NMS.
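A hedged example of what that looks like from the debug shell (the sensor name below is hypothetical - take the real one from the sensor list first; raising the lower critical threshold above the current reading will fire a trap immediately):

# ipmitool sensor list | grep -i fan
# ipmitool sensor thresh FAN1_TACH lcr 20000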

  the end.
 

Friday 13 March 2015

Incremental restore of vmdk from Symantec Netbackup is corrupted.

During restore tests, one of the critical MSSQL VMs failed to restore from an incremental Symantec Netbackup backup. We were able to successfully restore the VM from a full backup.

Restoring from the incremental backup led to a corrupted state of Windows.

We checked for any leftover snapshots and cleaned up the VM directory.

The key thing was the VMtools version. The current ESXi build is:
ESXi 5.1.0 Update 1 build-1312873

The VMtools version installed inside the affected MSSQL VM was 9.0.0.15210 – esx5.1 GA (799733); the recommended VMtools version is 9.0.5.21789 - esx5.1p03 (1312873).

Please check: http://packages.vmware.com/tools/versions
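A quick way to check the Tools version straight from the ESXi shell (the VM id '42' is just an example - take the real Vmid from the first command):

# vim-cmd vmsvc/getallvms
# vim-cmd vmsvc/get.guest 42 | grep -i toolsVersion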

The virtual hardware version of MSSQL VM:

virtualHW.version = "8"

Based on http://blogs.vmware.com/vsphere/2013/02/clarification-on-zero-down-time-vmware-tools-uprade-in-vsphere-5-1.html

A zero-downtime VMtools upgrade is possible on virtual hardware version 9 or higher.

In this case a reboot was required to get all the new features of VMtools.


We scheduled a maintenance window with the MSSQL DBA, upgraded VMtools and tested the restore from the incremental backup - it finished SUCCESSFULLY and we could log in to the VM guest OS without any issue.