Monday 26 May 2014

Notes and thoughts after reading 'Storage Implementation in vSphere 5.0' by Mostafa Khalil.

I really enjoyed reading this book and I found a lot of troubleshooting tips & hints in it. Mostafa has long experience in VMware support and access to VMware developers, so we get great real-life examples with internals from the people who implement important parts of the code. I highly recommend this book, which dives deep into vSphere storage implementations.


1. How to list WWNN and WWPN of HBA adapter:

# esxcfg-mpath -b | grep WWNN | sed 's/.*fc//;s/Target.*$//'

2. How to list Adapter WWPN and Target WWPN:

#  esxcfg-mpath -b | grep WWNN | sed 's/.*fc//' | awk '{ print $1, " ",$4, " ",$5, " ",$6, " ",$9, " ",$10 }'
Adapter:   WWPN: 20:00:00:06:f6:30:b4:ed Target:   WWPN: 50:06:01:64:3d:e0:2e:a0
Adapter:   WWPN: 20:00:00:06:f6:30:b4:ee Target:   WWPN: 50:06:01:65:3d:e0:2e:a0


VMware recommends single-initiator zoning, but single-initiator/multiple-target zoning is also acceptable unless the storage vendor does not support it.

3. How to list HBA adapters:

# esxcfg-scsidevs -a
# esxcli storage core adapter list

HBA Name  Driver  Link State  UID                                   Description
--------  ------  ----------  ------------------------------------  ------------------------------------------------------
vmhba0    mptsas  link-n/a    sas.5588d09267486000                  (0:1:0.0) LSI Logic / Symbios Logic LSI1064E
vmhba1    fnic    link-up     fc.20000025b5011109:20000025b501a009  (0:10:0.0) Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba2    fnic    link-up     fc.20000025b5011109:20000025b501b109  (0:11:0.0) Cisco Systems Inc Cisco VIC FCoE HBA Driver


4. FCoE notes:

"FCoE runs directly on Ethernet (not on top of TCP or IP like iSCSI) as a Layer 3 protocol and cannot be routed. Based on that fact, both initiators and targets (native FCoE targets) must be on the same network. If native FC targets are accessible via FCoE switches, the latter must be on the same network as the FCoE initiators."

"Any link supporting DCBX must have LLDP enabled on both ends of the link for Transmit/Receive (Tx/Rx). If LLDP to be disabled on a port for either Rx or Tx; DCBX TLV within received LLDP frames are ignored. That is the reason why the NIC must be bound to the vSwitch. Frames are forwarded to the Datacenter Bridging Daemon (DCBD) to DCBX via the CDP vmkernel module. The latter does both CDP and LLDP."

"FCoE Initialization Protocol (FIP), Jumbo frame (actually baby jumbo frames, which are configured on the physical switch, are used to accomodate the FC frame payload which is 2112 bytes long), FCoE, and DCBX modules are enabled in ESXi 5 Software FCoE initiator by default."

"You can configure up to four (4) SW FCoE adapters on a single vSphere 5 host."

"The number assigned to the vmhba is a hint to whether it is Hardware or Software FCoE Adapter. vmhba numbers lower that 32 are assigned to Hardware (SCSI-related) Adapters, for example, SCSI HBA, RAID Controller, FC HBA, HW FCoE, and HW iSCSI HBA. vmhba numbers 32 and higher are assigned to Software Adapters and non-SCSI Adapters, for example, SW FCoE, SW iSCSI Adapters, IDE, SATA, and USB storage controllers."

5. How to list vml IDs and the devices to which they link:

# ls -la /dev/disks/
# ls -la /vmfs/devices/disks/

iSCSI notes:

6. How to check for iSCSI targets:

# esxcli iscsi adapter target list

7. How to identify the SW iSCSI initiator's VMkernel ports:

# esxcli iscsi logicalnetworkportal list 
 
8. How to dump SW iSCSI database:

# vmkiscsid --dump-db=<file-name> 
The dump includes the following sections:
* ISID: iSCSI Session ID information
* InitiatorNodes: iSCSI Initiator information
* Targets: iSCSI Targets information
* discovery: Target Discovery information
* ifaces: The iSCSI Network configuration including the vmnic and vmknic names 
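
For example (the file name is arbitrary):

# vmkiscsid --dump-db=/tmp/iscsi_db.txt
# cat /tmp/iscsi_db.txt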

PSA notes:

"Runtime Name, as the name indicates, does not persist between host reboots. This is due the possibility that any of the components that make up that name may change due to hardware or connectivity changes. For example, a host might have an additional HBA added or another HBA removed, which would change the number assumed by the HBA."

9. How to list paths to device:

# esxcli storage nmp path list -d naa.123123123 | grep fc 

If you are using PowerPath you can run:

# /opt/emc/powerpath/bin/powermt display dev=all
 
10. How to identify NAA ID using the vml ID:

# esxcfg-scsidevs -l -d vml.123123 | grep Display

VMFS notes:

11. /vmfs/devices is a symbolic link to /dev on ESXi 5.x 
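
A quick way to verify the link:

# ls -ld /dev /vmfs/devices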

12. How to re-create a partition table using partedUtil:

a. a healthy GPT VMFS partition table looks like this (the line after "gpt" is the disk geometry: cylinders, heads, sectors per track, total sectors; the partition line shows partition number, start sector, end sector, type GUID, type, and attribute):

# partedUtil getptbl /dev/disks/naa.60000970000292600926533030313237

gpt
31392 255 63 504322560
1 2048 504322047 AA31E02A400F11DB9590000C2911D1B8 vmfs 0


b. this is how the partedUtil output looks when the partition has been deleted. In this case you will notice after a rescan that the affected LUN disappears from the VI Client.

# partedUtil getptbl /dev/disks/naa.60000970000292600926533030313237

gpt
31392 255 63 504322560


c. list partition type GUIDs:

# partedUtil showGuids
 Partition Type       GUID
 vmfs                 AA31E02A400F11DB9590000C2911D1B8
 vmkDiagnostic        9D27538040AD11DBBF97000C2911D1B8
 VMware Reserved      9198EFFC31C011DB8F78000C2911D1B8
 Basic Data           EBD0A0A2B9E5443387C068B6B72699C7
 Linux Swap           0657FD6DA4AB43C484E50933C84B4F4F
 Linux Lvm            E6D6D379F50744C2A23C238F2A3DF928
 Linux Raid           A19D880F05FC4D3BA006743F0F84911E
 Efi System           C12A7328F81F11D2BA4B00A0C93EC93B
 Microsoft Reserved   E3C9E3160B5C4DB8817DF92DF00215AE
 Unused Entry         00000000000000000000000000000000  

 
d. get the usable sectors for the affected LUN:

# partedUtil getUsableSectors /dev/disks/naa.60000970000292600926533030313237
34 504322526

e. create a new VMFS partition table:

# partedUtil setptbl "/dev/disks/naa.60000970000292600926533030313237" gpt "1 2048 504322526 AA31E02A400F11DB9590000C2911D1B8 0"
gpt
0 0 0 0
1 2048 504322526 AA31E02A400F11DB9590000C2911D1B8 0


f. check that the partition was created:

# partedUtil getptbl "/dev/disks/naa.60000970000292600926533030313237"
gpt
31392 255 63 504322560
1 2048 504322526 AA31E02A400F11DB9590000C2911D1B8 vmfs 0


g. rescan datastores:

# esxcfg-rescan -A
or
# vmkfstools -V

h. check /var/log/vmkernel.log for potential issues:

# tail -20 /var/log/vmkernel.log

i. refresh the VI Client storage/datastore pane and the affected LUN should reappear

j. in case you get the error "The primary GPT table is corrupt, but the backup appears OK, so that will be used", run:

# partedUtil fix /dev/disks/naa.60000970000292600926533030313237

VMware does not offer data recovery services. 
Please see VMware KB: http://kb.vmware.com/kb/1015413


Virtual Disks and RDMs:

PVSCSI limitations:

- if you hot add or hot remove a virtual disk attached to a PVSCSI controller, you must rescan the SCSI bus from within the GOS (see the sketch after this list)
- if the virtual disks attached to the PVSCSI controller have snapshots, they will not benefit from the performance improvement
- if the ESXi host memory is overcommitted, the VM does not benefit from the PVSCSI performance improvement
- PVSCSI controllers are not supported as the GOS boot device
- MSCS clusters are not supported with PVSCSI.
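
For a Linux GOS the rescan mentioned in the first point could look like this (a sketch only - host0 is a placeholder, pick the SCSI host that corresponds to the PVSCSI controller; on Windows use "Rescan Disks" in Disk Management):

# echo "- - -" > /sys/class/scsi_host/host0/scan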

When you create a VM snapshot, the parent disk remains unmodified. When the VM is powered on, the parent disk is opened with Read-Only locks. This is the same thing done by VADP (vStorage APIs for Data Protection) when it backs up a virtual disk while the VM is running. This allows the backup software to copy the parent disk, since the Read-Only lock allows multiple readers to access and open the parent virtual disk for reads.

.vmdk - virtual disk - the file without the -flat suffix is the descriptor file; the file with -flat is the extent file containing the data. The file with -delta is a delta disk extent where new data is written after a snapshot is taken; its type is vmfsSparse

.vmsn - VM snapshot file - this is the actual snapshot file, which is the state of the VM configuration. It combines the original unmodified content of both the vmx and vmxf files. If the VM was powered on at the time of taking the snapshot and I chose to take a snapshot of the VM's memory, this file would include that as well as the CPU state.
  
.vmsd - virtual machine snapshot dictionary - it defines the snapshot hierarchy. This file is blank until a snapshot is taken

.vmx - virtual machine configuration file - describes the VM structure and virtual hardware

.vmxf - virtual machine foundry file - holds information used by the vSphere Client when it connects to the ESXi host directly. This is a subset of the information stored in the vCenter database.

Perennial Reservation:

Having MSCS cluster nodes spread over several ESXi hosts necessitates the use of passthrough RDMs, which are shared among all hosts on which a relevant cluster node will run. As a result, each of these hosts has some RDMs reserved, whereas the remaining RDMs are reserved by the other hosts. At boot time, the LUN discovery and device claiming processes require a response from each LUN. Such a response takes much longer for LUNs reserved by other hosts. This results in an excessively prolonged boot time for all hosts with this configuration. The same issue might also affect the time for rescan operations to complete.

Perennial reservation is a device property that makes it easier for an ESXi host to recognize whether a given LUN is reserved perennially by another host. At boot time or upon rescanning, the host does not wait for a response from a LUN in that state. This improves boot and rescan times on ESXi 5 hosts sharing MSCS LUNs. (This property cannot be set via Host Profiles in ESXi 5.0.)

# esxcli storage core device setconfig -d naa.123123 --perennially-reserved=true
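
To verify the setting afterwards (naa.123123 is the same placeholder as above; the output should contain a line like "Is Perennially Reserved: true"):

# esxcli storage core device list -d naa.123123 | grep -i perennially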

Tuesday 20 May 2014

Troubleshooting vSphere Storage - Mike Preston - REVIEW

I have just finished reading "Troubleshooting vSphere Storage" by Mike Preston. My first impression is that this book only scratches the surface of the huge and complex topic of storage troubleshooting in a VMware infrastructure. In general I recommend it because it is a "one evening read" and cheap. I found some inaccurate information and I believe it will be corrected in a potential second edition. I learnt or refreshed some topics; it seems that the author has some real experience, but he did not delve deep enough.

CONS:

I don't agree with the author that NFS is a distributed file system like VMFS.
The sentence "NFS, like VMFS, is also a distributed file system and has been around for nearly 20 years" is inaccurate IMHO, or at least too simplified.

Please see: 
http://en.wikipedia.org/wiki/Network_File_System.

NFS stands for Network File System, but it is a protocol - a distributed file system PROTOCOL.
You cannot format your disk with NFS. You can have, e.g., a Nexenta DIY storage box with the ZFS filesystem and export a volume to an ESXi host via the NFS protocol. (BTW: NFS and ZFS were both invented by Sun Microsystems.) You can also export an NTFS volume from a Windows machine or an ext4 filesystem from Linux via the NFS protocol to ESXi hosts.

Maybe I'm too pedantic, but the command:

cat /var/log/vmkernel.log | grep SCSI | grep -i Failed

can be replaced by:

egrep SCSI /var/log/vmkernel.log | egrep -i Failed

you save one process ;-) and it's less typing.

You can use short links to VMware KB articles instead of a long copy/paste:
http://kb.vmware.com/kb/289902

Sentence: "In order for ESXi to properly host VMs on NFS server, we need to have a minimum of read/write permission." is inaccurate.
You must have set option "no_root_squash" and read/write permission.If you have only read/write permission you can still create the data store, but you will not be able to create any virtual machines on it.

Please see:
http://www.vmware.com/files/pdf/VMware_NFS_BestPractices_WP_EN.pdf
http://kb.vmware.com/kb/1005948
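
For illustration, an /etc/exports entry on a Linux NFS server would look something like this (the path and subnet are hypothetical):

/export/vmware-ds  192.168.1.0/24(rw,no_root_squash,sync)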

Sentence: "Take for instance RAID5. When we issue one single write I/O to a group of drives in a RAID5 array, due to parity and spanning, the array actually incurs four writes or four IOPs." is inaccurate.

For a single write operation on RAID5, controller performs two disk reads and two disk writes for every WRITE operations, and WRITE penalty is 4.
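
A quick worked example of what this penalty means for sizing (numbers picked arbitrarily): a workload of 1000 front-end IOPS with a 70/30 read/write mix on RAID5 generates roughly 700 + (300 x 4) = 1900 back-end IOPS on the disks.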

PROS:

I like the information about vCenter Storage Views (Reports and Maps).
"Storage View reports are a great way to view just how much snapshot space any VMs are using. That being said, what is displayed by default is not all that has been collected. If we right-click along the column header, we can see that there are many other columns that can be added to and removed from the report, which can be resized, reordered, and sorted in any manner we prefer."

"You can filter any column using the filter box on the top right-hand corner.Simply select the column you would like to filter from selecting the down arrow and type in your expression in the filter box"

"Tip: Most of the storage logging in ESXi is enabled  by default; however, there are some things that are not. Look for parameters starting with SCSI.login the hosts advanced settings to see what is enabled and what is not. Simply toggling these values 1 or 0 turns them on and off."

vCenter Server Storage filters:

"Apart from Host Rescan Filter, the other filters do just as they describe; filter out LUNs. Although these filters are in place to protect us from inadvertently destroying or corrupting data, sometimes they introduce troubles when we actually have a legitimate reason to see the LUNs taht are being filtered. Take for instance setting up a MSCS inside vSphere. In order to do so, we need to map the same raw LUN or RDM to all virtual machines participating in the cluster. With RDM Filter running under its default configuration, only the host which is running the first VM to see the RDM would see this storage, and storage filters would filter that RDM on all other hosts.This is certainly a valid scenario where we would need to disable the RDN Filter in order to gain visibility to the LUN from other VMs on other hosts."

Open the VI Client -> go to the Administration drop-down menu -> vCenter Server Settings -> Advanced Settings -> add the appropriate key, e.g. config.vpxd.filter.rdmFilter with the value false

"Tip: The maximum LUN ID used in a storage rescan is configurable by an advanced setting within ESXi called Disk.MaxLun. This setting is located in the Advanced Settings section of the host's Configuration tab and must be set on a per-host basis. While lowering the default value can increase the speed of rescans and boot up, there is a risk that an administrator will provision a LUN with ID greater than what they have set in Disk.MaxLun. ESXi will never see or discover that LUN. This will occur even if you are under the configuration maximum of  256 LUNs per host. If we experience issues with discovering block storage and can't access the LUN initially, Disk.MaxLun is a great setting to check."

# esxcli system settings advanced list -o /Disk/MaxLUN
   Path: /Disk/MaxLUN
   Type: integer
   Int Value: 256
   Default Int Value: 256
   Min Value: 1
   Max Value: 256
   String Value:
   Default String Value:
   Valid Characters:
   Description: Maximum LUN id (N+1) that is scanned on a target
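
The value can be changed with the same esxcli namespace (64 is just an example value; remember it is set per host):

# esxcli system settings advanced set -o /Disk/MaxLUN -i 64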


Table of required ports for NFS and iSCSI:

NFS    111/udp,tcp
NFS    2049/udp,tcp  <- ESXi 5.1.0 and above
iSCSI  3260/tcp
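
A simple way to check that these ports are reachable from the ESXi host is netcat, if it is present in your build (1.2.3.4 is a placeholder for the array IP):

# nc -z 1.2.3.4 3260
# nc -z 1.2.3.4 2049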

Enable verbose iSCSI logging:

# vmkiscsid -x "insert into internal (key,value) VALUES ('option.LogLevel'.'999');"

Disable verbose iSCSI logging:

# vmkiscsid -x "delete from internal where key='option.LogLevel';"

Enable verbose NFS logging:

# esxcfg-advcfg -s 1 /NFS/LogNfsStat3

Disable verbose NFS logging:

# esxcfg-advcfg -s 0 /NFS/LogNfsStat3

List, enable, and disable verbose NFS logging using esxcli:

# esxcli system settings advanced list -o /NFS/LogNfsStat3
# esxcli system settings advanced set -o /NFS/LogNfsStat3 -i 1
# esxcli system settings advanced set -o /NFS/LogNfsStat3 -i 0

Check for 'pending reservations'; I use the -s switch for more readable output:

# esxcfg-info -s | egrep -B16 -i "pending reservations"

# vmkfstools -L lunreset /vmfs/devices/disks/naa.xxxyyyzzz

VMware KB on how to change the Windows OS PVSCSI queue depth:
http://kb.vmware.com/kb/1017423

"In order to confirm that we have Jumbo Frames set up properly in our environment, we can pass an MTU size to our vmkping command that we send to the storage array[...] using the stabdard Jumbo Frames MTU of 9000 and header size of 28 bytes, we would use the following command to test our configuration:"

# vmkping -s 8972 -d 1.2.3.4
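
The 8972 comes from the 9000-byte MTU minus the 20-byte IP header and the 8-byte ICMP header (9000 - 28 = 8972); if the reply comes back with -d (don't fragment) set, Jumbo Frames work end to end.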

Verbose logging for Fibre Channel:

Advanced Settings:

# esxcli system settings advanced list -o /Scsi/LogCmdErrors
# esxcli system settings advanced list -o /Scsi/LogScsiAborts
# esxcli system settings advanced set -o /Scsi/LogScsiAborts -i 1

Scsi.LogCmdErrors
Scsi.LogScsiAborts <- for ESXi 5.0.0 (build-821926) this setting is available only from the GUI

SUMMARY:

I learnt and refreshed a lot of information in the storage troubleshooting area. My next reads will be "Storage Implementation in vSphere 5.0" by Mostafa Khalil and "vSphere High Performance Cookbook" by Prasenjit Sarkar.

Wednesday 14 May 2014

How to check when VMFS volume was created?

The VMFS UUID includes, at the beginning, the epoch time when the VMFS was created.

The first hex segment of each UUID below (e.g. 51372ebf) is the epoch time of VMFS creation.

Run this command on the ESXi console to list the VMFS UUIDs:

# ls -l /vmfs/volumes/ | awk -F '->' '/^l/ {print $2}' 

51788628-436b3484-80b4-4403a74a651f
5178864e-f6ae0d72-ea05-4403a74a651f
5170ffe3-7dbcd048-d989-4403a74a6429
5170ffd1-956e7d10-2c5a-4403a74a5699
51372ebf-d37fb174-650f-4403a74a5699
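
To pull out just the epoch field from the listing above, a small sketch (the output is the hex creation time of each volume):

# ls -l /vmfs/volumes/ | awk -F '->' '/^l/ {print $2}' | cut -d- -f1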
  
Unfortunately, ESXi 5.0 ships with an older version of the date command:

# date --help
BusyBox v1.9.1-VMware-visor-8630 (2012-01-06 01:09:05 PST) multi-call binary

Usage: date [OPTION]... [MMDDhhmm[[CC]YY][.ss]] [+FORMAT]

Display current time in the given FORMAT, or set system date

Options:
        -R              Outputs RFC-822 compliant date string
        -d STRING       Displays time described by STRING, not 'now'
        -I[TIMESPEC]    Outputs an ISO-8601 compliant date/time string
                        TIMESPEC='date' (or missing) for date only,
                        'hours', 'minutes', or 'seconds' for date and
                        time to the indicated precision
        -D hint         Use 'hint' as date format, via strptime()
        -s STRING       Sets time described by STRING
        -r FILE         Displays the last modification time of FILE
        -u              Prints or sets Coordinated Universal Time



We need a Linux box to check the time of VMFS creation:

# date -u -d @$((0x51372ebf))

Since version 5.1 we have a newer date command:

# date --help
BusyBox v1.19.0 (2012-02-29 14:20:08 PST) multi-call binary.

Usage: date [OPTIONS] [+FMT] [TIME]

Display time (using +FMT), or set time

        [-s,--set] TIME Set time to TIME
        -u,--utc        Work in UTC (don't convert to local time)
        -R,--rfc-2822   Output RFC-2822 compliant date string
        -I[SPEC]        Output ISO-8601 compliant date string
                        SPEC='date' (default) for date only,
                        'hours', 'minutes', or 'seconds' for date and
                        time to the indicated precision
        -r,--reference FILE     Display last modification time of FILE
        -d,--date TIME  Display TIME, not 'now'
        -D FMT          Use FMT for -d TIME conversion

Recognized TIME formats:
        hh:mm[:ss]
        [YYYY.]MM.DD-hh:mm[:ss]
        YYYY-MM-DD hh:mm[:ss]
        [[[[[YY]YY]MM]DD]hh]mm[.ss]



On vSphere 5.1 and above, to check the creation time we can run the following command from the ESXi console:


# date -u -d @$(($(awk 'BEGIN {printf "%0d\n",0x51372ebf;print""}')))
Wed Mar  6 11:55:43 UTC 2013

UPDATE: To check the creation time of a VMFS volume we can use vmkfstools; it seems to be the easiest way:

# vmkfstools -Ph -v10 /vmfs/volumes/<datastore name>

# vmkfstools -Ph -v10 /vmfs/volumes/Customer-1 | egrep Creation
Volume Creation Time: Thu Apr 25 01:26:00 2013


Now we know that this VMFS was created in March 2013. This information can be very useful: if we suspect some VMFS corruption, it is worth checking when the VMFS volume was created, because sometimes someone may have overwritten it.