Tuesday 13 January 2015

In vSphere ESXi vmkernel.log we see messages every 30 minutes: NMP: nmp_ThrottleLogForDevice:2331 Cmd 0x85



In the vmkernel.log file the following ScsiDeviceIO errors are reported every 30 minutes:

CPUxx:32857)NMP: nmp_ThrottleLogForDevice:2331 Cmd 0x85 (0x412fc2d41c40, 34430) to dev "naa.xxxxxxxxxxxxxxxxx4a4" on path "vmhba0:C2:T1:L0" Failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:None
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x85, CmdSN 0x2ba4 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx4a4" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x4d, CmdSN 0x2ba5 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx4a4" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x1a, CmdSN 0x2ba6 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx4a4" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0

CPUxx:32857)NMP: nmp_ThrottleLogForDevice:2331 Cmd 0x85 (0x412fc2d41c40, 34430) to dev "naa.xxxxxxxxxxxxxxxxx21a" on path "vmhba0:C2:T0:L0" Failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:None
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x85, CmdSN 0x2ba7 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx21a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x4d, CmdSN 0x2ba8 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx21a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
CPUxx:32800)ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x1a, CmdSN 0x2ba9 from world 34430 to dev "naa.xxxxxxxxxxxxxxxxx21a" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0


These messages above can be safely ignored. According to the official T10 documentation, operation code 0x85 is the ATA PASS-THROUGH(16) command:


The underlying SCSI drives do not support command 0x85; in response we receive the SCSI sense data 0x5 0x20 0x0, which means Invalid Command (for example, for Seagate SCSI drives see http://seagate.com/support/disc/manuals/scsi/38479j.pdf ).

The SCSI commands 0x4d (Log Sense) and 0x1a (Mode Sense(6)) return Invalid Command as well.
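If you want to confirm that these harmless 0x85 failures are what is filling your log, a quick grep does the job. A minimal sketch, using two shortened sample lines in a temporary file instead of the real /var/log/vmkernel.log (the naa device name is a placeholder):

```shell
# Sketch: count the harmless ATA pass-through failures (opcode 0x85 with
# sense data 0x5 0x20 0x0). The sample file stands in for the real
# /var/log/vmkernel.log; the naa.xxx device name is a placeholder.
cat > /tmp/vmkernel.sample.log <<'EOF'
NMP: nmp_ThrottleLogForDevice:2331 Cmd 0x85 (0x412fc2d41c40, 34430) to dev "naa.xxx" Failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:None
ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x85, CmdSN 0x2ba4 from world 34430 to dev "naa.xxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
ScsiDeviceIO: 2337: Cmd(0x412fc2d41c40) 0x4d, CmdSN 0x2ba5 from world 34430 to dev "naa.xxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
EOF
# only the two 0x85 lines match, not the 0x4d one
grep -c "0x85.*0x5 0x20 0x0" /tmp/vmkernel.sample.log
```

On a live host, point grep at /var/log/vmkernel.log and compare the timestamps to confirm the 30-minute cadence.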

Similar behaviour is described in http://kb.vmware.com/kb/1036874

IMHO it is not the best VMware KB article, but it at least explains what causes the issue. It focuses on a local HP SmartArray controller, but it is still valid for LSI controllers, and the SCSI sense code comes from the disk.

===
In ESXi 5.1/5.5, you may see similar errors in the syslog.log file every 30 minutes. For example:
cpu60:16444)<4>hpsa 0000:03:00.0: Device:C4:B0:T0:L1 Command:0x85 CC:05/20/00 Illegal Request.
cpu42:1071571)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4125c3535000, 17495) to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba4:C0:T0:L1" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
cpu42:1071571)ScsiDeviceIO: 2329: Cmd(0x4125c3535000) 0x85, CmdSN 0xebb from world 17495 to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

These messages can be safely ignored. They appear as a result of an ATA Pass-through command which cannot be interpreted by the controller.
===

KUDOS to Jeremy Chadwick from the FreeBSD mailing list

Thursday 1 January 2015

Why hot removing a running Cisco UCS blade from the chassis to test vSphere HA is a bad idea.

I had a rather unpleasant discussion with a 'Senior Admin' about simulating blade failure by hot removing a blade from the chassis. I consider it a bad idea. First of all, hot removing blades from the chassis is a very poor simulation of blade failure. The blade hardware component with the highest probability of failing is a memory DIMM module. This is because we have more and more DIMM modules per blade, with bigger and bigger capacity per module, and our world has entropy built in by design... I hardly believe that blades escaping from their slots is a realistic hardware failure...

The main complaint of the 'Senior Admin' was that after running the test 'simulating' blade failure by plugging a blade out of and back into its chassis slot, one of the 8 blades went into an inoperable state and required one more full power-cycle - basically, it required re-acknowledgement from UCSM. The 'Senior Admin' believed that blades are hot-swappable and that this should never have happened, particularly because this blade was going into production. Yes, blades from all mainstream vendors are hot-swappable or hot-pluggable, but only from the chassis perspective: it means that to remove or add a blade you don't have to power down the entire chassis. It is just common sense to power down a blade before removing it, to avoid any data corruption in the OS which is using this blade server and to avoid human error!!! I would think twice if some 'Senior Admin' asked me to remove a running blade in a datacenter with thousands of blades. Dear 'Senior', believe me, they all look the same with blinking LEDs ;-) So even when you 'simulate' the failure - simply power off and HA will kick in.
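To be concrete, the graceful variant is just two commands on the ESXi host. This is a hedged sketch shown as a dry run - echo prints the commands instead of executing them, so it is safe to paste anywhere; on a real host, drop the echo and evacuate or power off the VMs first:

```shell
# Dry run of a clean blade shutdown from the ESXi side (remove the echos
# on a real host). Entering maintenance mode will wait until the VMs are gone.
echo esxcli system maintenanceMode set --enable true
echo esxcli system shutdown poweroff --delay 10 --reason "planned blade removal"
```

Only after the host reports powered off in UCSM would I touch the ejector levers.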

Another aspect of this test is that we live in times when hardware is cheap and people are expensive (I know, nobody believes me ;-)), so it is not unusual that one sysadmin manages 10k servers. Please see the link below:

http://highscalability.com/blog/2013/11/19/we-finally-cracked-the-10k-problem-this-time-for-managing-se.html 
  
Dear 'Senior Admin', walking into the datacenter to remove blades for 'failure simulation' before we go to production is aaa... a stupid idea?


All vendors require powering off the blade before replacing/removing it from the chassis:

1.) Removing HP BL460cG8



2.) I love IBM documentation - how to remove an IBM blade:
https://publib.boulder.ibm.com/infocenter/bladectr/documentation/index.jsp?topic=/com.ibm.bladecenter.hs22.doc/dw1iu_t_removing_the_blade_server.html 


3.) How to remove a Cisco B200M3 from the chassis:



Hmm... but we know that UCS is slightly different: all the 'brain power' of UCS is in the Fabric Interconnects, which is why we need to do some tasks in UCSM.

RTFM : http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-1/b_UCSM_GUI_Configuration_Guide_2_1.html

It seems that 'Senior Admin' missed one step ;-)





BTW Cisco documentation is a little bit tricky. You can safely ignore the section 'Before You Begin'...



But what if you remove a server without decommissioning it first? Well, you have to follow this procedure:




I would say they were lucky that the other 7 blades didn't require re-acknowledgement.

Okay, stop ranting - nobody is perfect, not even a 'Senior Admin' ;-) Here is what I suggest before putting a blade into production:

1.) RTFM!!! If you really love removing blades from the chassis, first read the documentation on how to do it correctly.

2.) Burn in the blade servers before going into production - all vendors have diagnostics tools, please run them, e.g. for Cisco blades:

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/ucs_diagnostics/b_UCS_Blade_Server_Diagnostics_User_Guide/b_UCS_Blade_Server_Diagnostics_User_Guide_chapter_01.html

3.) Before putting the server into production, run a memory test for at least 72 hours using e.g. memtest86+:
http://www.memtest.org/ 

4.) There is a much more elegant procedure to simulate blade failure: we can trigger a PSOD on the ESXi host in a controlled way. In ~95% of all blade hardware failure cases - e.g. DIMM failure, motherboard failure, CPU failure etc. - the ESXi host will manifest a Purple Screen Of Death (PSOD). You do not need any mechanical procedure of plugging a running blade out and back in, wasting your time in the unhealthy datacenter environment.

a.)    Connect to the ESXi host via SSH

b.)    Run from the command line:


    ~ # vsish -e set /reliability/crashMe/Panic 1

 
c.)    On the KVM console you will see a PSOD like the one below:



d.) Wait for the HA event and reboot the blade, or you can use the coolness of Service Profiles if you boot from SAN:

  • You can put the affected ESXi host into Maintenance Mode (if you can) and then power it down.
  • Disassociate the Service Profile from the affected blade (UCSM will attempt to gracefully shut down or power off the blade).
  • Associate the Service Profile of the affected blade with a spare blade - it takes around 5 minutes to spin up the new blade server in UCSM.
  • Boot the server, exit from Maintenance Mode or reconnect the host in vCenter, and the host pops up automatically in the vSphere cluster.
  • You have the same ESXi host on different hardware - that's the beauty of the statelessness of UCS blades.
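For completeness, the disassociate/associate steps above map to a handful of UCSM CLI commands. The sketch below is a dry run (echo only) with made-up names - ESXi-01 for the service profile and chassis 1 / slot 8 for the spare blade; check the UCSM CLI configuration guide for your version before running this for real:

```shell
# Hedged dry run of the UCSM CLI steps; the profile name and chassis/slot
# are hypothetical. Remove the echos and run these in an SSH session to
# the Fabric Interconnect instead.
echo scope org /
echo scope service-profile ESXi-01
echo disassociate
echo commit-buffer
echo associate server 1/8
echo commit-buffer
```

Nothing happens in UCSM until commit-buffer, which is handy when you want to review a step before it fires.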


the end.