Thursday 1 January 2015

Why hot-removing a running Cisco UCS blade from the chassis to test vSphere HA is a bad idea.

I had a rather unpleasant discussion with a 'Senior Admin' about simulating blade failure by hot-removing a blade from the chassis. I consider it a bad idea. First of all, hot-removing a blade from the chassis is a very poor simulation of blade failure. The hardware component with the highest probability of failing is a memory DIMM module: we keep putting more and more DIMMs into a blade, with bigger and bigger capacity per module, and our world has entropy built in by design... I find it hard to believe in blades escaping from their slots as a hardware failure...

The main complaint of the 'Senior Admin' was that after running the test 'simulating' blade failure by pulling a blade out of the chassis slot and plugging it back in, one out of eight blades went into an inoperable state and required one more full power cycle - basically it required re-acknowledgement from UCSM. The 'Senior Admin' believed that blades are hot-swappable and that this should never have happened, particularly because this blade was going into production. Yes, blades from all mainstream vendors are hot-swappable or hot-pluggable, but only from the chassis perspective: it means that to remove a blade or add a new one you don't have to power down the entire chassis. It is just common sense to power the blade down before removing it, to avoid data corruption in the OS running on that blade server and to avoid human error!!! I would think twice if some 'Senior Admin' asked me to remove a running blade in a datacenter with thousands of blades. Dear 'Senior', believe me, they all look the same with their blinking LEDs ;-) So even when you 'simulate' the failure - just power the blade off and HA will kick in.
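If the point of the exercise really is "what happens to the cluster when a blade dies", you can get the same effect from UCSM without touching the hardware: power the blade off and let HA do its job. A minimal sketch from the UCSM CLI - the service profile name ESX-HOST-01 is made up for illustration, and you should check the CLI guide for your UCSM version (and whether 'power down' is treated as a hard power-off or a graceful shutdown - for an HA test you want the hard variant) before relying on it:

    UCS-A# scope org /
    UCS-A /org # scope service-profile ESX-HOST-01
    UCS-A /org/service-profile # power down
    UCS-A /org/service-profile # commit-buffer

The same thing is a couple of clicks in the UCSM GUI, and nobody has to walk into the datacenter.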

Another aspect of this test: we live in times when hardware is cheap and people are expensive (I know, nobody believes me ;-)), and it is not unusual for one sysadmin to manage 10k servers. Please see the link below:

http://highscalability.com/blog/2013/11/19/we-finally-cracked-the-10k-problem-this-time-for-managing-se.html 
  
Dear 'Senior Admin', walking into the datacenter to remove blades for a 'failure simulation' before we go into production is a... stupid idea?


All vendors require you to power off the blade before replacing/removing it from the chassis:

1.) Removing HP BL460cG8



2.) I love IBM documentation - how to remove an IBM blade:
https://publib.boulder.ibm.com/infocenter/bladectr/documentation/index.jsp?topic=/com.ibm.bladecenter.hs22.doc/dw1iu_t_removing_the_blade_server.html 


3.) How to remove a Cisco B200 M3 from the chassis:



Hmm... but we know that UCS is slightly different - all the 'brain power' of UCS lives in the Fabric Interconnects, which is why we also need to do some tasks in UCSM.

RTFM: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/gui/config/guide/2-1/b_UCSM_GUI_Configuration_Guide_2_1.html

It seems that the 'Senior Admin' missed one step - decommissioning the server before physically removing it ;-)
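Roughly, that step looks like this from the UCSM CLI - the chassis/slot numbers 1/8 are made up for illustration, and the GUI equivalent is the server's Decommission maintenance action:

    UCS-A# decommission server 1/8
    UCS-A# commit-buffer

Only after UCSM reports the server as decommissioned is it reasonable to physically pull the blade.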





BTW, Cisco documentation is a little bit tricky - you can 'safely' ignore the section 'Before You Begin'...



But what if you remove the server without decommissioning it first? Well, then you have to follow this procedure:
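In UCSM terms this typically boils down to re-acknowledging the slot once the blade is back, so that discovery runs again - which is exactly the extra power cycle and re-acknowledgement the 'Senior Admin' was complaining about. A hedged sketch from the UCSM CLI, again with made-up chassis/slot numbers:

    UCS-A# acknowledge slot 1/8
    UCS-A# commit-buffer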




I would say they were lucky that the other 7 blades didn't require re-acknowledgement.

Okay, stop ranting - nobody is perfect, not even a 'Senior Admin' ;-) Here is what I suggest before putting a blade into production:

1.) RTFM!!! If you really love removing blades from the chassis, read the documentation first so you do it correctly.

2.) Before going into production, burn in the blade servers - all vendors have a diagnostics tool, so please run it, e.g. for Cisco blades:

http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/sw/ucs_diagnostics/b_UCS_Blade_Server_Diagnostics_User_Guide/b_UCS_Blade_Server_Diagnostics_User_Guide_chapter_01.html

3.) Before putting a server into production, run a memory test for at least 72 hours, using e.g. memtest86+:
http://www.memtest.org/ 

4.) We have a much more elegant procedure to simulate blade failure: triggering a PSOD on the ESXi host in a controlled way. In ~95% of all blade hardware failures (e.g. DIMM failure, motherboard failure, CPU failure, etc.) the ESXi host will manifest the problem as a Purple Screen Of Death (PSOD). You don't need any mechanical plugging-out/plugging-in of a running blade, and you don't waste your time in the unhealthy datacenter environment.

a.) Connect to the ESXi host via SSH.

b.) Run from the command line:


    ~ # vsish -e set /reliability/crashMe/Panic 1

 
c.) On the KVM console you will see a PSOD like the one below:



d.) Wait for the HA event and reboot the blade, or use the coolness of Service Profiles if you are booting from SAN (see the sketch after this list):

  • Put the affected ESXi host into Maintenance Mode (if you can) and then power it down.
  • Disassociate the Service Profile from the affected blade (UCSM will attempt to gracefully shut down or power off the blade).
  • Associate the Service Profile of the affected blade with a spare blade - it takes around 5 minutes to spin up the new blade server in UCSM.
  • Boot the server, exit Maintenance Mode or reconnect the host in vCenter, and the host pops up automatically in the vSphere cluster.
  • You have the same ESXi host on different hardware - that's the beauty of the statelessness of UCS blades.
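The sketch mentioned above - a rough UCSM CLI version of the Service Profile move. The profile name ESX-HOST-01 and the spare blade in chassis 1, slot 7 are made up for illustration, and this assumes boot-from-SAN with zoning/LUN masking already in place for the profile's WWPNs:

    UCS-A# scope org /
    UCS-A /org # scope service-profile ESX-HOST-01
    UCS-A /org/service-profile # disassociate
    UCS-A /org/service-profile # commit-buffer
    UCS-A /org/service-profile # associate server 1/7
    UCS-A /org/service-profile # commit-buffer

Because the identity (MACs, WWNs, UUID) travels with the profile, the spare blade comes up looking exactly like the failed one.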


the end.




