Recently did an ESXi patch run, along with some BIOS and firmware updates, on a ScaleIO 2.x environment (more precisely 2.0.5014.0). The environment consists of Dell PowerEdge servers, some running ESXi 6.0 build 3380124, others Linux-based, non-virtualized hosts. Luckily this environment was ScaleIO 2.x, because this version has a real maintenance mode (1.3.x did not). This means that while I can only patch one host at a time in this layout, I can do it fairly quickly and in a controlled fashion.
ScaleIO Maintenance Mode vs. ESXi Maintenance Mode
These are, obviously, two different things. With ScaleIO maintenance mode, you can put one SDS (the component providing storage services) host at a time into maintenance mode (at least in this configuration with two MDMs) without an adverse impact on the cluster. The remaining SDS nodes will take care of operations, provided they don't also break or go down at the same time. After you are done patching, you exit maintenance mode, which then makes sure all changes are rebuilt and synced across the cluster nodes. This takes some time depending on the amount of data involved.
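If you prefer the CLI over the GUI, the same maintenance mode operations can be driven with scli on the Master MDM. This is a sketch only: the flag names (`--enter_maintenance_mode`, `--exit_maintenance_mode`, `--sds_name`) are from memory and the SDS name is a placeholder, so verify both against the user guide for your exact build.

```shell
# Sketch only: the scli flag names and the SDS name below are assumptions --
# check the ScaleIO 2.0.x User Guide before running anything for real.
SDS_NAME="sds-esx01"   # placeholder; substitute your real SDS name

ENTER_CMD="scli --enter_maintenance_mode --sds_name $SDS_NAME"
EXIT_CMD="scli --exit_maintenance_mode --sds_name $SDS_NAME"

# Print the commands for review instead of executing them blindly:
echo "$ENTER_CMD"
echo "$EXIT_CMD"
```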
ESXi maintenance mode, on the other hand, deals with putting the VMware hypervisor layer into maintenance mode so you can patch and perform other operations on it with no VMs running. The order going into the maintenance break is:
- ScaleIO maintenance mode
- VMware ESXi maintenance mode
And when coming out of the maintenance break, it’s the reverse.
I left the SVM (the virtual machine on each host which takes care of the different functions that the host has; technically a SLES appliance) on the host that I was patching, but I powered it down gracefully before putting the host into maintenance mode.
So accounting for all these things, my order was:
- Migrate all running VMs except the SVM off of the host using vMotion
- When the host is empty (bar the SVM), put ScaleIO into maintenance mode
- This is done via the ScaleIO GUI application, on the Backend page, by right-clicking on the host. I did not have to use the force option, and neither should you…
- Shut down the SVM via “Shut Down Guest” in vCenter
- Put the host into maintenance mode without moving the SVM off the host (I suppose you could move it, but I didn’t)
- Scan and Remediate and install other patches (I installed BIOS, iDRAC and some other various updates via iDRAC; I had set them to “Install next reboot” so they would be installed during the same reboot as ESXi does remediation)
- Once you are satisfied, take the host out of maintenance mode
- Start the SVM on that host
- Wait for it to boot
- Exit ScaleIO maintenance mode (the reverse of putting it into maintenance mode, again via the Backend page in the ScaleIO GUI)
- Check to see that rebuild goes through (ScaleIO GUI application, either the Dashboard or Backend page)
- Make sure all warnings and errors clear. During host remediation and patching, I had the following errors
- High – MDM isn’t clustered (this is because you’ve shut down one of the SVMs containing the MDM role)
- Medium – SDS is disconnected (for the host being remediated)
- Low – SDS is in maintenance mode (for the host being remediated)
- After the SVM starts, it should clear all but the last alert, and once you have exited ScaleIO maintenance mode, the final alert should clear as well
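The ESXi-side steps above can also be done from the host's SSH shell instead of vCenter. The esxcli and vim-cmd commands below are standard on ESXi 6.0, but the VM id is a placeholder, and everything is wrapped in a dry-run helper so the sketch just prints what it would do:

```shell
# Dry-run wrapper: prints each command instead of executing it, so this
# sketch is safe to read through; set DRY_RUN=0 on a real host.
DRY_RUN=1
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Find the SVM's VM id first (placeholder id 42 used below):
run vim-cmd vmsvc/getallvms
# Gracefully shut down the SVM (equivalent of "Shut Down Guest"):
run vim-cmd vmsvc/power.shutdown 42
# Enter ESXi maintenance mode once the host is empty:
run esxcli system maintenanceMode set --enable true
# ... remediate, reboot, BIOS/iDRAC updates apply here ...
# Exit maintenance mode and power the SVM back on:
run esxcli system maintenanceMode set --enable false
run vim-cmd vmsvc/power.on 42
```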
(Expected) Alerts during maintenance
As mentioned, you will have alerts and warnings during this operation. I had the following:
- First, when putting the SDS into maintenance mode in ScaleIO, one warning about SDS being in maintenance mode:
- After SVM is shut down and ESXi is also placed in maintenance, two more:
- Then once you have remediated and taken the host out of maintenance, and started the SVM, you’re back to one, as in the first picture.
- When you take the SDS out of maintenance, it will clear the last alert
Note that the highest-rated alert, “MDM isn’t clustered”, is the one to pay attention to. It means that the SDS you are taking down for maintenance also holds the MDM role (critical for the management of ScaleIO). Normally you’d have another MDM to take over; you shouldn’t proceed with any of this if you can only find one MDM, or if you already had this (or any other) alert before starting.
EMC has this to say about MDMs (see also the document h14036-emc-scaleio-operation-ensuring-non-disruptive-operation-upgrade.pdf):
Currently, an MDM can manage up to 1024 servers. When several MDMs are present, an SDC may be managed by several MDMs, whereas, an SDS can only belong to one MDM. ScaleIO version 2.0 and later supports five MDMs (with a minimum of three) where we define a Master, Slave and Tie-breaker MDM.
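Before starting any of this, it's worth checking from the CLI that the MDM cluster is actually whole. A minimal sketch, assuming the 2.0.x `scli --query_cluster` command (the command name is an assumption; verify it against the user guide for your build):

```shell
# Assumption: "scli --query_cluster" shows the cluster mode and the state
# of the Master, Slave and Tie-Breaker MDMs in ScaleIO 2.0.x.
QUERY_CMD="scli --query_cluster"

# Print for review; run it on the Master MDM and confirm all MDM roles
# are present and healthy before entering maintenance mode anywhere.
echo "$QUERY_CMD"
```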
Roles / Elements in ScaleIO
You can see the installed roles in VMware in the notes field, like so:
Elements or roles are (may not be a complete list):
- MASTER_MDM – Master MDM node, Meta Data Manager, enables monitoring and configuration changes
- SLAVE_MDM – Secondary MDM node, will take over if Master is unavailable
- SDS – Storage node, ScaleIO Data Server, provides storage services through HDD, SSD, NVMe etc.
- SDC – ScaleIO Data Client, consumer of resources (e.g. a virtualization host)
- RFCACHE – Read-only cache consisting of SSD or Flash
- RMCACHE – RAM based cache
- LIA – Light installation agent (on all nodes, creates a trust between node and Installation Manager)
- TB – Tie-Breaker; resolves conflicts inside the cluster, counted as a type of MDM, non-critical except in HA/conflict situations
ESXi funny business…
While running remediate on the hosts, every single one failed when installing patches.
A very scary-looking Fatal Error 15. However, there’s a KB on this here.
So, (warm) reboot the host again, wait for ESXi to load the old pre-update version, and run Remediate again without using the Stage option first. I had used Stage, as I’m used to doing; apparently this breaks things. Sometimes.
And to reiterate, I was patching using vCenter Update Manager (VUM) from 6.0 build 3380124 to build 5050593.
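After a failed-then-retried remediation like this, it's worth confirming from the host's shell that the build really changed before exiting ScaleIO maintenance mode. The commands are standard ESXi ones; the dry-run wrapper is just for illustration:

```shell
# Dry-run wrapper for illustration; drop it to actually run the commands.
run() { echo "+ $*"; }

# Both report the running ESXi version/build (expect build 5050593 here):
run vmware -vl
run esxcli system version get
```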
docu82353_ScaleIO-Software-2.0.1.x-Documentation-set.zip from support.emc.com (not actually for the version in use, but similar enough in this case; use at your own risk)
ScaleIO v2.0.x User Guide.pdf, contained in the above-mentioned documentation set