vSphere Metro Storage Cluster: SIOC blocking PDL state recovery

During failover tests in a stretched metro cluster environment we ran into some problems when recovering from a Permanent Device Loss (PDL) state.

The failover tests themselves ran successfully. The vSphere hosts reacted as expected when we tested a split-brain scenario on the VPLEX cluster: VMs running in the same datacenter as the preferred VPLEX node of their datastore were not impacted, while VMs not running in the datacenter of their preferred VPLEX node were stopped and restarted at the preferred site.

2013-03-22T15:43:52.930Z cpu16:8878)PowerPath:Path "vmhba3:C0:T0:L4" is
getting into PDL state.\[2/4/3\]  
2013-03-22T15:43:52.932Z cpu16:8878)PowerPath:Path "vmhba4:C0:T1:L4" is
getting into PDL state.\[2/4/3\]  
2013-03-22T15:43:52.936Z cpu4:8878)PowerPath:Path "vmhba3:C0:T1:L4" is
getting into PDL state.\[2/4/3\]  
2013-03-22T15:43:52.938Z cpu4:8878)PowerPath:Path "vmhba4:C0:T0:L4" is
getting into PDL state.\[2/4/3\]  
2013-03-22T15:43:52.938Z cpu4:8878)WARNING: ScsiDevice: 1425: Device
:naa.6000144000000010202925c828f4fac5 has been removed or is permanently
inaccessible.  
...  
2013-03-22T15:43:52.941Z cpu2:8878)ScsiDeviceIO: 2329:
Cmd(0x412442b833c0) 0x2a, CmdSN 0x76e from world 8224 to dev
"naa.6000144000000010202925c828f4fac5" failed H:0x0 D:0x2 P:0x0 Valid
sense data: 0x2 0x4 0x3.  
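
When digging through a large vmkernel log, picking out which paths went into (or came out of) PDL can be tedious by hand. As a rough aid, a small script can pull the PowerPath PDL messages, in the format shown above, out of a log file. This is just an illustrative sketch based on the log lines in this post, not an official tool:

```python
import re

# Matches PowerPath PDL messages of the form seen in the vmkernel log, e.g.:
#   PowerPath:Path "vmhba3:C0:T0:L4" is getting into PDL state.[2/4/3]
PDL_RE = re.compile(
    r'PowerPath:Path "(?P<path>[^"]+)" is getting (?P<dir>into|out of) '
    r'PDL state\.\[(?P<sense>[\d/]+)\]')

def pdl_events(lines):
    """Yield (path, direction, sense) tuples for each PDL log message."""
    for line in lines:
        m = PDL_RE.search(line)
        if m:
            yield m.group("path"), m.group("dir"), m.group("sense")
```

Feeding it the log excerpt above would, for example, yield `("vmhba3:C0:T0:L4", "into", "2/4/3")` for the first line.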

When we restored the VPLEX cluster to an operational state, we noticed that some datastores stayed in an inactive state at their non-preferred site.

The logs show PowerPath recognizing the paths coming alive again and the devices getting out of PDL state.

2013-03-22T15:54:02.187Z cpu16:8804)PowerPath:Path vmhba4:C0:T0:L2 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.188Z cpu16:8804)PowerPath:Path "vmhba4:C0:T0:L4" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.189Z cpu16:8804)PowerPath:Path "vmhba4:C0:T1:L4" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.190Z cpu16:8804)PowerPath:Path "vmhba3:C0:T0:L4" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.191Z cpu16:8804)PowerPath:Path "vmhba3:C0:T1:L4" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.191Z cpu18:13178)ScsiDevice: 6061: Device
naa.6000144000000010202925c828f4fac5 APD Notify PERM LOSS END; token
num:1  
2013-03-22T15:54:02.191Z cpu16:8804)WARNING: ScsiDevice: 1448: Device
naa.6000144000000010202925c828f4fac5 has been plugged back in after
being marked permanently inaccessible. No data consistency guarantees.  
2013-03-22T15:54:02.196Z cpu16:8804)PowerPath:Path vmhba4:C0:T0:L4 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.201Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L2 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.205Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L2 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.211Z cpu16:8804)PowerPath:Path vmhba3:C0:T1:L2 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.212Z cpu16:8804)PowerPath:Path "vmhba4:C0:T1:L8" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.213Z cpu16:8804)PowerPath:Path "vmhba3:C0:T0:L8" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.214Z cpu16:8804)PowerPath:Path "vmhba3:C0:T1:L8" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.215Z cpu16:8804)PowerPath:Path "vmhba4:C0:T0:L8" is
getting out of PDL state.\[0/0/0\]  
2013-03-22T15:54:02.215Z cpu0:8385)ScsiDevice: 6121: No Handlers
registered!  
2013-03-22T15:54:02.215Z cpu16:8804)WARNING: ScsiDevice: 1448: Device
naa.6000144000000010202925c828f4facd has been plugged back in after
being marked permanently inaccessible. No data consistency guarantees.  
2013-03-22T15:54:02.215Z cpu0:8385)ScsiDevice: 6061: Device
naa.6000144000000010202925c828f4facd APD Notify PERM LOSS END; token
num:1  
2013-03-22T15:54:02.221Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L8 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.225Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L8 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.233Z cpu16:8804)PowerPath:Path vmhba3:C0:T1:L8 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.238Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L6 to
FNM00122400222 is alive.  
2013-03-22T15:54:02.242Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L6 to
FNM00122400222 is alive.  

vCenter still showed three datastores as inactive at both sites.


The logs showed the following errors for these datastores.

2013-03-22T15:54:07.025Z cpu16:8208)ScsiDeviceIO: 2329:
Cmd(0x412481a79e00) 0x2a, CmdSN 0x782 from world 9236 to dev
"naa.6000144000000010202925c828f4fac5" failed H:0x1 D:0x0 P:0x0 Possible
sense data: 0x0 0x0 0x0.  
...  
2013-03-22T15:59:58.517Z cpu20:9246)ScsiDevice: 5261:
naa.6000144000000010202925c828f4fac5 device :Open count > 0, cannot
be brought online  
2013-03-22T15:59:58.522Z cpu20:9246)ScsiDevice: 5261:
naa.6000144000000010202925c828f4fac9 device :Open count > 0, cannot
be brought online  
2013-03-22T15:59:58.532Z cpu20:9246)ScsiDevice: 5261:
naa.6000144000000010202925c828f4fad1 device :Open count > 0, cannot
be brought online  
2013-03-22T15:59:58.696Z cpu20:9246)Vol3: 692: Couldn't read volume
header from control: Not supported  
2013-03-22T15:59:58.696Z cpu20:9246)Vol3: 692: Couldn't read volume
header from control: Not supported  

When looking at the datastores, we noticed that the ones that recovered automatically had Storage I/O Control (SIOC) disabled, while the ones that stayed inactive had SIOC enabled.

After we disabled SIOC on the inactive datastores, they became active again.
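
To narrow down which inactive datastores are candidates for this workaround, the SIOC setting can be read programmatically. In the vSphere API a datastore exposes its SIOC configuration via `iormConfiguration.enabled` (as used by pyVmomi); the helper below is a minimal sketch of that check, assuming you already have a list of datastore objects and the names vCenter reports as inactive:

```python
def sioc_enabled(datastore):
    """Return True if Storage I/O Control is enabled on a datastore object.

    Relies on the vSphere API property Datastore.iormConfiguration.enabled;
    guards against it being absent (e.g. NFS datastores or older hosts).
    """
    cfg = getattr(datastore, "iormConfiguration", None)
    return bool(cfg and cfg.enabled)

def sioc_blocked_candidates(datastores, inactive_names):
    """Of the datastores reported inactive, return those with SIOC enabled,
    i.e. the ones where SIOC may be blocking PDL recovery."""
    return [ds.name for ds in datastores
            if ds.name in inactive_names and sioc_enabled(ds)]
```

In PowerCLI the equivalent quick check is the `StorageIOControlEnabled` property on the objects returned by `Get-Datastore`.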

KB article 2032690 describes this issue, where SIOC blocks the recovery of datastores from a PDL state.

Another possible cause is described in KB article 2014155, where a VM using RDMs blocks the recovery from a PDL state.
