vSphere Metro Storage Cluster: SIOC blocking PDL state recovery


During failover tests in a stretched metro cluster environment we ran into some problems when recovering from a Permanent Device Loss (PDL) state.

The failover tests themselves ran successfully. The vSphere servers reacted as expected when testing a split-brain scenario on the VPLEX cluster. The VMs that were running in the same datacenter as the preferred VPLEX node of their datastore were not impacted. The VMs that were not running in the datacenter of their preferred VPLEX node were stopped and restarted at the preferred site.

    2013-03-22T15:43:52.930Z cpu16:8878)PowerPath:Path "vmhba3:C0:T0:L4" is getting into PDL state.[2/4/3]
    2013-03-22T15:43:52.932Z cpu16:8878)PowerPath:Path "vmhba4:C0:T1:L4" is getting into PDL state.[2/4/3]
    2013-03-22T15:43:52.936Z cpu4:8878)PowerPath:Path "vmhba3:C0:T1:L4" is getting into PDL state.[2/4/3]
    2013-03-22T15:43:52.938Z cpu4:8878)PowerPath:Path "vmhba4:C0:T0:L4" is getting into PDL state.[2/4/3]
    2013-03-22T15:43:52.938Z cpu4:8878)WARNING: ScsiDevice: 1425: Device :naa.6000144000000010202925c828f4fac5 has been removed or is permanently inaccessible.
    ...
    2013-03-22T15:43:52.941Z cpu2:8878)ScsiDeviceIO: 2329: Cmd(0x412442b833c0) 0x2a, CmdSN 0x76e from world 8224 to dev "naa.6000144000000010202925c828f4fac5" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3.

When restoring the VPLEX cluster to an operational state, we noticed some datastores stayed in an inactive state at their non-preferred site.

The logs show PowerPath recognizing the paths coming alive again and the devices getting out of PDL state.

    2013-03-22T15:54:02.187Z cpu16:8804)PowerPath:Path vmhba4:C0:T0:L2 to FNM00122400222 is alive.
    2013-03-22T15:54:02.188Z cpu16:8804)PowerPath:Path "vmhba4:C0:T0:L4" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.189Z cpu16:8804)PowerPath:Path "vmhba4:C0:T1:L4" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.190Z cpu16:8804)PowerPath:Path "vmhba3:C0:T0:L4" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.191Z cpu16:8804)PowerPath:Path "vmhba3:C0:T1:L4" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.191Z cpu18:13178)ScsiDevice: 6061: Device naa.6000144000000010202925c828f4fac5 APD Notify PERM LOSS END; token num:1
    2013-03-22T15:54:02.191Z cpu16:8804)WARNING: ScsiDevice: 1448: Device naa.6000144000000010202925c828f4fac5 has been plugged back in after being marked permanently inaccessible. No data consistency guarantees.
    2013-03-22T15:54:02.196Z cpu16:8804)PowerPath:Path vmhba4:C0:T0:L4 to FNM00122400222 is alive.
    2013-03-22T15:54:02.201Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L2 to FNM00122400222 is alive.
    2013-03-22T15:54:02.205Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L2 to FNM00122400222 is alive.
    2013-03-22T15:54:02.211Z cpu16:8804)PowerPath:Path vmhba3:C0:T1:L2 to FNM00122400222 is alive.
    2013-03-22T15:54:02.212Z cpu16:8804)PowerPath:Path "vmhba4:C0:T1:L8" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.213Z cpu16:8804)PowerPath:Path "vmhba3:C0:T0:L8" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.214Z cpu16:8804)PowerPath:Path "vmhba3:C0:T1:L8" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.215Z cpu16:8804)PowerPath:Path "vmhba4:C0:T0:L8" is getting out of PDL state.[0/0/0]
    2013-03-22T15:54:02.215Z cpu0:8385)ScsiDevice: 6121: No Handlers registered!
    2013-03-22T15:54:02.215Z cpu16:8804)WARNING: ScsiDevice: 1448: Device naa.6000144000000010202925c828f4facd has been plugged back in after being marked permanently inaccessible. No data consistency guarantees.
    2013-03-22T15:54:02.215Z cpu0:8385)ScsiDevice: 6061: Device naa.6000144000000010202925c828f4facd APD Notify PERM LOSS END; token num:1
    2013-03-22T15:54:02.221Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L8 to FNM00122400222 is alive.
    2013-03-22T15:54:02.225Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L8 to FNM00122400222 is alive.
    2013-03-22T15:54:02.233Z cpu16:8804)PowerPath:Path vmhba3:C0:T1:L8 to FNM00122400222 is alive.
    2013-03-22T15:54:02.238Z cpu16:8804)PowerPath:Path vmhba4:C0:T1:L6 to FNM00122400222 is alive.
    2013-03-22T15:54:02.242Z cpu16:8804)PowerPath:Path vmhba3:C0:T0:L6 to FNM00122400222 is alive.

vCenter still showed three datastores as inactive at both sites.

[Image: http://vermost.files.wordpress.com/2013/04/dc1pdl1.png]

[Image: http://vermost.files.wordpress.com/2013/04/dc6pdl1.png]

The logs showed the following errors for these datastores.

    2013-03-22T15:54:07.025Z cpu16:8208)ScsiDeviceIO: 2329: Cmd(0x412481a79e00) 0x2a, CmdSN 0x782 from world 9236 to dev "naa.6000144000000010202925c828f4fac5" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
    ...
    2013-03-22T15:59:58.517Z cpu20:9246)ScsiDevice: 5261: naa.6000144000000010202925c828f4fac5 device :Open count > 0, cannot be brought online
    2013-03-22T15:59:58.522Z cpu20:9246)ScsiDevice: 5261: naa.6000144000000010202925c828f4fac9 device :Open count > 0, cannot be brought online
    2013-03-22T15:59:58.532Z cpu20:9246)ScsiDevice: 5261: naa.6000144000000010202925c828f4fad1 device :Open count > 0, cannot be brought online
    2013-03-22T15:59:58.696Z cpu20:9246)Vol3: 692: Couldn't read volume header from control: Not supported
    2013-03-22T15:59:58.696Z cpu20:9246)Vol3: 692: Couldn't read volume header from control: Not supported

When looking at the datastores, we noticed that the ones that automatically recovered had Storage I/O Control (SIOC) disabled, while the ones that stayed inactive had SIOC enabled.
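If you want to correlate this quickly across all datastores instead of clicking through the vSphere client, something like the following pyVmomi sketch can do it. The vCenter hostname and credentials are placeholders, and it assumes the datastore object exposes its SIOC state through the iormConfiguration property.

```python
# Minimal pyVmomi sketch: list each datastore's accessibility next to its SIOC setting.
# Hostname and credentials are placeholders; adjust SSL handling for your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()           # lab only: skip certificate verification
si = SmartConnect(host="vcenter.example.local",  # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    for ds in view.view:
        sioc = ds.iormConfiguration.enabled if ds.iormConfiguration else False
        print(f"{ds.name}: accessible={ds.summary.accessible} SIOC={sioc}")
    view.Destroy()
finally:
    Disconnect(si)
```

In our case the output lined up exactly with what vCenter showed: the inaccessible datastores were the ones reporting SIOC as enabled.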

After we disabled SIOC on the inactive datastores, they became active again.
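Disabling SIOC can be done per datastore in the vSphere client, but with several datastores affected it may be quicker to script it. The snippet below is a sketch only, assuming pyVmomi's StorageResourceManager.ConfigureDatastoreIORM_Task API and reusing the "content" object from the previous example; the datastore name is hypothetical.

```python
# Sketch: disable SIOC on a datastore via the StorageResourceManager
# (reuses the "content" object obtained in the previous snippet).
from pyVmomi import vim

def disable_sioc(content, ds_name):
    """Look up the datastore by name and submit an IORM reconfiguration task."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    try:
        ds = next(d for d in view.view if d.name == ds_name)
    finally:
        view.Destroy()

    spec = vim.StorageResourceManager.IORMConfigSpec(enabled=False)
    task = content.storageResourceManager.ConfigureDatastoreIORM_Task(
        datastore=ds, spec=spec)
    return task  # caller can wait on the task before re-checking the datastore

# Example call (hypothetical datastore name):
# disable_sioc(content, "vplex_ds_04")
```

Once the reconfiguration task completed, the datastores in our environment went back to an active state without any further action.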

KB article 2032690 describes this issue, where SIOC blocks the recovery of datastores from a PDL state.

Another possible cause is described in KB article 2014155, where a VM using RDMs blocks the recovery from a PDL state.