Virtual disks fail during scheduled failover


I have an issue on a Storage Server 2012 R2 cluster that I have not been able to resolve. First, here's the setup:

3x HPE StoreEasy 38xx Gateway Storage (essentially a 2U dual-socket DL360)

Each server uses the on-board quad-port Intel NIC: 2 interfaces for iSCSI connectivity and 2 for LAN communication in an LACP team (the iSCSI interfaces are not teamed).

The storage is an HPE MSA 2040 with 800TB of raw disk, connected via 1Gb iSCSI. Volumes on the MSA are presented to the servers as 2 20TB volumes at a time, 1 in each disk pool on the MSA (MPIO is configured). The volumes are placed in 1 of two clustered Storage Spaces pools. From there, virtual disks are created in the pool and added to the cluster for use as storage resources in file server roles (not as CSV disks). The virtual disks are formatted NTFS, starting at 10TB, with a 16K allocation unit size configured on each one during formatting so that they can grow to 64TB. They are mounted using mount points, and the mount point disk is not a virtual volume but a standard iSCSI volume presented directly from the MSA. Once added to the clustered file server role, the disks are assigned hidden share names and abstracted using a standalone DFS role, which is part of the same cluster, into a unified file share space. Each NTFS volume is configured with shadow copies enabled and deduplication enabled (set to standard/general file server).
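For anyone checking the 16K allocation unit and the 64TB growth ceiling mentioned above, here is a minimal sketch of the arithmetic. It assumes the commonly documented NTFS implementation limit of 2^32 - 1 clusters per volume; the function names are made up for illustration and are not part of any Windows tooling.

```python
# Rough NTFS volume size ceiling per allocation unit (cluster) size.
# Assumption: the Windows NTFS implementation limit of 2**32 - 1 clusters per volume.
MAX_CLUSTERS = 2**32 - 1
TIB = 2**40


def max_ntfs_volume_bytes(cluster_bytes: int) -> int:
    """Largest volume size, in bytes, addressable with the given cluster size."""
    return MAX_CLUSTERS * cluster_bytes


def to_tib(size_bytes: int) -> float:
    """Convert a byte count to tebibytes."""
    return size_bytes / TIB


if __name__ == "__main__":
    for cluster_kib in (4, 8, 16, 32, 64):
        ceiling = to_tib(max_ntfs_volume_bytes(cluster_kib * 1024))
        print(f"{cluster_kib:>2}K clusters: about {ceiling:.0f} TiB maximum")
```

Under that assumption, the default 4K allocation unit would cap a volume at about 16TB, which is why a 10TB disk that is meant to grow needs the larger allocation unit chosen at format time.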

This all seems to work fine, until I fail over one of the file server roles.

At this time, there are 2 file server roles, one of them with more storage assigned than the other. Clustered file server role 1, with a 32TB primary storage volume, seems to fail over to another node fine and without issue. The other clustered file server role is the bugbear. When I initiate a failover, the disks within the role take quite some time to go offline (2-4 mins), and once they go offline, they have a major issue coming back online. When they attempt to come online on whatever node I have designated to take on the role, some or all of them invariably fail and show either 'Failed' or 'Offline'. When I look at 'Additional Information' to ascertain the reason, it indicates an I/O error on the disk, which is odd considering the disk is a virtual volume within the pool, and the pool is accessible and healthy.

When troubleshooting this myself, I have been able to remove the problematic volumes from the cluster entirely, assign one of the nodes in the cluster as the read/write server for the disk, and mount the disk on that node. I have then been able to browse the disk in Explorer, and read and write without issue. However, when I place it back in the cluster, I get the same error, and the disk shows as 'Detached' in the Server Manager view. If I attempt to attach it there, it gives me a permissions issue, stating that I do not have permission to attach the disk. If I attempt to bring the disk online through Failover Clustering, I get the same I/O error as before and am unable to bring it online.

The resolution to this, so far, has been to reboot the cluster nodes. Even then, the resource disks do not come online right away. At first, they attempt to come online assigned to one of the 3 nodes in the cluster. Invariably, they fail as before and the cluster node reboots on its own due to an issue it has with the disks (which I have not been able to chase down). After considerable time (3-6 hours), and multiple reboots of each node in the cluster, the disks come online again and appear to be issue free. Critically, I only have this issue on the cluster where actual data exists (around 110 TB so far). There is another cluster with the same config (same disk layout, same backend storage), but with no actual data on it. That cluster fails over in seconds with no issue.

My theories so far are:

1. VSS is choking on the disks and their sheer size. Each one is at least 10TB, and there are 10 of them within 1 role

2. Deduplication on the disks is causing some unseen issue

3. Disk numbering is somehow out of sorts and causing an issue

So, I ask of you... any bright ideas???

Everyone (or Leo),

I had shadow copies and deduplication configured on the volumes referenced in this post. Both of these services require the use of VSS, albeit as a discrete service for each. I turned off deduplication on the volumes, and things appear to be behaving. The cluster is failing over correctly, and I'm not seeing the disk failures during the course of a failover that I did before. For all intents and purposes, the issue is resolved, save for no longer being able to leverage the space savings of deduplication (which were upwards of 10% on these volumes).

Is there any official guidance on having both shadow copies and dedupe turned on for large volumes? Is there a 'rule of thumb' or similar? Is there a way to dedicate compute resources to each VSS service in question so it doesn't choke so hard on larger volumes?


