
Cluster Shared Volume (CSV) Inside Out


In this blog post we will take a look under the hood of the cluster file system in Windows Server 2012 R2 called Cluster Shared Volumes (CSV). This post is targeted at developers and ISVs who are looking to integrate their storage solutions with CSV.

Note: Throughout this blog, I will refer to C:\ClusterStorage assuming that Windows is installed on the C:\ drive. Windows can be installed on any available drive and the CSV namespace will be built on the system drive, but instead of writing %SystemDrive%\ClusterStorage\ I use C:\ClusterStorage for better readability, since C:\ is the system drive most of the time.

Components

Cluster Shared Volumes in Windows Server 2012 is a completely re-architected solution compared to the Cluster Shared Volumes you knew in Windows Server 2008 R2. Although it may look similar in the user experience – just a bunch of volumes mapped under C:\ClusterStorage\ that you access through the regular Windows file system interface – under the hood these are two completely different architectures. One of the main goals was to expand CSV beyond the Hyper-V workload: in Windows Server 2012 CSV also supports Scale-Out File Server, and in Windows Server 2012 R2 CSV is also supported with SQL Server 2014.

First, let us look under the hood of CsvFs at the components that constitute the solution.

Figure 1 CSV Components and Data Flow Diagram

The diagram above shows a 3-node cluster. There is one shared disk that is visible to Node 1 and Node 2. Node 3 in this diagram has no direct connectivity to the storage. The disk was first clustered and then added to Cluster Shared Volumes. From the user's perspective, everything will look the same as in Windows Server 2008 R2. On every cluster node you will find a mount point to the volume: C:\ClusterStorage\Volume1. The "VolumeX" name can be changed; just use Windows Explorer and rename it like you would any other directory.  CSV will then take care of synchronizing the updated name around the cluster to ensure all nodes are consistent.  Now let's look at the components that are backing these mount points.
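
For example, here is a minimal sketch of renaming a mount point from PowerShell, assuming the default Volume1 name and sufficient permissions; the Accounting name is just an illustration:

Rename-Item -Path C:\ClusterStorage\Volume1 -NewName Accounting   # CSV propagates the new name to all nodes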

Terminology

The node where NTFS for the clustered CSV disk is mounted is called the Coordinator Node. In this context, any other node that does not have the clustered disk mounted is called a Data Server (DS). Note that the Coordinator Node is always a Data Server node at the same time; in other words, the coordinator is simply the special Data Server node where NTFS is mounted.

If you have multiple disks in CSV, you can place them on different cluster nodes. The node that hosts a disk will be the Coordinator Node only for the volumes that are located on that disk. Since each node might be hosting a disk, each of them might be a Coordinator Node, but for different disks. So technically, to avoid ambiguity, we should always qualify "Coordinator Node" with the volume name; for instance, we should say: "Node 2 is the Coordinator Node for Volume1". For simplicity, most of the examples in this blog post will have only one CSV disk in the cluster, so we will drop the qualification and just say Coordinator Node to refer to the node that has this disk online.

Sometimes we will use the terms "disk" and "volume" interchangeably because in the samples we go through, one disk has only one NTFS volume, which is the most common deployment configuration. In practice, you can create multiple volumes on a disk and CSV fully supports that as well. When you move disk ownership from one cluster node to another, all the volumes travel along with the disk, and any given node is the coordinator for all volumes on a given disk. Storage Spaces is one exception to that model, but we will ignore that possibility for now.

This diagram is complicated, so let's break it into pieces, discuss each piece separately, and then hopefully the whole picture will make more sense.

On Node 2, you can see the following stack that represents the mounted NTFS. The cluster guarantees that only one node has NTFS in a state where it can write to the disk; this is important because NTFS is not a clustered file system.  CSV provides a layer of orchestration that enables NTFS or ReFS (with Windows Server 2012 R2) to be accessed concurrently by multiple servers. The following blog post explains how the cluster leverages SCSI-3 Persistent Reservation commands with disks to implement that guarantee: http://blogs.msdn.com/b/clustering/archive/2009/03/02/9453288.aspx

Figure 2 CSV NTFS stack

The cluster hides this volume so that the Volume Manager ("Volume" in the diagram above) does not assign a volume GUID to it, and no drive letter is assigned either. You also will not see this volume using mountvol.exe or the FindFirstVolume() and FindNextVolume() WIN32 APIs.

On the NTFS stack the cluster attaches an instance of a file system mini-filter driver called CsvFlt.sys at altitude 404800. You can see that filter attached to the NTFS volume used by CSV if you run the following command:

>fltmc.exe instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
<skip>
CsvFlt                \Device\HarddiskVolume7                   404800     CsvFlt Instance
<skip>

Applications are not expected to access the NTFS stack, and we even go the extra mile to block access to this volume from user mode applications. CsvFlt checks all create requests coming from user mode against the security descriptor that is kept in the cluster public property SharedVolumeSecurityDescriptor. You can use the PowerShell cmdlet "Get-Cluster | fl SharedVolumeSecurityDescriptor" to get to that property. The output of this PowerShell cmdlet shows the value of the security descriptor in self-relative binary format (http://msdn.microsoft.com/en-us/library/windows/desktop/aa374807(v=vs.85).aspx):

PS D:\Windows\system32> Get-Cluster | fl SharedVolumeSecurityDescriptor

SharedVolumeSecurityDescriptor : {1, 0, 4, 128...}
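
If you want to see that descriptor in a human-readable form, one possible approach (a sketch, not part of the original procedure) is to load the bytes into the .NET RawSecurityDescriptor class and ask for the SDDL form:

# Sketch: decode the self-relative binary security descriptor into SDDL
$bytes = [byte[]]((Get-Cluster).SharedVolumeSecurityDescriptor)
$sd    = New-Object System.Security.AccessControl.RawSecurityDescriptor -ArgumentList $bytes, 0
$sd.GetSddlForm([System.Security.AccessControl.AccessControlSections]::All)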

 CsvFlt plays several roles:

  • Provides an extra level of protection for the hidden NTFS volume used for CSV
  • Helps provide a local volume experience (after all, CsvFs does look like a local volume). For instance, you cannot open the volume over SMB or read the USN journal. To enable these kinds of scenarios, CsvFs often marshals the operation that needs to be performed over to CsvFlt, disguising it behind a tunneling file system control. CsvFlt is responsible for converting the tunneled information back to the original request before forwarding it down the stack to NTFS.
  • It implements several mechanisms to help coordinate certain states across multiple nodes. We will touch on these in future posts; the File Revision Number is one example.

The next stack we will look at is the system volume stack. In the diagram above you see this stack only on the coordinator node, which has NTFS mounted. In practice, exactly the same stack exists on all nodes.

 

Figure 3 System Volume Stack

The CSV Namespace Filter (CsvNsFlt.sys) is a file system mini-filter driver at an altitude of 404900:

D:\Windows\system32>fltmc instances
Filter                Volume Name                              Altitude        Instance Name
--------------------  -------------------------------------  ------------  ----------------------
<skip>
CsvNSFlt              C:                                        404900     CsvNSFlt Instance
<skip>

CsvNsFlt plays the following roles:

  • It protects C:\ClusterStorage by blocking unauthorized attempts (those not coming from the cluster service) to delete or create any files or subfolders in this folder or to change any attributes on the files. Other than opening these folders, about the only operation that is not blocked is renaming them. You can use the command prompt or Explorer to rename C:\ClusterStorage\Volume1 to something like C:\ClusterStorage\Accounting.  The directory name will be synchronized and updated on all nodes in the cluster.
  • It helps us dispatch the block level redirected IO. We will cover this in more detail when we talk about block level redirected IO later in this post.

The last stack we will look at is the stack of the CSV file system. Here you will see two modules: the CSV Volume Manager (csvvbus.sys) and the CSV File System (CsvFs.sys). CsvFs is a file system driver and mounts exclusively to the volumes surfaced up by CsvVbus.

 

Figure 5 CsvFs stack

Data Flow

Now that we are familiar with the components and how they are related to each other, let’s look at the data flow.

First let's look at how metadata flows. Below you can see the same diagram as in Figure 1; I've kept only the arrows and blocks that are relevant to the metadata flow and removed the rest from the diagram.

 

Figure 6 Metadata Flow

Our definition of a metadata operation is everything except read and write. Examples of metadata operations are create file, close file, rename file, change file attributes, delete file, change file size, any file system control, etc. Some writes may also, as a side effect, cause a metadata change. For instance, an extending write will cause CsvFs to extend all or some of the following: file allocation size, file size and valid data length. A read might cause CsvFs to query some information from NTFS.

On the diagram above you can see that metadata from any node goes to the NTFS stack on Node 2. Data server nodes (Node 1 and Node 3) are using Server Message Block (SMB) as a protocol to forward metadata over.

Metadata is always forwarded to NTFS. On the coordinator node CsvFs forwards metadata IO directly to the NTFS volume, while other nodes use SMB to forward the metadata over the network.

Next, let's look at the data flow for Direct IO. The following diagram is produced from the diagram in Figure 1 by removing any blocks and lines that are not relevant to Direct IO. By definition, Direct IO means reads and writes that never go over the network, but go from CsvFs through CsvVbus straight to the disk stack. To make sure there is no ambiguity, I'll repeat it: Direct IO bypasses the volume stack and goes directly to the disk.

 Figure 7 Direct IO Flow

Both Node 1 and Node 2 can see the shared disk, so they can send reads and writes directly to the disk, completely avoiding sending data over the network. Node 3 is not in the diagram in Figure 7 Direct IO Flow since it cannot perform Direct IO, but it is still part of the cluster and it will use block level redirected IO for reads and writes.

The next diagram shows how a File System Redirected IO request flows. The diagram and data flow for redirected IO are very similar to those for metadata in Figure 6 Metadata Flow:

Figure 8 File System Redirected IO Flow

Later we will discuss when CsvFs uses the file system redirected IO to handle reads and writes and how it compares to what we see on the next diagram – Block Level Redirected IO:

Figure 9 Block Level Redirected IO Flow

Note that on this diagram I have completely removed the CsvFs stack and the CSV NTFS stack from the Coordinator Node, leaving only the system volume NTFS stack. The CSV NTFS stack is removed because Block Level Redirected IO completely bypasses it and goes to the disk (yes, like Direct IO it bypasses the volume stack and goes straight to the disk) below the NTFS stack. The CsvFs stack is removed because on the coordinating node CsvFs would never use Block Level Redirected IO and would always talk to the disk directly. The reason why Node 3 would use Block Level Redirected IO is that Node 3 does not have physical connectivity to the disk. A curious reader might wonder why Node 1, which can see the disk, would ever use Block Level Redirected IO. There are at least two cases when this might happen. Although the disk might be visible on the node, it is possible that IO requests will fail because the adapter or storage network switch is misbehaving. In this case, CsvVbus will first attempt to send IO to the disk and on failure will forward the IO to the Coordinator Node using Block Level Redirected IO. The other example is Storage Spaces: if the disk is a Mirrored Storage Space, then CsvFs will never use Direct IO on a data server node; instead it will send the block level IO to the Coordinating Node using Block Level Redirected IO.  In Windows Server 2012 R2 you can use the Get-ClusterSharedVolumeState cmdlet (http://technet.microsoft.com/en-us/library/dn456528.aspx) to query the CSV state (direct / file level redirected / block level redirected), and if redirected it will state why.

Note that CsvFs sends the Block Level Redirected IO to the CsvNsFlt filter attached to the system volume stack on the Coordinating Node. This filter dispatches the IO directly to the disk, bypassing NTFS and the volume stack, so no other filters below CsvNsFlt on the system volume will see that IO. Since CsvNsFlt sits at a very high altitude, in practice no one besides this filter will see these IO requests. This IO is also completely invisible to the CSV NTFS stack. You can think of Block Level Redirected IO as Direct IO that CsvVbus ships to the Coordinating Node, where, with the help of CsvNsFlt, it is dispatched directly to the disk just as Direct IO is dispatched directly to the disk by CsvVbus.

What are these SMB shares?

CSV uses the Server Message Block (SMB) protocol to communicate with the Coordinator Node. As you know, SMB3 requires certain configuration to work; for instance, it requires file shares. Let's take a look at how the cluster configures SMB to enable CSV.

If you dump the list of SMB file shares on a cluster node with CSV volumes, you will see the following:

> Get-SmbShare
Name                          ScopeName                     Path                          Description
----                          ---------                     ----                          -----------
ADMIN$                        *                             C:\Windows                    Remote Admin
C$                            *                             C:\                           Default share
ClusterStorage$               CLUS030512                  C:\ClusterStorage             Cluster Shared Volumes Def...
IPC$                          *                                                           Remote IPC

There is a hidden admin share that is created for CSV, shared as ClusterStorage$. This share is created by the cluster to facilitate remote administration, and you should use it in the scenarios where you would normally use an admin share on any other volume (such as D$). This share is scoped to the Cluster Name. The Cluster Name is a special kind of Network Name that is designed to be used to manage a cluster. You can learn more about Network Names in the following blog post: http://blogs.msdn.com/b/clustering/archive/2009/07/17/9836756.aspx.  You can access this share using the Cluster Name: \\<cluster name>\ClusterStorage$
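
For example, a quick way to browse the share (a sketch using the sample cluster name CLUS030512 from the output above):

dir \\CLUS030512\ClusterStorage$    # lists Volume1, Volume2, ... just like C:\ClusterStorage locally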

Since this is an admin share, it is ACL'd so that only members of the Administrators group have full access to it. In the output below, the access control list is defined using the Security Descriptor Definition Language (SDDL). You can learn more about SDDL here: http://msdn.microsoft.com/en-us/library/windows/desktop/aa379567(v=vs.85).aspx

ShareState            : Online
ClusterType           : ScaleOut
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           : Cluster Shared Volumes Default Share
EncryptData           : False
Name                  : ClusterStorage$
Path                  : C:\ClusterStorage
Scoped                : True
ScopeName             : CLUS030512
SecurityDescriptor    : D:(A;;FA;;;BA)

There are also a couple of hidden shares that are used by CSV. You can see them if you add the IncludeHidden parameter to the Get-SmbShare cmdlet. These shares are used only on the Coordinator Node; other nodes either do not have these shares or do not use them:

> Get-SmbShare -IncludeHidden
Name                          ScopeName                     Path                          Description
----                          ---------                     ----                          -----------
17f81c5c-b533-43f0-a024-dc... *                             \\?\GLOBALROOT\Device\Hard...
ADMIN$                        *                             C:\Windows                    Remote Admin
C$                            *                             C:\                           Default share
ClusterStorage$               VPCLUS030512                  C:\ClusterStorage             Cluster Shared Volumes Def...
CSV$                          *                             C:\ClusterStorage
IPC$                          *                                                           Remote IPC

For each Cluster Shared Volume hosted on a Coordinating Node, the cluster creates a share with a name that looks like a GUID. This share is used by CsvFs to communicate with the hidden CSV NTFS stack on the Coordinating Node; it points to the hidden NTFS volume used by CSV. Metadata and File System Redirected IO flow to the Coordinating Node using this share.

ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : 17f81c5c-b533-43f0-a024-dc431b8a7ee9-1048576$
Path                  : \\?\GLOBALROOT\Device\Harddisk2\ClusterPartition1\
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True 

On the Coordinating Node you will also see a share with the name CSV$. This share is used to forward Block Level Redirected IO to the Coordinating Node. There is only one CSV$ share on every Coordinating Node:

ShareState            : Online
ClusterType           : CSV
ShareType             : FileSystemDirectory
FolderEnumerationMode : Unrestricted
CachingMode           : Manual
CATimeout             : 0
ConcurrentUserLimit   : 0
ContinuouslyAvailable : False
CurrentUsers          : 0
Description           :
EncryptData           : False
Name                  : CSV$
Path                  : C:\ClusterStorage
Scoped                : False
ScopeName             : *
SecurityDescriptor    : O:SYG:SYD:(A;;FA;;;SY)(A;;FA;;;S-1-5-21-2310202761-1163001117-2437225037-1002)
ShadowCopy            : False
Special               : True
Temporary             : True

Users are not expected to use these shares; they are ACL'd so that only Local System and the Failover Cluster Identity user (CLIUSR) have access to them.

All of these shares are temporary: information about them is not kept in any persistent storage, and when a node reboots they will be removed from the Server service. The cluster takes care of creating the shares every time during CSV start up.

Conclusion

You can see that Cluster Shared Volumes in Windows Server 2012 R2 is built on a solid foundation of the Windows storage stack, CSVv1, and SMB3.

Thanks!
Vladimir Petter
Principal Software Development Engineer
Clustering & High-Availability
Microsoft

  

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx


Understanding the state of your Cluster Shared Volumes in Windows Server 2012 R2


Cluster Shared Volumes (CSV) is the clustered file system for the Microsoft private cloud, first introduced in Windows Server 2008 R2. In Windows Server 2012, we radically improved the CSV architecture; we presented a deep dive of these architecture improvements at TechEd 2012. Building on this new and improved architecture, in Windows Server 2012 R2 we have introduced several new CSV features. In this blog, I am going to discuss one of these new features – the new Get-ClusterSharedVolumeState Windows Server Failover Clustering PowerShell® cmdlet. This cmdlet enables you to view the state of your CSV. Understanding the state of your CSV is useful in troubleshooting failures as well as optimizing the performance of your CSV. In the remainder of this blog, I will explain how to use this cmdlet as well as how to interpret the information it provides.

Get-ClusterSharedVolumeState Windows PowerShell® cmdlet

The Get-ClusterSharedVolumeState cmdlet allows you to view the state of your CSV on a node in the cluster. Note that the state of your CSV can vary between the nodes of a cluster. Therefore, it might be useful to determine the state of your CSV on multiple or all nodes of your cluster.

To use the Get-ClusterSharedVolumeState cmdlet open a new Windows PowerShell console and run the following:

  •          To view the state of all CSVs on all the nodes of your cluster

Get-ClusterSharedVolumeState 

  •          To view the state of all CSVs on a subset of the nodes in your cluster

Get-ClusterSharedVolumeState –Node clusternode1,clusternode2

  •          To view the state of a subset of CSVs on all the nodes of your cluster  

Get-ClusterSharedVolumeState –Name "Cluster Disk 2","Cluster Disk 3"

                OR

Get-ClusterSharedVolume "Cluster Disk 2" | Get-ClusterSharedVolumeState

  

Understanding the state of your CSV 

The Get-ClusterSharedVolumeState cmdlet output provides two important pieces of information for a particular CSV – the state of the CSV and the reason why the CSV is in that particular state. There are three states of a CSV – Direct, File System Redirected and Block Redirected. I will now examine the output of this cmdlet for each of these states.
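
As a quick sketch (not from the original post), you can combine those two pieces of information to list only the volumes that are not running in Direct mode on some node, together with the reported reasons:

Get-ClusterSharedVolumeState |
    Where-Object { $_.StateInfo -ne 'Direct' } |
    Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize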

Direct Mode

In Direct Mode, I/O operations from the application on the cluster node can be sent directly to the storage, therefore bypassing the NTFS or ReFS volume stack.

 

 

File System Redirected Mode

In File System Redirected mode, I/O on a cluster node is redirected at the top of the CSV pseudo-file system stack over SMB to the disk. This traffic is written to the disk via the NTFS or ReFS file system stack on the coordinator node.

 

 

 

 

Note:

  •          When a CSV is in File System Redirected Mode, I/O for the volume will not be cached in the CSV Block Cache.
  •          Data deduplication occurs on a per-file basis. Therefore, when a file on a CSV volume is deduped, all I/O for that file will occur in File System Redirected mode, and I/O for the file will not be cached in the CSV Block Cache – it will instead be cached in the Deduplication Cache. For the remaining non-deduped files, CSV will be in Direct mode, and the state of the CSV will be reflected as Direct mode.
  •          The Failover Cluster Manager will show a volume as in Redirected Access only when it is in File System Redirected Mode and the FileSystemRedirectedIOReason is UserRequest.

  

Block Redirected Mode

In Block Redirected mode, I/O passes through the local CSVFS proxy file system stack and is written directly to Disk.sys on the coordinator node. As a result, it avoids traversing the NTFS/ReFS file system stack twice.

 

  

In conclusion, the Get-ClusterSharedVolumeState cmdlet is a powerful tool that enables you to understand the state of your Cluster Shared Volume and thus troubleshoot failures and optimize the performance of your private cloud storage infrastructure.

 

Thanks!
Subhasish Bhattacharya
Program Manager
Clustering and High Availability
Microsoft

Understanding the Repair Active Directory Object Recovery Action


One of the responsibilities of the cluster Network Name resource is to rotate the password of the computer object in Active Directory associated with it.  When the Network Name resource is online, it will rotate the password according to domain and local machine policy (which is 30 days by default).

If the password is different from what is stored in the cluster database, the cluster service will be unable to log on to the computer object and the Network Name will fail to come online.  This may also cause issues such as Kerberos errors, failure to register in a secure DNS zone, and live migration failures.

 

The Repair Active Directory Object option is a recovery tool to re-synchronize the password for cluster computer objects.  It can be found in Failover Cluster Manager (CluAdmin.msc) by right-clicking on the Network Name, selecting More Actions…, and then clicking Repair Active Directory Object.

  • Cluster Name Object (CNO) - The CNO is the computer object associated with the Cluster Name resource.  When using Repair on the Cluster Name, it will use the credentials of the currently logged on user and reset the computer objects password.  To run Repair, you must have the "Reset Password" permissions to the CNO computer object.
  • Virtual Computer Object (VCO) - The CNO is responsible for managing the passwords on all other computer objects (VCOs) for the other cluster network names in the cluster.  If the password for a VCO falls out of sync, the CNO will reset the password and self-heal automatically.  Therefore it is not necessary to run Repair to reset the password for a VCO.  In Windows Server 2012 a Repair action was added for all other cluster Network Names, and it behaves a little differently: Repair will check whether the associated computer object exists in Active Directory, and if the VCO had been accidentally deleted, Repair will re-create the missing computer object.  The recommended process to recover deleted computer objects is the AD Recycle Bin feature; using Repair to re-create computer objects after they have been deleted should be a last-resort recovery action.  This is because some applications store attributes in the computer object (namely MSMQ), and recreating a new computer object will break the application.  Repair is a safe action to perform on any SQL Server or File Server deployment.  The CNO must have "Create Computer Objects" permissions on the OU in which it resides to recreate the VCOs.

 


 

 

To run Repair, the Network Name resource must be in a "Failed" or "Offline" state; otherwise the option will be grayed out.
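
As a hedged sketch, you can list the Network Name resources that are currently in a Failed or Offline state (the states in which the Repair option is enabled) with something like the following; it only reports candidates, it does not perform the Repair itself:

Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'Network Name' -and ($_.State -eq 'Failed' -or $_.State -eq 'Offline') } |
    Format-Table Name, State, OwnerGroup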

Repair is only available through the Failover Cluster Manager snap-in; there is no PowerShell cmdlet available to script the action.

 

If you are running Windows Server 2012 and find that you have to repeatedly run Repair every ~30 days, ensure you have hotfix KB2838043 installed.

 

 

Matt Kurjanowicz
Senior Software Development Engineer
Clustering & High-Availability
Microsoft

 

How to Run ChkDsk and Defrag on Cluster Shared Volumes in Windows Server 2012 R2


Cluster Shared Volumes (CSV) is a layer of abstraction on either the ReFS or NTFS file system (which is used to format the underlying private cloud storage). Just as with a non-CSV volume, at times it may be necessary to run ChkDsk and Defrag on the file system. In this blog, I am going to first address the recommended procedure to run Defrag on your CSV, in Windows Server 2012 R2. I will then discuss how ChkDsk is run on your CSVs.

Procedure to run Defrag on your CSV:

Fragmentation of files on a CSV can impact perceived file system performance by increasing the seek time needed to retrieve file system metadata. It is therefore recommended to periodically run Defrag on your CSV volume. Fragmentation is primarily a concern when running dynamic VHDs and is less prevalent with static VHDs. On a stand-alone server, defrag runs automatically as part of the "Maintenance Task". However, on a CSV volume it will never run automatically, so you need to run it manually or script it to run (potentially using a Clustered Scheduled Task). It is recommended to conduct this process during non-peak production times, as performance may be impacted.  The following are the steps to defragment your CSV:

1.       Determine if defragmentation is required for your CSV by running the following on an elevated command prompt:                                                                                

Defrag.exe <CSV Mount Point> /A /U /V

/A           Perform analysis on the specified volumes

/U           Print the progress of the operation on the screen                  

/V           Print verbose output containing the fragmentation statistics
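
For example, assuming a CSV mounted at the default C:\ClusterStorage\Volume1 path (substitute your own mount point), the analysis run would look like this:

Defrag.exe C:\ClusterStorage\Volume1 /A /U /V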

 

Note:

  • If your CSV is backed by thinly provisioned storage, slab consolidation analysis (not the actual slab consolidation) is run during defrag analysis. Slab consolidation analysis requires the CSV to be placed in redirected mode before execution. Please refer to step 2, for instructions on how to place your CSV into redirected mode.

 

2.       If defragmentation is required for your CSV, put the CSV into redirected mode.  This can be achieved in either of the following ways:

a.       Using Windows PowerShell© open a new elevated Windows PowerShell console and run the following:

Suspend-ClusterResource <Cluster Disk Name> -RedirectedAccess

 

b.      Using the Failover Cluster Manager right-click on the CSV and select “Turn On Redirected Access”:

Note:

  • If you attempt to run Defrag on a CSV without first putting it in redirected mode, it will fail with the following error:

CSVFS failed operation as volume is not in redirected mode. (0x8007174F)

 

3.       Run defrag on your CSV by running the following on an elevated command prompt:

Defrag.exe <CSV Mount Point>

 

4.       Once defrag has completed, revert the CSV back to direct mode by using either of the following methods:

a.       Using Windows PowerShell© open a new elevated Windows PowerShell console and run the following:

Resume-ClusterResource <Cluster Disk Name>

 

b.      Using the Failover Cluster Manager right-click on the CSV and select “Turn Off Redirected Access”:
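
Putting the steps together, here is a minimal end-to-end sketch, assuming a CSV backed by the cluster disk named "Cluster Disk 1" and mounted at C:\ClusterStorage\Volume1 (both names are examples; adjust them for your environment):

# 2. Place the CSV into file system redirected mode
Suspend-ClusterResource "Cluster Disk 1" -RedirectedAccess -Force
# 3. Defragment the volume
Defrag.exe C:\ClusterStorage\Volume1
# 4. Return the CSV to direct mode
Resume-ClusterResource "Cluster Disk 1"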

 

How is ChkDsk run on your CSV:

During the lifecycle of your file system, corruptions may occur which require resolution through ChkDsk. As you are aware, CSVs in Windows Server 2012 R2 also support the ReFS file system. However, the ReFS file system achieves self-healing through integrity checks on metadata; as a consequence, ChkDsk does not need to be run for CSV volumes with the ReFS file system. Thus, this discussion is scoped to corruptions on CSVs with the NTFS file system. Also note the redesigned ChkDsk operation introduced with Windows Server 2012, which separates the ChkDsk scan for errors (an online operation) from the ChkDsk fix (an offline operation). This results in higher availability for your Private Cloud storage, since you only need to take your storage offline to fix corruptions (which is a significantly faster process than the scan for corruptions). In Windows Server 2012, we integrated ChkDsk /SpotFix into the cluster IsAlive health check for the Physical Disk Resource corresponding to the CSV. As a consequence, we will now attempt to fix corruptions in your CSV without any perceptible downtime for your application.

Detection of Corruptions – ChkDsk /Scan:

The following is the workflow on Windows Server 2012 R2 systems to scan for NTFS corruptions:

 

Note:

  • If the system is never idle it is possible that the ChkDsk scan will never be run. In this case the administrator will need to invoke this operation manually. To invoke this operation manually, on an elevated command prompt run the following:

chkdsk.exe <CSV mount point name> /scan

Resolution of CSV corruptions during Physical Disk Resource IsAlive Checks:

The following is the CSV workflow in Windows Server 2012 R2 to fix corruptions:

 

Note:

  • In the rare event that a single CSV corruption takes greater than 15 seconds to fix, the above workflow will not resolve the error.  In this case the administrator will need to manually fix the error.  A CSV does not need to be placed in maintenance or redirected mode before invoking ChkDsk; the CSV will re-establish its state automatically once the ChkDsk run has completed. To invoke this operation manually, on an elevated command prompt run the following:

              chkdsk.exe <CSV mount point name> /SpotFix

Running Defrag or ChkDsk through Repair-ClusterSharedVolume cmdlet:

Running Defrag or ChkDsk on your CSV through the Repair-ClusterSharedVolume cmdlet is deprecated. It is instead highly encouraged to use Defrag.exe or ChkDsk.exe directly for your CSV, using the procedure indicated in the preceding sections. The use of the Repair-ClusterSharedVolume cmdlet, however, is still supported by Microsoft. To use this cmdlet to run ChkDsk or Defrag, run the following on a new elevated Windows PowerShell console:

Repair-ClusterSharedVolume <Cluster Disk Name> -ChkDsk –Parameters <ChkDsk parameters>

Repair-ClusterSharedVolume <Cluster Disk Name> –Defrag –Parameters <Defrag parameters> 

You can determine the Cluster Disk Name corresponding to your CSV using the Get-ClusterSharedVolume cmdlet by running the following:

Get-ClusterSharedVolume | fl *
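
If you have several CSVs, a small sketch like the following (assuming the SharedVolumeInfo property exposed by Get-ClusterSharedVolume) can help map each cluster disk name to its mount point:

Get-ClusterSharedVolume | ForEach-Object {
    $diskName = $_.Name
    $_.SharedVolumeInfo | Select-Object @{ n = 'ClusterDiskName'; e = { $diskName } }, FriendlyVolumeName
}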

 

Thanks!

Subhasish Bhattacharya
Program Manager
Clustering and High Availability
Microsoft

Event ID 5120 in System Event Log


When conducting backups of a Windows Server 2012 or later Failover Cluster using Cluster Shared Volumes (CSV), you may encounter the following event in the System event log:

Log Name:  System
Source:  Microsoft-Windows-FailoverClustering
Event ID:  5120
Task Category: Cluster Shared Volume
Level:  Error
Description:  Cluster Shared Volume 'VolumeName' ('ResourceName') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Having an Event ID 5120 logged may or may not be a sign of a problem with the cluster, depending on the error code logged.  An Event 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR (c0130021) may be expected and can be safely ignored in most situations.

An Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR is logged on the node which owns the cluster Physical Disk resource when a VSS software snapshot that clustering knew of was deleted.  When a snapshot which Failover Clustering had knowledge of is deleted, clustering must resynchronize its view of the snapshots.

One scenario where an Event ID 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR may be logged is when using System Center Data Protection Manager (DPM), which may delete a software snapshot once a backup has completed.  When DPM requests deletion of a software snapshot, volsnap marks the snapshot for deletion.  However, volsnap conducts the deletion in an asynchronous fashion, at a later point in time.  Even though the snapshot has been marked for deletion, clustering will detect that the snapshot still exists and needs to handle it appropriately.  Eventually volsnap performs the actual deletion of the snapshot.  When clustering then notices that a snapshot it knew of was deleted, it must resynchronize its view of the snapshots.

Think of it as clustering being surprised by an un-notified software snapshot deletion, and the cluster service telling its various internal components that they need to resynchronize their views of the snapshots.

There are also a few other expected scenarios where volsnap will delete snapshots and, as a result, clustering will need to resynchronize its snapshot view – for example, if a copy-on-write fails due to lack of space or an IO error.  In these conditions volsnap will log an event in the system event log associated with those failures, so review the system event logs for other events accompanying the Event 5120; these could be logged on any node in the cluster.
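
One way to review them in bulk is to query the System log directly; the following is a sketch that pulls every Event 5120 from the FailoverClustering provider so you can inspect the error code in each description:

Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5120
} | Format-List TimeCreated, MachineName, Message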

 

Troubleshooting:

  1. If you see a few random Event 5120s with an error of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR or the error code c0130021, they can be safely ignored.  We recognize this is not optimal, as they create false-positive alarms and trigger alerts in management software.  We are investigating breaking out cluster state resynchronization into a separate non-error event in the future.

  2. If you are seeing many Event 5120s being logged, this is a sign that clustering constantly needs to resynchronize its snapshot state.  This could be a sign of a problem and may require engaging Microsoft support for investigation.

  3. If you are seeing Event 5120s logged with error codes other than STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR, it is a sign of a problem.  Be diligent and review the error code in the description of all of the 5120s logged to be certain, and be careful not to dismiss them because of a single event with STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR.  If you see other errors logged, there are fixes available that need to be applied.  Your first troubleshooting step should be to apply the recommended hotfixes in the appropriate article for your OS version:

    Recommended hotfixes and updates for Windows Server 2012-based failover clusters
    http://support.microsoft.com/kb/2784261

    Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters
    http://support.microsoft.com/kb/2920151

  4. If an Event 5120 is accompanied by other errors, such as an Event 5142 as below, it is a sign of a failure and should not be ignored.

Log Name:  System
Source:  Microsoft-Windows-FailoverClustering
Event ID:  5142
Task Category: Cluster Shared Volume
Level:  Error
Description:  Cluster Shared Volume 'VolumeName' ('ResourceName') is no longer accessible from this cluster node because of error 'ERROR_TIMEOUT(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

Thanks!
Elden Christensen
Principal Program Manager Lead
Clustering & High-Availability
Microsoft

Cluster Shared Volume Diagnostics


This is the second blog post in a series about Cluster Shared Volumes (CSV). In this post we will go over diagnostics. We assume that the reader is familiar with the previous blog post, which explains the CSV components and the different CSV IO modes: http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx

Is Direct IO on this Volume Possible?

Let's assume you have created a cluster, added a disk to Cluster Shared Volumes, you see that the disk is online, and the path to the volume (let's say C:\ClusterStorage\Volume1) is accessible.  The very first question you might have is whether Direct IO is even possible on this volume. With Windows Server 2012 R2 there is a PowerShell cmdlet that attempts to answer exactly that question:

Get-ClusterSharedVolumeState [[-Name] <StringCollection>] [-Node <StringCollection>] [-InputObject <psobject>]    [-Cluster <string>]  [<CommonParameters>]

 

If you run this PowerShell cmdlet, providing the name of the cluster Physical Disk Resource, then for each cluster node it will tell you whether the volume on that node is in File System Redirected mode or Block Level Redirected mode, and it will tell you the reason.

Here is how the output looks if Direct IO is possible:

PS C:\Windows\system32> get-ClusterSharedVolumeState -Name "Cluster Disk 1"
Name                         : Cluster Disk 1
VolumeName                   : \\?\Volume{1c67fa80-1171-4a9e-9f41-0bb132e88ee4}\
Node                         : clus01
StateInfo                    : Direct
VolumeFriendlyName           : Volume1
FileSystemRedirectedIOReason : NotFileSystemRedirected
BlockRedirectedIOReason      : NotBlockRedirected

Name                         : Cluster Disk 1
VolumeName                   : \\?\Volume{1c67fa80-1171-4a9e-9f41-0bb132e88ee4}\
Node                         : clus02
StateInfo                    : Direct
VolumeFriendlyName           : Volume1
FileSystemRedirectedIOReason : NotFileSystemRedirected
BlockRedirectedIOReason      : NotBlockRedirected

 

In the output above you can see that Direct IO on this volume is possible on both cluster nodes.

If we put this disk in File System Redirected mode using

PS C:\Windows\system32> Suspend-ClusterResource -Name "Cluster Disk 1" -RedirectedAccess -Force

Name                                    State                                   Node
----                                    -----                                   ----
Cluster Disk 1                          Online(Redirected)                      clus01

Then the output of Get-ClusterSharedVolumeState will change to:

PS C:\Windows\system32> get-ClusterSharedVolumeState -Name "Cluster Disk 1"

Name                         : Cluster Disk 1
VolumeName                   : \\?\Volume{1c67fa80-1171-4a9e-9f41-0bb132e88ee4}\
Node                         : clus01
StateInfo                    : FileSystemRedirected
VolumeFriendlyName           : Volume1
FileSystemRedirectedIOReason : UserRequest
BlockRedirectedIOReason      : NotBlockRedirected

Name                         : Cluster Disk 1
VolumeName                   : \\?\Volume{1c67fa80-1171-4a9e-9f41-0bb132e88ee4}\
Node                         : clus02
StateInfo                    : FileSystemRedirected
VolumeFriendlyName           : Volume1
FileSystemRedirectedIOReason : UserRequest
BlockRedirectedIOReason      : NotBlockRedirected

 

You can turn off File System Redirected mode using the following cmdlet:

PS C:\Windows\system32> resume-ClusterResource -Name "Cluster Disk 1"

Name                                    State                                   Node
----                                    -----                                   ----
Cluster Disk 1                          Online                                  clus01

 

The state of a CSV volume does not have to be the same on all nodes. For instance, if a disk is not connected to all the nodes, then you might see the volume in Direct mode on the nodes where the disk is connected and in Block Redirected mode on the nodes where it is not connected.

A CSV volume might be in Block Level Redirected mode for one of the following reasons:

  • NoDiskConnectivity – The disk is not visible on / connected to this node. You need to validate your SAN settings.
  • StorageSpaceNotAttached – The Space is not attached on this node. Many Storage Spaces on-disk formats are not trivial and cannot be accessed for read/write by multiple nodes at the same time. The cluster enforces that a Space is accessible by only one cluster node at a time: the Space is detached on all other nodes and is attached only on the node where the corresponding Physical Disk Resource is online. The only type of Space that can be attached on multiple nodes is a Simple Space that does not have a write-back cache.

When you are using a Mirrored or Parity Space, you will most often see that the volume is in Direct IO mode on the coordinator node and in Block Redirected mode on all other nodes, and the reason for the Block Redirected mode is StorageSpaceNotAttached. Please note that if a Space uses a write-back cache, it will always be in Block Redirected mode, even if it is a Simple Space.

 

A CSV might be in File System Redirected mode for one of the following reasons:

  • UserRequest – The user put the volume in a redirected state. This can be done using the Failover Cluster Manager snap-in or the PowerShell cmdlet Suspend-ClusterResource.
  • IncompatibleFileSystemFilter – An incompatible file system filter is attached to the NTFS/ReFS file system. Use "fltmc instances", the system event log, and the cluster log to learn more. Usually this means you have installed a storage solution that uses a file system filter; in the previous blog post you can find samples of fltmc output. To resolve this you can either disable or uninstall the filter. The presence of a legacy file system filter will always disable Direct IO. If the solution uses a file system minifilter driver, then filters present at the following altitudes will cause CSV to stay in File System Redirected mode:
    • 300000 – 309999 Replication
    • 280000 – 289999 Continuous Backup
    • 180000 – 189999 HSM
    • 160000 – 169999 Compression
    • 140000 – 149999 Encryption

      The reason is that some of these filters might do something that is not compatible with Direct IO or Block Level Redirected IO. For instance, a replication filter might assume that it will observe all IO so it can replicate data to the remote site. A compression or encryption filter might need to modify data before it goes to/from the disk. If we perform Direct IO or Block Redirected IO we will bypass these filters attached to NTFS and consequently might corrupt data. Our choice is to be safe by default, so we put the volume in File System Redirected mode if we notice a filter at one of the above altitudes attached to this volume. You can explicitly inform the cluster that a filter is compatible with Direct IO by adding the minifilter name to the cluster common property SharedVolumeCompatibleFilters. Conversely, if you have a filter at an altitude that is not automatically treated as incompatible, but you know it is not compatible with Direct IO, you can add that minifilter to the cluster property SharedVolumeIncompatibleFilters (see the sketch after this list).
  • IncompatibleVolumeFilter - An incompatible volume filter is attached below NTFS/ReFS. Use the system event log and cluster log to learn more. The reasons and solution are similar to what we've discussed above.
  • FileSystemTiering - The volume is in File System Redirected mode because it is a Tiered Space with heatmap tracking enabled. The tiering heatmap assumes that it can see every IO. Information about IO operations is produced by ReFS/NTFS; if we perform Direct IO, the statistics will be incorrect and the tiering engine could make incorrect placement decisions by moving hot data to a cold tier or vice versa. You can control whether the per-volume heatmap is enabled or disabled using

    fsutil.exe tiering setflags/clearflags with flag /TrNH

    If you choose to disable the heatmap, you can control which files should go to which tier by pinning them to a tier using the PowerShell cmdlet Set-FileStorageTier, and then running Optimize-Volume with –TierOptimize. Please note that for Optimize-Volume to work on a CSV volume you need to put the volume in File System Redirected mode using Suspend-ClusterResource. You can learn more about Storage Spaces tiering from this blog post: http://blogs.technet.com/b/josebda/archive/2013/08/28/step-by-step-for-storage-spaces-tiering-in-windows-server-2012-r2.aspx
  • BitLockerInitializing – The volume is in a redirected state because we are waiting for BitLocker to finish the initial encryption of this volume.
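
For the filter-related cases, one way the cluster common properties mentioned above could be set is sketched below; the filter names are hypothetical and the exact property types may differ, so check the current values first with Get-Cluster | Format-List SharedVolume*:

# Hypothetical filter names - replace with the minifilter names reported by fltmc.exe
(Get-Cluster).SharedVolumeCompatibleFilters   = @('MyReplFlt')
(Get-Cluster).SharedVolumeIncompatibleFilters = @('MyLegacyFlt')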

If Get-ClusterSharedVolumeState reports that a volume on a node is in the Direct IO state, does that mean absolutely all IO will go the Direct IO way? The answer is: it is not so simple.

Here is another blog post that covers Get-ClusterSharedVolumeState PowerShell cmdlet http://blogs.msdn.com/b/clustering/archive/2013/12/05/10474312.aspx .

Is Direct IO on this File Possible?

Even if a CSV volume is in Direct IO or Volume Level Redirected mode, a number of preconditions have to be true before Direct IO can be performed on a file:

  • CSVFS understands the on-disk file format
    • For example, the file is not sparse, compressed, encrypted, resilient, etc.
  • There are no file system filters that might modify the file layout or expect to see all IO
    • For example, file system minifilters that provide compression, encryption, replication, etc.
  • There are no File System filters that object to Direct IO on the stream. An example would be the Windows Server Deduplication feature. When you install deduplication and enable it on a CSV volume it will NOT disable Direct IO on all files. Instead it will veto Direct IO only for the files that have been optimized by dedup.
  • CsvFs was able to make sure NTFS/ReFS will not change the location of the file data on the volume – the file is pinned. If NTFS relocated a file's blocks while CSVFS was performing Direct IO, that could result in volume corruption.
  • There are no applications that need to make sure IO is observed by the NTFS/ReFS stack. There is an FSCTL that an application can send to the file system to tell it to keep the file in File System Redirected mode for as long as this application has the file opened. The file is switched back as soon as the application closes the file.
  • CSVFS has appropriate oplock level. Oplocks guarantee cross node cache coherency. Oplocks are documented on MSDN http://msdn.microsoft.com/en-us/library/windows/hardware/ff551011(v=vs.85).aspx
    • Read-Write-Handle (RWH) or Read-Write (RW) for write. If CSVFS was able to obtain this level of oplock that means this file is opened only from this node.
    • Read-Write-Handle (RWH) or Read-Handle (RH) or Read-Write (RW) or Read (R) for reads. If CSVFS was able to obtain RH or R oplock then this file is opened from multiple nodes, but all nodes perform only file read or other operations that do not modify file content.
  • CSVFS was able to purge cache on NTFS/REFS. Make sure there is no stale cache on NTFS/REFS.

If any of the preconditions is not true, then IO is dispatched using File System Redirected mode. If all preconditions are true, then CSVFS translates the IO from file offsets to volume offsets and sends it to the CSV Volume Manager. Keep in mind that the CSV Volume Manager might send it using Direct IO to the disk when the disk is connected, or it might send it over SMB to the disk on the Coordinator node using Block Level Redirected IO. The CSV Volume Manager always prefers Direct IO; Block Level Redirected IO is used only when the disk is not connected or when the disk fails IO.

Summary

To provide high availability and good performance, CSVFS has several alternative ways in which IO can be dispatched to the disk. This post demonstrated some of the tools that can be used to analyze why a CSV volume chooses one IO path versus another.

Thanks!
Vladimir Petter
Principal Software Development Engineer
Clustering & High-Availability
Microsoft

 

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Failover Clustering and IPv6 in Windows Server 2012 R2


In this blog, I will discuss some common questions pertaining to IPv6 and Windows Server 2012 R2 Failover Clusters.

What network protocol does Failover Clustering default to?

If both IPv4 and IPv6 are enabled (which is the default configuration), IPv6 will always be used by clustering. The key takeaway is that it is not required to configure IPv4 when the IPv6 stack is enabled, and you can go as far as to unbind IPv4. Additionally, you can use link-local (fe80) IPv6 addresses for your internal cluster traffic, so IPv6 can be used for clustering even if you don't use IPv6 for your public-facing interfaces. Note that you can only have one cluster network using IPv6 link-local (fe80) addresses in your cluster. All networks that have IPv6 also have an IPv6 link-local address, which is ignored if any IPv4 or other IPv6 prefix is present.

Should IPv6 be disabled for Failover Clustering?

The recommendation for Failover Clustering and Windows in general, starting in 2008 RTM, is to not disable IPv6 for your Failover Clusters. The majority of the internal testing for Failover Clustering is done with IPv6 enabled. Therefore, having IPv6 enabled will result in the safest configuration for your production deployment.

Will Failover Clustering cease to work if IPv6 is disabled?

A common misconception is that Failover Clustering will cease to work if IPv6 is disabled. This is incorrect. The Failover Clustering release criterion includes functional validation in an IPv4-only environment.

How does Failover Clustering handle IPv6 being disabled?

There are two levels at which IPv6 can be disabled:

1)      At the adapter level: This is done by unbinding the IPv6 stack by launching ncpa.cpl and unchecking “Internet Protocol Version 6 (TCP/IPv6)”. 

Failover Clustering behavior: NetFT, the virtual cluster adapter, will still tunnel traffic using IPv6 over IPv4.

2)      At the registry level: This can be done using the following steps:

  1. Launch regedit.exe
  2. Navigate to the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP6\Parameters key.
  3. Right-click Parameters in the left sidebar, choose New > DWORD (32-bit) Value, and create an entry DisabledComponents with value FF.
  4. Restart your computer to disable IPv6.

Failover Clustering behavior: This is the only scenario where NetFT traffic will be sent entirely over IPv4. It is to be noted that this is not recommended and not the mainstream tested code path. 
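
For reference, the registry change described in the steps above could also be made from an elevated PowerShell prompt; this is just a sketch of the same change (a restart is still required afterwards):

New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\TCPIP6\Parameters' `
                 -Name 'DisabledComponents' -PropertyType DWord -Value 0xFF -Force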

Any gotchas with using Symantec Endpoint Protection and Failover Clustering?

A default Symantec Endpoint Protection (SEP) firewall policy has rules that block IPv6 communication and IPv6 over IPv4 communication, which conflicts with Failover Clustering communication over IPv6 or IPv6 over IPv4. Currently the Symantec Endpoint Protection firewall doesn't support IPv6. This is also indicated in the guidance from Symantec here. The default firewall policies in SEP Manager are shown below:

It is therefore recommended that if SEP is used on a Failover Cluster, the rules indicated above blocking IPv6 and IPv6 over IPv4 traffic be disabled. Also, refer to the following article - About Windows and Symantec firewalls

Do Failover Clusters support static IPv6 addresses?

The Failover Cluster Manager and clustering in general are streamlined for the most common case (in which customers do not use static IPv6 addresses). Networks are configured automatically, in that the cluster will automatically generate IPv6 addresses for the IPv6 Address resources on your networks. If you prefer to select your own statically assigned IPv6 addresses, you can reconfigure the IPv6 Address resources using PowerShell as follows (this cannot be specified when the cluster is created):

Open a Windows PowerShell® console as an Administrator and do the following:

1)  Create a new IPv6 Cluster IP Resource

Add-ClusterResource -Name "IPv6 Cluster Address" -ResourceType "IPv6 Address" -Group "Cluster Group"

2)  Set the properties for the newly created IP Address resource

Get-ClusterResource "IPv6 Cluster Address" | Set-ClusterParameter –Multiple @{"Network"="Cluster Network 1"; "Address"= "2001:489828:4::";"PrefixLength"=64}

3)  Stop the netname which corresponds to this static IPv6 address

Stop-ClusterResource "Cluster Name"

4)  Create a dependency between the netname and the static IPv6 address

Set-ClusterResourceDependency "Cluster Name" "[Ipv6 Cluster Address]"

You might consider having an OR dependency between the netname and the static IPv6 and IPv4 addresses as follows:

Set-ClusterResourceDependency "Cluster Name" "[Ipv6 Cluster Address] or [Ipv4 Cluster Address]"

5)  Restart the netname

Start-ClusterResource "Cluster Name"

 

For name resolution, if you prefer not to use dynamic DNS, you can configure DNS mappings for the address automatically generated by the cluster, or you can configure DNS mappings for your static address. Also note that Cluster IPv6 Address resources do not support DHCPv6.

 

Thanks!

Subhasish Bhattacharya                                                                                                               

Program Manager                                                                                          

Clustering & High Availability                                                                                      

Microsoft           

Configuring a File Share Witness on a Scale-Out File Server


In this blog, I am going to discuss the considerations for configuring a File Share Witness (FSW) for the Failover Cluster hosting your workloads, on a separate Scale-Out File Server cluster. You can find more information on Failover Clustering quorum here.

File Share Witness on a Scale-Out File Server

It is supported to use a file share hosted on a Scale-Out File Server cluster as a witness. It is recommended that the following guidelines be considered when configuring your File Share Witness on a Scale-Out File Server:

  • Starting in Windows Server 2012 R2, the recommendation is to always configure a Witness for your cluster. The cluster will now automatically determine if the Witness is to have a vote in determining quorum for the cluster.
  • Create a new Server Message Block (SMB) share on the Scale-Out File Server dedicated to witness use. Note that the same share can be used as a witness for multiple clusters.
  • Ensure that the File Share has a minimum of 5MB provisioned per cluster it is used for.
  • The Scale-Out File Server hosting the file share to be used as a quorum witness should not be created within a Virtual Machine hosted on the same cluster for which the File Share Witness is being created. 
  • Multi-site stretched-clusters:
    • With the Service Level Agreement (SLA) of automatic failover across sites, it is necessary that the Scale-Out File Server backing the File Share Witness be hosted in an independent third site. This gives the sites with nodes participating in quorum an equal opportunity to survive if a site experiences a power outage or the WAN link connectivity breaks.
    • With the SLA of manual failover across sites, we still recommend that the Scale-Out File Server backing the File Share Witness be hosted in an independent third site. This simplifies the recovery steps necessary in case of a primary site power outage. You may also configure the Scale-Out File Server to be hosted in the primary site. However note that this would require recreating the quorum witness while recovering the cluster from the Backup Disaster Recovery site.
  • Create a non-CA file share for the witness on the Scale-Out File server. A non-CA file share can result in a faster failover of the file share witness resource in the event the Scale-Out File server cluster is unavailable. For a CA share, the file share witness resource may not experience an immediate failure and may only timeout after the 90 second quorum timeout window. On a non-CA share, the file share witness resource will fail immediately, triggering remedial actions from the cluster service. Setting up the configuration for a non-CA share is explained in the next section. 
  • The Scale-Out File Server hosting the File Share Witness should be a member of a domain in the same forest as the cluster it is a Witness for. This is because the Cluster uses the Cluster Name Object to set the permissions on a folder in the share containing the cluster specific information. This ensures that the Cluster has appropriate permissions needed to maintain appropriate cluster state in the share. Additionally, the cluster administrator configuring the File Share Witness needs to have Full Control permissions to the share. This is necessary to set the permissions for the Cluster Name Object to the folder in the share.
  • It is important that the file share created on the Scale-Out File Server is not part of a Distributed File System (DFS) Namespace. The cluster needs to be able to arbitrate a single point for quorum.

 

Configuring a File Share Witness on a Scale-Out File Server

In this section, I will explain how you can create a file share on a Scale-Out File Server that will act as a witness for the cluster hosting your workloads. Therefore, you have two clusters – a “storage” cluster hosting your file share witness and a “compute” cluster hosting your highly available workloads.

You can configure a File Share Witness on a Scale-Out File Server as follows:

1)      Create the Scale-Out File Server as described in Section 2.1 of this article

2)      Create a file share on the Scale-Out File Server as described in Section 2.2 of this article.

a.       Modify the properties of the share to make it a non-CA share. Right-click on the share, select Settings and uncheck the Enable continuous availability checkbox.

 

b.      Ensure that you have Full Control to the newly created share. 
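Alternatively, the non-CA share can be created directly with PowerShell on the Scale-Out File Server. This is a hedged sketch; the share name, path, scope name, and the accounts granted Full Control are placeholders for your environment:

# Sketch: create a share scoped to the Scale-Out File Server role with continuous availability disabled.
New-SmbShare -Name "Witness" -Path "C:\ClusterStorage\Volume1\Witness" -ScopeName "SOFS" -ContinuouslyAvailable $false -FullAccess "CONTOSO\ClusterAdmin", "CONTOSO\ComputeCluster$"

The ComputeCluster$ account above stands in for the Cluster Name Object of the cluster that will use the witness, per the permissions guidance in the previous section.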

 

3)     Configure the File Share as a Witness on your cluster

a.       Using the Failover Cluster Manager

i.     Type cluadmin.msc at an elevated command prompt.

ii.     Launch the Quorum Wizard by right-clicking on the cluster name, selecting More Actions, and then selecting Configure Quorum Settings.

iii.     Select Next, choose the Select the quorum witness option, and select Next.

iv.     Choose the Configure a file share witness option and select Next.

v.     Specify the path to the file share on your Scale-Out File Server and select Next.

 

b.      Using Windows PowerShell

i.     Open a Windows PowerShell® console as an Administrator.

ii.   Type Set-ClusterQuorum -FileShareWitness <File Share Witness Path>

 

You should now see the File Share Witness configured for your Cluster.
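For example, with a placeholder UNC path, the PowerShell route from step (b) and a quick confirmation would look like this:

# Sketch: \\SOFS\Witness is a placeholder for the share on your Scale-Out File Server.
Set-ClusterQuorum -FileShareWitness \\SOFS\Witness
# The QuorumResource in the output should now be the File Share Witness.
Get-ClusterQuorum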

 

When you navigate to your File Share Witness share you will see a folder created for your Cluster.

 

This folder will have permissions for the Cluster Name Object of your “Compute” Cluster so that the entries in the folder can be modified on Cluster membership changes.

 

You will also notice a file Witness.log which contains the membership information for the Cluster.

 

You have now successfully configured a File Share Witness on a Scale-Out File Server, for the cluster hosting your workloads.

 

Thanks!

Subhasish Bhattacharya

Program Manager

Clustering and High Availability

Microsoft


Deploying SQL Server 2014 with Cluster Shared Volumes


An exciting new feature in SQL Server 2014 is the support for the deployment of a Failover Cluster Instance (FCI) with Cluster Shared Volumes (CSV). In this blog, I am going to discuss the value of deploying SQL Server with CSV as well as how you can deploy SQL with CSV. I will also be discussing this topic at TechEd North America 2014 at the following session:

 

Update post TechEd: You can now find this session online to watch on-demand here.

 

Value of Deploying SQL 2014 with CSV

A SQL 2014 deployment with Cluster Shared Volumes provides several advantages over a deployment on “traditional” cluster storage.

 

Scalability

Consolidation of multiple SQL instances: With traditional cluster storage, each SQL instance requires a separate LUN to be carved out. This is because the LUN needs to fail over with the SQL instance. CSV allows nodes in the cluster to have shared access to storage, which facilitates the consolidation of SQL instances by storing multiple SQL instances on a single CSV.

Better capacity planning, storage utilization: Consolidating multiple SQL instances on a single LUN makes the storage utilization more efficient.

Addresses the drive letter limitation: Traditionally, the number of SQL instances that can be deployed on a cluster is limited by the number of available drive letters (24, excluding the system drive and a drive for a peripheral device). There is no limit to the number of mount points for CSV, so the scalability of your SQL deployment is enhanced.

 

Availability

Resilience from storage failures: When storage connectivity on a node is disrupted, CSV routes traffic over the network using SMB 3.0 allowing the SQL instance to remain operational. In a traditional deployment, the SQL instance would need to be failed over to a node with connectivity to the storage, resulting in downtime.

Fast failover: Given that nodes in a cluster have shared access to storage, a SQL Server failover no longer requires the dismounting and remounting of volumes. Additionally, the SQL Server database is moved without drive ownership changes.

Zero downtime Chkdsk: CSV integrates with the improvements in Chkdsk in Windows Server 2012 to provide a disk repair without any SQL Server downtime.

 

Operability

With CSV, the management of your SQL Server instance is simplified. You are able to manage the underlying storage from any node because CSV abstracts which node owns the disk.

 

Performance and Security

CSV Block Cache: CSV provides a distributed read-only cache for unbuffered I/O to SQL databases.
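As a side note on tuning, the size of the CSV block cache is a cluster common property. A hedged sketch of enabling a 512 MB cache on Windows Server 2012 R2 (the value is an assumption to adjust for your deployment):

# Sketch: set the CSV block cache to 512 MB for the whole cluster.
(Get-Cluster).BlockCacheSize = 512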

BitLocker Encrypted CSV: With the CSV integration with BitLocker you have an option to secure your deployments outside your datacenters such as at branch offices. Volume level encryption allows you to meet compliance requirements.

 

How to deploy a SQL Server 2014 FCI on CSV

You can deploy a SQL Server 2014 FCI on CSV with the following steps:

Note: The steps to deploy a SQL Server FCI with CSV are identical to those for traditional storage, except for steps 3, 4 and 19 below. The remaining steps are provided as a reference. For detailed instructions on the installation steps for a "traditional" FCI deployment refer to: http://technet.microsoft.com/en-us/library/hh231721.aspx

1)     Create the cluster which will host the FCI deployment.

2)     Run validation on your cluster and ensure that there are no errors.
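As a reference, a minimal PowerShell sketch of steps 1 and 2; the cluster name, node names, and static IP address are placeholders:

# Sketch: create the cluster, then run validation and review the report for errors.
New-Cluster -Name SQLCLUS -Node SQLNode1, SQLNode2 -StaticAddress 192.168.1.50
Test-Cluster -Cluster SQLCLUS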

3)     Provision storage for your cluster. Add the storage to the cluster. You may rename the cluster disks corresponding to the storage for your convenience. Add the cluster disks to CSV.

4)     Rename your CSV mount points to enhance your manageability
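A hedged PowerShell sketch of steps 3 and 4; the disk resource name and the new mount point name are placeholders:

# Sketch: add available disks to the cluster, convert one to CSV, and rename its mount point.
Get-ClusterAvailableDisk | Add-ClusterDisk
Add-ClusterSharedVolume -Name "Cluster Disk 1"
Rename-Item -Path C:\ClusterStorage\Volume1 -NewName SQLData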

5)     Install .NET Framework 3.5

Using Windows PowerShell®

Using Server Manager
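If you take the PowerShell route, a minimal sketch is shown below; the -Source path points at the sxs folder of the installation media and is only needed when the feature payload is not present locally:

# Sketch: install .NET Framework 3.5 from the installation media.
Install-WindowsFeature -Name NET-Framework-Core -Source D:\sources\sxs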

6)     Begin the SQL installation on the first cluster node. Choose the Installation tab and select the New SQL Server failover cluster installation option.

7)     Enter the Product Key

8)     Accept the License Terms

9)     Choose to use Microsoft Update to check for updates.

10)  Failover Cluster rules will be installed. It is essential that this step completes without errors.

11)  Choose the SQL Server Feature Installation option.

12)  Select the Database Engine Services and Management Tools – Basic features.

 

13)  Provide a Network Name for your SQL instance.

14)  Specify a name for the SQL Server cluster resource group or proceed with the default.

15)  Proceed with the default Cluster Disk selected. We will adjust this selection in step 19.

16)  Choose both the IPv4 and IPv6 networks if available.

17)  Configure your SQL Server Agent and Database Engine accounts

18)  Specify your SQL Server administrators and choose your authentication mode.

19)  Select the Data Directories tab. This allows you to customize the Cluster Shared Volumes paths where you want to store the files corresponding to your SQL Database.


 

20)  Proceed with the final SQL Server installation.

 

 

On completion of installation you will now see the FCI data files stored in the CSV volumes specified.

Failover Cluster Manager (type cluadmin.msc on an elevated command prompt to launch) will reflect the SQL server instance deployed.

 


 

21)  Now add the other cluster nodes to the FCI. In the SQL Server Installation Center, choose the Add node to a SQL Server failover cluster option.

22)  Analogous to the installation on node 1, proceed with the addition of the cluster node to the FCI.

 

 

 

 

 

 

Once your installation is done you can test a failover of your SQL instance through the Failover Cluster Manager. Right Click on the SQL Server role and choose to Move to the Best Possible Node.

 

Note the difference with CSV. Your CSV will remain online for the duration of the SQL Server failover. There is no need to failover the storage to the node the SQL Server instance is moved to.

 

 

 

Thanks and hope to see you at TechEd!

 

Subhasish Bhattacharya                                                                                                               

Program Manager                                                                                          

Clustering & High Availability                                                                                      

Microsoft

 

 

Sessions at TechEd Houston 2014 from the Cluster team


The Cluster team presented multiple exciting sessions at the sold-out TechEd Houston from May 12-15th! If you didn’t get a chance to attend the conference in-person, the sessions are now posted online so you can watch on-demand!  Here are the sessions, their descriptions, and links to the videos.

The sessions from the clustering team at TechEd Houston:

1)      DCIM-B354 Failover Clustering: What's New in Windows Server 2012 R2

This session will give the complete roadmap of the wealth of new Failover Clustering features and capabilities which enable high availability scenarios in Windows Server 2012 R2. If you are going to watch one session from TechEd on clustering / high availability… this is it! This session will cover all the incremental feature improvements from Windows Server 2012 to Windows Server 2012 R2 for clustering and availability.

2)      DCIM-B364 Step-by-Step to Deploying Microsoft SQL Server 2014 with Cluster Shared Volumes

SQL Server 2014 now supports deploying Failover Cluster Instances on top of Windows Server 2012 R2 Cluster Shared Volumes (CSV). You can now leverage the same CSV storage deployment model you do for your Hyper-V and Scale-out File Server deployments, with SQL Server. This session walks through how to configure a highly available SQL Server 2014 on top of Cluster Shared Volumes (CSV). It discusses best practices and recommendations. 

3)      DCIM-B349 Software-Defined Storage in Windows Server 2012 R2 and Microsoft System Center 2012 R2

Regardless of whether you’re building on a private infrastructure, in a hybrid environment, or deploying to a public cloud, there are optimizations you can make in storage and availability that will improve the manageability and performance of your application and environment. Join this session to hear more about the end-to-end scale, performance, and availability improvements coming with Windows Server. We dive into deploying and managing storage using SMB shares, show the improved experience in everyday storage management such as deploying patches to the cloud, and share how to leverage faster live migration when responding to new load demands. Starting with Windows Server 2012, Microsoft offered a complete storage stack from the hardware up, leveraging commodity JBODs surfaced as Virtual Disks via Storage Spaces, hosted by Scale-Out File Server nodes and servicing client requests via SMB 3.0. Now, with major features added in Windows Server 2012 R2 (e.g., Storage Tiering and SMB Redirection), the story gets even better! As a critical piece in the Modern Datacenter (i.e., a Software-Defined Datacenter), SDS plays a crucial role in improving utilization and increasing cost efficiency, scalability, and elasticity. This session empowers you to architect, implement, and monitor this key capability. Come learn how to design, deploy, configure, and automate a storage solution based completely on Windows technologies, as well as how to troubleshoot and manage via the in-box stack and Microsoft System Center. Live demos galore! 

4)      FDN06 Transform the Datacenter: Making the Promise of Connected Clouds a Reality

Cloud computing continues to shift the technology landscape, but most are still working to truly harness its potential for their organizations. How can you bring cloud computing models and technologies into your datacenter? How can you implement a hybrid approach across clouds that delivers value and meets your unique needs? This foundational session is all about making it real from the ground up. Topics include new innovations spanning server virtualization and infrastructure as a service, network virtualization and connections to the cloud, ground-breaking on-premises and cloud storage technologies, business continuity, service delivery, and more. Come learn from Microsoft experts across Windows Server, System Center, and Microsoft Azure about how you can apply these technologies to bring your datacenter into the modern era!

 

Thanks!
Subhasish Bhattacharya                                                                                                               
Program Manager
Clustering & High Availability                                                                                      
Microsoft   

Cluster Shared Volume Performance Counters


This is the third blog post in a series about Cluster Shared Volumes (CSV). In this post we will go over performance monitoring. We assume that the reader is familiar with the previous blog posts. The first post, http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx, explains the CSV components and the different CSV IO modes. The second post, http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx, explains tools that help you understand why a CSV volume uses one or another mode for IO.

Performance Counters

Now let’s look at the various performance counters which you can leverage to monitor what is happening with a CSV volume.

Physical Disks Performance Counters

These performance counters are not CSV specific. You can find “Physical Disk” performance counters on every node where disk is physically connected.

There are a large number of good articles that describe how to use the Physical Disk performance counters (for instance http://blogs.technet.com/b/askcore/archive/2012/03/16/windows-performance-monitor-disk-counters-explained.aspx), so I am not going to spend much time explaining them. The most important consideration when looking at counters on a CSV is to keep in mind that the values of the counters are not aggregated across nodes. For instance, if one node tells you that “Avg. Disk Queue Length” is 10, and another tells you 5, then the actual queue length on the disk is about 15.
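As a convenience, Get-Counter can sample the same counters from several nodes in one call so you can sum the per-node values yourself. A minimal sketch with placeholder node names:

# Sketch: collect the Physical Disk counters from both nodes that see the disk.
Get-Counter -ComputerName Node1, Node2 -Counter '\PhysicalDisk(*)\Avg. Disk Queue Length', '\PhysicalDisk(*)\Avg. Disk sec/Read', '\PhysicalDisk(*)\Avg. Disk sec/Write'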

 

 

SMB Client and Server Performance Counters

CSV uses SMB to redirect traffic using File System Redirected IO or Block Level Redirected IO to the Coordinating node. Consequently SMB performance counters might be a valuable source of insight.

On the non-coordinating node you would want to use SMB Client Shares performance counters. Following blog post explains how to read these counters http://blogs.technet.com/b/josebda/archive/2012/11/19/windows-server-2012-file-server-tip-new-per-share-smb-client-performance-counters-provide-great-insight.aspx .

 

 

On the coordinating node you can use SMB Server Shares performance counters that work in the similar way, and would allow you to monitor all the traffic that comes to the Coordinating node on the given share.

 

To map the CSV volume to the hidden SMB share that CSV uses to redirect traffic you can run following command to find CSV volume ID:

PS C:\Windows\system32> Get-ClusterSharedVolume | fl *

Name             : Cluster Disk 1
State            : Online
OwnerNode        : clus01
SharedVolumeInfo : {C:\ClusterStorage\Volume1}
Id               : 6861be1f-bf50-4bdb-941d-0a2dd2a46711

 

CSV volume ID is also used as the SMB share name. As we’ve discussed in the previous post to get list of CSV hidden shares you can use Get-SmbShare. Starting from Windows Server 2012 R2 you also need to add -SmbInstance CSV parameter to that cmdlet to see CSV internal hidden shares. Here is an example:

PS C:\Windows\system32> Get-SmbShare -IncludeHidden -SmbInstance CSV

Name                          ScopeName      Path                          Description
----                          ---------      ----                          -----------
6861be1f-bf50-4bdb-941d-0a... *              \\?\GLOBALROOT\Device\Hard...
CSV$                          *              C:\ClusterStorage

 

All File System Redirected IO will be sent to the share named 6861be1f-bf50-4bdb-941d-0a2dd2a46711 on the Coordinating node. All the Block Level Redirected IO will be sent to the share CSV$ on the Coordinating node.

If you are using RDMA, you can use the SMB Direct Connection performance counters. For instance, if you are wondering whether RDMA is being used, you can simply look at these performance counters on the client and on the server.

If you are using Scale Out File Server then SMB performance counters also will be helpful to monitor IO that comes from the clients to the SOFS and CSVFS.

Cluster CSV File System Performance Counters

CSVFS provides a large number of performance counters. Logically we can split these counters into 4 categories

  • Redirected: All counters that start with the prefix “Redirected” help you monitor whether IO is forwarded using File System Redirected IO and its performance. Please note that these counters do NOT include the IO that is forwarded using Block Redirected IO. These counters are based on measuring the time from when IO was sent to SMB (if we are on a non-Coordinating node) or to NTFS (if we are on the Coordinating node) until this component completed the IO. It does not include the time this IO spent inside CSVFS. If we are on a non-Coordinating node then the values you observe through these counters should be very close to the corresponding values you would see using the SMB Client Share performance counters.
  • IO:  All counters that start with the prefix “IO” help you monitor whether IO is forwarded using Direct IO or Block Level Redirected IO and its performance. These counters are based on measuring the time from when IO was sent to the CSV Volume Manager until this component completed the IO. It does not include the time this IO spent inside CSVFS. If the CSV Volume Manager does not forward any IO using the Block Level Redirected IO path, but all IO is dispatched using Direct IO, then the values you will observe using these counters will be very close to what you would see using the corresponding Physical Disk performance counters on this node.
  • Volume:  All counters that start with the prefix “Volume” help you monitor the current CSVFS state and its history.
  • Latency:  All other counters help you monitor how long IO took in CSVFS. This time includes how long the IO spent inside CSVFS waiting for its turn, as well as how long CSVFS was waiting for its completion from the underlying components. If IO is paused/resumed during failure handling then this time is also included.

This diagram demonstrates what is measured by the CSVFS performance counters on non-Coordinating node.

 

As we've discussed before, on a non-Coordinating node File System Redirected IO and Block Redirected IO go to the SMB client. On the Coordinating node you will see a similar picture, except that File System Redirected IO will be sent directly to NTFS, and we would never use Block Level Redirected IO.

 

 

Counters Reference

In this section we will step through each counter and go into detail on specific counters.  If you find this section too tedious to read then do not worry. You can skip over it and go directly to the scenario. This chapter will work for you as a reference.

Now let’s go through the performance counters in each of these groups, starting from the counters with the prefix IO. I want to remind you again that this group of counters tells you only about IOs that are sent to the CSV Volume Manager, and does NOT tell you how long IO spent inside CSVFS. It only measures how long it took for the CSV Volume Manager and all components below it to complete the IO. For a disk it is unusual to see some IO go Direct IO while other IO goes Block Redirected IO. The CSV Volume Manager always prefers Direct IO, and uses Block Redirected IO only if the disk is not connected or if the disk completes IO with an error. Normally either all IO is sent using Direct IO or all of it using Block Redirected IO. If you see a mix, that might mean something is wrong with the path to the disk from this node.

 IO

  • IO Write Avg. Queue Length
  • IO Read Avg. Queue Length

These two counters tell how many outstanding reads and writes we have on average per second. If we assume that all IOs are dispatched using Direct IO then value of this counter will be approximately equal to PhysicalDisk\Avg.Disk Write Queue Length and PhysicalDisk\Avg.Disk Read Queue Length accordingly. If IO is sent using Block Level Redirected IO then this counter will be reflecting SMB latency, which you can monitor using SMB Client Shares\Avg. Write Queue Length and SMB Client Shares\Avg. Read Queue Length.

  • IO Write Queue Length
  • IO Read Queue Length

These two counters tell how many outstanding reads and writes we have at the moment of the sample. A reader might wonder why we need the average queue length as well as the current queue length, and when to look at one versus the other. The only shortcoming of the average counter is that it is updated on IO completion. Let’s assume you are using perfmon, which by default samples performance counters every second. If you have an IO that is taking 1 minute, then for that minute the average queue length will be 0, and once the IO completes, for one second it will become 60; the queue length counter, on the other hand, tells you the length of the IO queue at the moment of the sample, so it will report 1 for all 60 samples during the minute this IO is in progress. On the other hand, if IOs are completing very fast (microseconds), then there is a high chance that at the moment of the sample the IO queue length will be 0 because we just happened to sample at a time when there were no IOs. In that case the average queue length is much more meaningful. Reads and writes usually complete in microseconds or milliseconds, so in the majority of cases you want to look at the average queue length.

  • IO Writes/sec
  • IO Reads/sec

Tells you how many read/write operations on average have completed in the past second. When all IO is sent using Direct IO this value should be very close to PhysicalDisk\Disk Reads/sec and PhysicalDisk\Disk Writes/sec respectively. If IO is sent using Block Level Redirected IO then the counter values should be close to SMB Client Shares\Write Requests/sec and SMB Client Shares\Read Requests/sec.

  • IO Writes
  • IO Reads

Tells you how many read/write operations have completed since the volume was mounted. Keep in mind that values of these counters will be reset every time file system dismounts and mounts again for instance when you offline/online corresponding cluster disk resource.

  • IO Write Latency
  • IO Read Latency

Tells you how many seconds read/write operations take on average.  If you see a value 0.003 that means IO takes 3 milliseconds. When all IO are sent using Direct IO then this value should be very close to PhysicalDisk\Avg.Disk sec/Read and PhysicalDisk\Avg.Disk sec/Write accordingly. If IO are sent using Block Level Redirected IO then counters value should be close to Smb Client Share\Avg.sec/Write and Smb Client Share\Avg.sec/Read.

  • IO Read Bytes/sec
  • IO Read Bytes
  • IO Write Bytes/sec
  • IO Write Bytes

These counters are similar to the counters above, except that instead of telling you # of read and write operations (a.k.a IOPS) they are telling throughput measured in bytes.

  • IO Split Reads/sec
  • IO Split Reads
  • IO Split Writes/sec
  • IO Split Writes

These counters help you monitor how often CSVFS needs to split an IO into multiple IOs due to disk fragmentation, when contiguous file offsets map to disjoint blocks on the disk. You can reduce fragmentation by running defrag (see the sketch below). Please remember that to run defrag on CSVFS you need to put the volume into File System Redirected mode so that CSVFS does not disable block moves on NTFS. There is no straight answer to whether a particular value of the counter is bad. Remember that this counter does not tell you how much fragmentation is on the disk; it only tells you how much fragmentation is being hit by ongoing IOs. For instance, you might have all IOs going to a couple of locations on the disk that happen to be fragmented, while the rest of the volume is not fragmented. Should you worry then? It depends… if you are using SSDs then it might not matter, but if you are using HDDs then running defrag might improve throughput by making IO more sequential. Another common reason to run defrag is to consolidate free space so it can be trimmed. This is particularly important with SSDs or thinly provisioned disks. The CSVFS IO Split performance counters will not help with monitoring free space fragmentation.
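As mentioned above, defrag needs the volume to be in File System Redirected mode first. One hedged way to script this is sketched below, using the redirected access switch on the cluster disk resource; the resource name and mount point are placeholders:

# Sketch: switch the CSV to redirected access, defragment it, then resume normal (direct) IO.
Suspend-ClusterResource -Name "Cluster Disk 1" -RedirectedAccess
defrag.exe C:\ClusterStorage\Volume1
Resume-ClusterResource -Name "Cluster Disk 1"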

  • IO Single Reads/sec
  • IO Single Reads
  • IO Single Writes/sec
  • IO Single Writes

The last set of counters in this group tells you how many IOs were dispatched without needing to be split. It is the complement of the corresponding “IO Split” counters, and is not that interesting for performance monitoring.

Redirected

Next we will go through the performance counters with prefix Redirected. I want to remind you that this group of counters tells you only about IOs that are sent to NTFS directly (on Coordinating node) or over SMB (from a non-Coordinating node), and does NOT tell you how long IO spent inside CSVFS, but only measures how long it took for SMB/NTFS and all components below it to complete the IO.

 

  • Redirected Writes Avg. Queue Length
  • Redirected Reads Avg. Queue Length

These two counters tell how many outstanding reads and writes do we have on average per second. This counter will be reflecting SMB latency, which you can monitor using SMB Client Shares\Avg. Write Queue Length and SMB Client Shares\Avg. Read Queue Length.

  • Redirected Write Queue Length
  • Redirected Read Queue Length

These two counters tell how many outstanding reads and writes we have at the moment of sample. Please read comments for the IO Write Queue Length and IO Read Queue Length counters if you are wondering when you should look at average queue length versus the current queue length.

  • Redirected Write Latency
  • Redirected Read Latency

Tells you how many milliseconds read/write operations take on average.  Counters value should be close to Smb Client Share\Avg.sec/Write and Smb Client Share\Avg.sec/Read.

  • Redirected Read Bytes/sec
  • Redirected Read Bytes
  • Redirected Reads/sec
  • Redirected Reads
  • Redirected Write Bytes/sec
  • Redirected Write Bytes
  • Redirected Writes/sec
  • Redirected Writes

These counters will help you to monitor IOPS and throughput, and do not require much explaining. Note that when CSVFS sends IO using FS Redirected IO it will never split IO on a fragmented files because it forwards IO to NTFS, and NTFS will perform translation of file offsets to the volume offsets and will split IO into multiple if required due to file fragmentation.

Volume

Next we will go through the performance counters with prefix Volume. For all the counters in this group please keep in mind that values start fresh from 0 every time CSVFS mounts. For instance offlining and onlining back corresponding cluster physical disk resource will reset the counters.

  • Volume State
    • Tells current CSVFS volume state. Volume might be in one of the following states.
      • 0 - Init state. In that state all files are invalidated and all IOs except volume IOs are failing.
      • 1 - Paused state. In this state volume will pause any new IO and down-level state is cleaned.
      • 2 - Draining state. In this state volume will pause any new IO, but down-level files are still opened and some down-level IOs might be still in process.
      • 3 - Set Down Level state. In this state volume will pause any new IO. The down-level state is already reapplied.
      • 4 - Active state. In this state all IO are proceeding as normal.
    • Down-level in the state descriptions above refers to the state that CSVFS has on NTFS. Examples of that state would be files opened by CSVFS on NTFS, byte range locks, file delete dispositions, oplock states, etc.
  • Volume Pause Count – Total
    • Number of times the volume was paused. This includes the number of times the volume was paused because the user told the cluster to move the corresponding Physical Disk resource from one cluster node to another, or because the customer turned volume redirection on and off.
  • Volume Pause Count - Other
  • Volume Pause Count - Network
  • Volume Pause Count – Disk
    • Number of times this node experienced a network, disk or some other failure that caused CSVFS to pause all IO on the volume and go through the recovery cycle.

Latency

And here comes the last set of performance counters from the CSVFS group. Counters in this group do not have a designated prefix. These counters measure IO at the time when it arrives at CSVFS, and include all the time the IO spent at any layer inside or below CSVFS.

  • Write Queue Length
  • Read Queue Length

These two counters tell how many outstanding reads and writes we have at the moment of sample.

  • Write Latency
  • Read Latency

These counters tell you how much time on average passes since IO has arrived to CSVFS before CSVFS completes this IO. It includes the time IO spends at any layer below CSVFS. Consequently it includes IO Write Latency, IO Read Latency, Redirected Write Latency and Redirected Read Latency depending on the type of IO and how IO was dispatched by the CSVFS.

  • Writes/sec
  • Reads/sec
  • Writes
  • Reads

These counters will help you monitor IOPS and throughput, and hopefully do not require much explaining.

  • Flushes/sec
  • Flushes

These counters tell you how many flushes come to CSVFS on all the file objects that are opened

  • Files Opened

Tells how many files are currently opened on this CSV volume

  • Files Invalidated - Other
  • Files Invalidated - During Resume

CSVFS provides fault tolerance and attempts to hide various failures from applications, but in some cases it might need to indicate that recovery was not successful. It does that by invalidating the application’s file open and by failing all IOs issued on that open. These two counters allow you to see how many file opens were invalidated. Please note that invalidating an open does not do anything bad to the file on the disk. It simply means that the application will see an IO failure and will need to reopen the file and reissue those IOs.

  • Create File/sec
  • Create File

Allows you to monitor how many file opens are happening on the volume.

  • Metadata IO/sec
  • Metadata IO

This is a catch-all for all other operations that are not covered by any of the counters above. These counters are incremented when you query or set file information or issue an FSCTL on a file.

Performance Counter Relationships

To better understand the relationship between the different groups of CSVFS performance counters, let’s go through the lifetime of a hypothetical non-cached write operation.

  1. A non-cached write comes to CSVFS
    1. CSVFS increments Write Queue Length
    2. CSVFS remembers the timestamp when IO has arrived.
  2. Let’s assume CSVFS decides that it can perform Direct IO on the file and it dispatches the IO to the CSV Volume Manager.
    1. CSVFS increments IO Write Queue Length
    2. CSVFS remembers the timestamp when IO was forwarded to the volume manager
  3. Let’s assume IO fails because something has happened to the disk and CSV Volume Manager is not able to deal with that. CSVFS will pause this IO and will go through recovery.
    1. CSVFS decrements IO Write Queue Length
    2. CSVFS takes the timestamp of completion and subtracts the timestamp we took in step 2.ii. This tells us how long the CSV Volume Manager took to complete this IO. Using this value CSVFS updates IO Write Avg. Queue Length and IO Write Latency
    3. CSVFS increments the IO Writes and IO Writes/sec counters
    4. Depending on whether this write had to be split due to file fragmentation, CSVFS increments either IO Single Writes and IO Single Writes/sec or IO Split Writes and IO Split Writes/sec.
  4. Once CSVFS recovered it will reissue the paused write. Let’s assume that this time CSVFS finds that it has to dispatch IO using File System Redirected IO
    1. CSVFS increments Redirected Write Queue Length
    2. CSVFS remembers the timestamp when IO was forwarded to NTFS directly or over SMB
  5. Let’s assume SMB completes write successfully
    1. CSVFS decrements Redirected Write Queue Length
    2. CSVFS takes the timestamp of completion and subtracts the timestamp we took in step 4.ii. This tells us how long SMB and NTFS took to complete this IO. Using this value CSVFS updates Redirected Writes Avg. Queue Length and Redirected Write Latency
    3. CSVFS increments the Redirected Writes and Redirected Writes/sec counters.
  6. If necessary CSVFS will do any post-processing after the IO completion and finally will complete the IO
    1. CSVFS decrements Write Queue Length
    2. CSVFS takes the timestamp of completion and subtracts the timestamp we took in step 1.ii. This tells us how long CSVFS took to complete this IO. Using this value CSVFS updates Write Latency
    3. CSVFS increments the Writes and Writes/sec counters.

Note the color highlighting in the sample above. It emphasizes the relationship of the performance counter groups. We can also describe this scenario using the following diagram, where you can see how the time ranges are included

 

A CSV volume pause is a very rare event and very few IOs run into it. For the majority of IOs the timeline will look one of the following ways

 

In these cases Read Latency and Write Latency will be the same as IO Read Latency, IO Write Latency, Redirected Read Latency and Redirected Write Latency depending on the IO type and how it was dispatched.

In some cases the file system might hold an IO. For instance, if an IO extends a file then it will be serialized with other extending IOs on the same file. In that case the timeline might look like this

 

The important point here is that Read Latency and Write Latency are expected to be slightly larger than its IO* and Redirected* partner counters.

 

Cluster CSV Volume Manager

One thing that we cannot tell using the CSVFS performance counters is whether IO was sent using Direct IO or Block Level Redirected IO. This is because CSVFS does not have that information; it happens at a lower layer, and only the CSV Volume Manager knows. You can get visibility into what is going on in the CSV Volume Manager using the Cluster CSV Volume Manager counter set.

Performance counters in this group can be split into 2 categories. The first category has Redirected in its name and the second does not. All the counters that do NOT have Redirected in the name describe what is going on with Direct IO, and all the counters that do describe Block Level Redirected IO. Most of the counters are self-explanatory; two require some explanation. If the disk is connected then the CSV Volume Manager always first attempts to send IO directly to the disk. If the disk fails the IO, the CSV Volume Manager will increment Direct IO Failure Redirection and Direct IO Failure Redirection/sec and will retry this IO using the Block Level Redirected IO path over SMB. So these two counters help you tell whether IO is redirected because the disk is not physically connected or because the disk is failing IO for some reason.

 

Common Scenarios

In this section we will go over the common scenario/question that can be answered using performance counters.

Disclaimer:  Please do not read much into the actual values of the counters in the samples below because samples were taken on test machines that are backed by extremely slow storage and on the machines with bunch of debugging features enabled. These samples are here to help you to understand relationship between counters.

Is Direct IO happening?

Simply check IOPS and throughput using following CSV Volume Manager counters

  • \Cluster CSV Volume Manager(*)\IO Reads/sec
  • \Cluster CSV Volume Manager(*)\IO Writes/sec
  • \Cluster CSV Volume Manager(*)\IO Read-Bytes/sec
  • \Cluster CSV Volume Manager(*)\IO Write-Bytes/sec

Values should be approximately equal to the load that you are placing on the volume.

You can also verify that no unexpected redirected IO is happening by checking IOPs and throughput on the CSV File System Redirected IO path using CSVFS performance counters

  • \Cluster CSV File System(*)\Redirected Reads/sec
  • \Cluster CSV File System(*)\Redirected Writes/sec
  • \Cluster CSV File System(*)\Redirected Read Bytes/sec
  • \Cluster CSV File System(*)\Redirected Write Bytes/sec

and CSV Volume Redirected IO path

  • \Cluster CSV Volume Manager(*)\IO Reads/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Writes/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Read-Bytes/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Write-Bytes/sec – Redirected

Above is an example of how Performance Monitor would look on the Coordinator node when all IO is going via Direct IO. You can see that the \Cluster CSV File System(*)\Redirected* counters are all 0, and the \Cluster CSV Volume Manager(*)\* - Redirected performance counters are also 0. This tells us that there is no File System Redirected IO or Block Level Redirected IO.
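If you prefer to spot-check this from PowerShell rather than Performance Monitor, a small sketch using the counter paths listed above (instance names will differ per deployment):

# Sketch: if these stay at or near zero while the workload runs, no redirected IO is taking place.
Get-Counter -Counter '\Cluster CSV File System(*)\Redirected Reads/sec', '\Cluster CSV File System(*)\Redirected Writes/sec', '\Cluster CSV Volume Manager(*)\IO Reads/sec - Redirected', '\Cluster CSV Volume Manager(*)\IO Writes/sec - Redirected' -SampleInterval 1 -MaxSamples 5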

What is total IOPS and throughput?

You can check how much overall IO is going through a CSV volume using the following CSVFS performance counters

  • \Cluster CSV File System(*)\Reads/sec
  • \Cluster CSV File System(*)\Writes/sec

Value of the counters above will be equal to the sum of IO going to the File System Redirected IO path

  • \Cluster CSV File System(*)\Redirected Reads/sec
  • \Cluster CSV File System(*)\Redirected Writes/sec
  • \Cluster CSV File System(*)\Redirected Read Bytes/sec
  • \Cluster CSV File System(*)\Redirected Write Bytes/sec

and IO going to the CSV Volume Manager, which volume manager might dispatch using Direct IO or Block Redirected IO

  • \Cluster CSV File System(*)\IO Reads/sec
  • \Cluster CSV File System(*)\IO Writes/sec
  • \Cluster CSV File System(*)\IO Read Bytes/sec
  • \Cluster CSV File System(*)\IO Write Bytes/sec

What is Direct IO IOPs and throughput?

You can check how much IO is sent by the CSV volume manager directly to the disk connected on the cluster node using following performance counters

  • \Cluster CSV Volume Manager(*)\IO Reads/sec
  • \Cluster CSV Volume Manager(*)\IO Writes/sec
  • \Cluster CSV Volume Manager(*)\IO Read-Bytes/sec
  • \Cluster CSV Volume Manager(*)\IO Write-Bytes/sec

Value of these counters will be approximately equal to the values of the following performance counters of the corresponding physical disk

  • \PhysicalDisk(*)\Disk Reads/sec
  • \PhysicalDisk(*)\Disk Writes/sec
  • \PhysicalDisk(*)\Disk Read Bytes/sec
  • \PhysicalDisk(*)\Disk Write Bytes/sec

What is File System Redirected IOPS and throughput?

You can check how much IO CSVFS sends using the File System Redirected IO path with the following performance counters

  • \Cluster CSV File System(*)\Redirected Reads/sec
  • \Cluster CSV File System(*)\Redirected Writes/sec
  • \Cluster CSV File System(*)\Redirected Read Bytes/sec
  • \Cluster CSV File System(*)\Redirected Write Bytes/sec

 

The picture above shows what you would see on the coordinating node if you put Volume1 into File System Redirected mode. You can see that only the \Cluster CSV File System(*)\Redirected* counters are changing while the \Cluster CSV File System(*)\IO* counters are all 0. Since File System Redirected IO does not go through the CSV Volume Manager, its counters stay 0. File System Redirected IO goes to NTFS, and NTFS sends this IO to the disk, so you can see the Physical Disk counters match the CSV File System Redirected IO counters.

What is Block Level Redirected IOPs and throughput?

You can check how much IO the CSV Volume Manager dispatches to the Coordinating node over SMB using the following performance counters.

  • \Cluster CSV Volume Manager(*)\IO Reads/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Writes/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Read-Bytes/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Write-Bytes/sec – Redirected

Please note that since on the Coordinating node disk is always present CSV will always use Direct IO, and will never use Block Redirected IO so on the Coordinating node values of these counters should stay 0.

What is average Direct IO and Block Level Redirected IO latency?

To find out Direct IO latency you need to look at the counters

  • \Cluster CSV File System(*)\IO Read Latency
  • \Cluster CSV File System(*)\IO Write Latency

To understand where this latency is coming from you need to first look at the following CSV Volume Manager performance counters to see if IO is going Direct IO or Block Redirected IO

  • \Cluster CSV Volume Manager(*)\IO Reads/sec
  • \Cluster CSV Volume Manager(*)\IO Writes/sec
  • \Cluster CSV Volume Manager(*)\IO Reads/sec - Redirected
  • \Cluster CSV Volume Manager(*)\IO Writes/sec – Redirected

If IO goes to Direct IO then next compare CSVFS latency to the latency reported by the disk

  • \PhysicalDisk(*)\Avg. Disk sec/Read
  • \PhysicalDisk(*)\Avg. Disk sec/Write

If IO goes via Block Level Redirected IO and you are on the Coordinator node, then you still need to look at the Physical Disk performance counters. If you are on a non-Coordinator node, then look at the latency reported by SMB on the CSV$ share using the following counters

  • \SMB Client Shares(*)\Avg sec/Write
  • \SMB Client Shares(*)\Avg sec/Read

Then compare SMB client latency to the Physical disk latency on the coordinator node.

Below you can see a sample where all IO goes Direct IO path and latency reported by the physical disk matches to the latency reported by the CSVFS, which means the disk is the only source of the latency and CSVFS does not add on top of it.

 

What is File System Redirected IO latency?

To find out File System Redirect IO latency you need to look at the counters

  • \Cluster CSV File System(*)\Redirected Read Latency
  • \Cluster CSV File System(*)\Redirected Write Latency

To find out where this latency is coming from on coordinator node compare it to the latency reported by the physical disk.

  • \PhysicalDisk(*)\Avg. Disk sec/Read
  • \PhysicalDisk(*)\Avg. Disk sec/Write

If you see that the latency reported by the physical disk is much lower than what CSVFS reports, then one of the components located between CSVFS and the disk is enqueuing/serializing the IO.

 

Above you can see an example comparing the latency reported by CSVFS with the latency reported by the physical disk:

  • \Cluster CSV File System(*)\Write Latency is 19 milliseconds
  • \Cluster CSV File System(*)\Redirected Write Latency is 19 milliseconds
  • \PhysicalDisk(*)\Avg. Disk sec/Write is 18 milliseconds

 

  • \Cluster CSV File System(*)\Read Latency is 24 milliseconds
  • \Cluster CSV File System(*)\Redirected Read Latency is 23 milliseconds
  • \PhysicalDisk(*)\Avg. Disk sec/Read is 23 milliseconds

Given statistical error, and that the sampling of different counters is not synchronized, we can ignore the 1 millisecond that CSVFS adds and say that most of the latency comes from the physical disk.

If you are on non-Coordinating node then you need to look at SMB Client Share performance counters for the volume share

  • \SMB Client Shares(*)\Avg sec/Write
  • \SMB Client Shares(*)\Avg sec/Read

After that look at the latency reported by the physical disk on the Coordinator node to see how much latency is coming from SMB itself

 

In the sample above you can see

  • \Cluster CSV File System(*)\Write Latency is 18 milliseconds
  • \Cluster CSV File System(*)\Redirected Write Latency is 18 milliseconds
  • \SMB Client Shares(*)\Avg sec/Write is 18 milliseconds

 

  • \Cluster CSV File System(*)\Read Latency is 23 milliseconds
  • \Cluster CSV File System(*)\Redirected Read Latency is 23 milliseconds
  • \SMB Client Shares(*)\Avg sec/Read is 22 milliseconds

In the sample before we’ve seen that disk read latency is 23 milliseconds and write latency is 18 milliseconds so we can conclude that disk is the biggest source of latency.

Is my disk the bottleneck?

To answer this question you need to look at the sum/average of the following performance counters across all cluster nodes that perform IO on this disk. Each node’s counters will tell you how much IO is done by this node, and you will need to do the math to find out the aggregate values.

  • \PhysicalDisk(*)\Avg. Disk Read Queue Length
  • \PhysicalDisk(*)\Avg. Disk Write Queue Length
  • \PhysicalDisk(*)\Avg. Disk sec/Read
  • \PhysicalDisk(*)\Avg. Disk sec/Write

You can experiment with different queue lengths by changing the load on the disk and checking against your target. There is really no right or wrong answer here; it all depends on what your application’s expectations are.

 

In the sample above you can see that total IO queue length on the disk is about (8.951+8.538+10.336+10.5) 38.3, and average latency is about ((0.153+0.146+0.121+0.116)/4) 134 milliseconds.

Please note that physical disk number in this sample happens to be the same – 7 on both cluster nodes. You should not assume it will be the same. On the coordinator node you can find it using Cluster Administrator UI by looking at the Disk Number column.

 

Unfortunately there are no great tools to find it on a non-Coordinator node; the Disk Management MMC snap-in is the best option.

To find physical disk number on all cluster nodes you can move the Cluster Disk from node to node writing down Disk Number on each node, but be careful especially when you have actual workload running because while moving volume CSV will pause all IOs, which will impact your workload throughput.

Is my network the bottleneck?

When you are looking at the cluster networks keep in mind that cluster splits networks into several categories and each type of traffic uses only some of the categories. You can read more about that in the following blog post http://blogs.msdn.com/b/clustering/archive/2011/06/17/10176338.aspx .

If you know the network bandwidth you can always do the math to verify that it is large enough to handle your load. But you should verify it empirically before you put your cluster into production.  I would suggest the following steps:

  1. Online disk on the cluster node and stress/saturate your disks by putting lots of non-cached IO on the disk. Monitor IOPS, and MBPS on the disk using PhysicalDisk performance counters
    1. \PhysicalDisk(*)\Disk Reads/sec
    2. \PhysicalDisk(*)\Disk Writes/sec
    3. \PhysicalDisk(*)\Disk Read Bytes/sec
    4. \PhysicalDisk(*)\Disk Write Bytes/sec
  2. From another cluster node run the same test over SMB, and now monitor SMB Client Share performance counters

    1. \SMB Client Shares(*)\Write Bytes/Sec
    2. \SMB Client Shares(*)\Read Bytes/Sec
    3. \SMB Client Shares(*)\Writes/Sec
    4. \SMB Client Shares(*)\Reads/Sec

If you see that you are getting the same IOPS and MBPS then your bottleneck is the disk, and network is fine.
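A hedged sketch of capturing both sides of that comparison with Get-Counter; node names are placeholders, and the counters are the same ones listed in the steps above:

# Sketch: disk-side view on the node that owns the disk.
Get-Counter -ComputerName Node1 -Counter '\PhysicalDisk(*)\Disk Read Bytes/sec', '\PhysicalDisk(*)\Disk Write Bytes/sec'
# SMB client view from the node driving the same load over the network.
Get-Counter -ComputerName Node2 -Counter '\SMB Client Shares(*)\Read Bytes/Sec', '\SMB Client Shares(*)\Write Bytes/Sec'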

If you are using a Scale-Out File Server (SOFS) you can use a similar approach to verify that the network is not the bottleneck, but in this case instead of the PhysicalDisk counters use the Cluster CSV File System performance counters

  • \Cluster CSV File System(*)\Reads/sec
  • \Cluster CSV File System(*)\Writes/sec

When using RDMA it is a good idea to verify that RDMA is actually working by looking at the \SMB Direct Connection(*)\* family of counters.

Performance Counters Summary

We went over the 4 counter sets that are most useful when investigating CSVFS performance. Using PowerShell cmdlets we are able to see if Direct IO is possible. Using performance counters we can verify that IO is indeed going according to our expectations, and by looking at counters at different layers we can find where the bottleneck is.

Remember that none of the performance counters that we talked about is aggregated across multiple cluster nodes, each of them provides one node view.

If you want to automate the collection of performance counters from multiple nodes, consider using this simple script http://blogs.msdn.com/b/clustering/archive/2009/10/30/9915526.aspx that is just a convenience wrapper around logman.exe. It is described in the following blog post http://blogs.msdn.com/b/clustering/archive/2009/11/10/9919999.aspx .

Thanks!
Vladimir Petter
Principal Software Development Engineer
Clustering & High-Availability
Microsoft

 

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Planning Failover Cluster Node Sizing


In this blog I will discuss considerations on planning the number of nodes in a Windows Server Failover Cluster.

Starting with Windows Server 2012, Failover Clustering can support up to 64 nodes in a single cluster, making it industry leading in scale for a private cloud.  While this is exciting, the reality is that it is probably bigger than the average person cares to deploy.  There is also no limitation on cluster sizes in different versions of Windows Server (Standard vs. Datacenter, etc…).  Since there is no practical limitation on scale for the average IT admin, how many nodes should you deploy in your cluster?  The primary consideration comes down to defining a fault domain.  Let’s discuss the considerations…

Resiliency to Hardware Faults

When thinking about fault domains, hardware resiliency is one of the biggest considerations.  Be it chassis, rack, or datacenter.  Let’s start with blades as an example; you probably don’t want a chassis to be a single point of failure.  To mitigate a chassis failure you probably want to span across multiple chassis.  If you have eight blades per chassis, it would be desired for your nodes to reside across two different chassis for resiliency, so you create a 16-node cluster with eight nodes in each chassis.  Or maybe you want to have rack resiliency, in that case create a cluster out of nodes that span multiple racks.  The number of nodes in the cluster will be influenced by how many servers you have in the rack.  If you want your cluster to achieve disaster recovery in addition to high availability, you will have nodes in the cluster which will span across datacenters.  Defining fault domains can protect you from hardware class failures.

Multi-site Clusters

To expand upon the previous topic a little… when thinking about disaster recovery scenarios and having a Failover Cluster that can achieve not only high availability, but also disaster recovery you may span clusters across physical locations.  Generally speaking, local failover is less expensive than site failover.  Meaning that on site failover data replication needs to flip, IP’s may switch to different subnets, and failover times may be longer.  In fact, switching over to another site may require IT leadership approval.  When deploying a multi-site cluster it is recommended to scale up the number of nodes so that there are 2+ nodes in each site.  The goal is that when there is a server failure, there is fast failover to a site local node.  Then when there is a catastrophic site failure, services failover to the disaster recovery site.  Defining multiple nodes per fault domain can give you better service level agreements.

All your Eggs in One Basket

There are no technical limitations which make one cluster size better than another.  While we hope that there is never a massive failure which causes an entire cluster to go down, some might point out that they have seen it happen… so there’s a matter of how many eggs do you want in one basket?  By breaking up your clusters you can have multiple fault domains, so in the event of losing an entire cluster the impact is also mitigated.  So let’s say you have 1,000 VMs… if you have a single 32-node cluster, and the entire cluster goes down, that means all 1,000 VMs go down.  Whereas if you had them broken into two 16-node clusters, then only 500 VMs (half) go down.  Defining fault domains can protect you from cluster class failures.

Flexibility with a Larger Pool of Nodes

System Center Virtual Machine Manager has a feature called Dynamic Optimization which analyzes the load across the nodes and moves VMs around to load balance the cluster.  The larger the cluster, the more nodes Dynamic Optimization has to work with and the better balancing it can achieve.  So while creating multiple smaller clusters may divide up multiple fault domains, creating clusters that are too small can increase the management overhead and keep them from being utilized optimally.  Defining a larger cluster creates finer granularity to spread and move load across.

Greater Resiliency to Failures

The more nodes you have in your cluster, the less impactful losing each node becomes.  So let's say you create a bunch of little 2-node clusters: if you were to lose 2 nodes, all the VMs go down.  With a 4-node cluster, you can lose 2 servers and the cluster stays up and keeps running.  Again, this ties back to the hardware fault domains discussion.

Another aspect is that when a node fails, the more surviving nodes you have, the more hosts there are to distribute the load across.  So let's say you have 2 nodes: if you lose 1 node, the surviving node is now running at 200% capacity (everything it was running before, plus everything from the failed node).  If you scale up the number of nodes, the VMs can be spread across more hosts, and the loss of an individual node is less impactful.  If you have a 3-node cluster and lose a node, each surviving node operates at 150% capacity.
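
To make the math concrete, here is a quick back-of-the-envelope sketch in PowerShell (illustrative only, and it assumes the cluster was evenly loaded at 100% of one node's capacity before the failure):

# Per-node load on the survivors after losing one node.
foreach ($nodes in 2, 3, 4, 8, 16, 32) {
    $survivorLoad = [math]::Round(($nodes / ($nodes - 1)) * 100)
    "{0,2}-node cluster, 1 node lost: each survivor runs at ~{1}%" -f $nodes, $survivorLoad
}

The 2-node and 3-node results line up with the 200% and 150% figures above; by the time you get to 16 or 32 nodes, a single node failure only adds a few percent of load per surviving node.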

Another way to think of it, is how much stress do you want to put on yourself?  If you have a 2-node cluster, and you lose a node… you will probably have a fire drill to urgently get that node fixed.  Where if you lose a node in a 32-node cluster… you might be ok finishing your round of golf before you worry about it.  Increasing scale can protect you from larger numbers of failures and makes an individual failure less impactful.

Diagnosability

Troubleshooting a large cluster may at times be more difficult than troubleshooting smaller clusters.  Say for example you have a problem on your 64-node cluster; that may involve pulling and correlating logs across all 64 servers, which can be complex and cumbersome.  Another example is that the cluster Validation tool is a functional test tool and will take longer to run on larger clusters when things go wrong and you want to check your cluster.  Some IT admins prefer smaller fault domains when troubleshooting problems.

Workload

You also scale different types of clusters differently based on the workload they are running:

  • Hyper-V:  You want your private cloud to be one fluid system where VMs are dynamically moving around and adjusting.  Tools like SCVMM Dynamic Optimization really start to shine with larger clusters, monitoring the load of the nodes and seamlessly moving VMs around to optimize and load balance the cluster.  Hyper-V clusters are usually the biggest and may have 16, 24, 32, or more nodes.
  • Scale-out File Server:  File based storage for your applications with a SoFS should usually be 2 – 4 nodes.  For example, internal Microsoft SoFS clusters are deployed with 4-nodes.
  • Traditional File Server:  Traditional information worker File Clusters tend to also be smaller, again in the 2 – 4 node range.
  • SQL Server:  Most SQL Server clusters deployed with a failover cluster instance (FCI) are 2-node, but that has more to do with SQL Server licensing and the ability to create a 2-node FCI with SQL Server Standard edition.  The other consideration is that each SQL instance requires a drive letter, so that caps you at roughly 24 instances.  This is addressed with SQL Server 2014 support for Cluster Shared Volumes…  but generally speaking, it doesn't make much sense to deploy a 32-node SQL Server cluster.  So think smaller: 2, 4, or maybe up to 8 nodes.  A SQL cluster with an Availability Group (AG) is usually a multi-site cluster and will have more nodes than an FCI.

Conclusion

There's no right or wrong answer here on how many nodes to have in your cluster.  Many other vendors have strong recommendations to work around limitations they may have, but those don't apply to Windows Server Failover Clustering.  It's more about thinking through your fault domains, and to a large extent personal preference.  Big clusters are cool and come with serious bragging rights, but have some considerations…  little clusters seem simple, but don't really shine as well as they should…  you will likely find the right fit for you somewhere in between.

Thanks!
Elden Christensen
Principal Program Manager Lead
Clustering & High-Availability
Microsoft

Symantec ApplicationHA for Hyper-V


In previous blogs I discussed the Failover Clustering VM Monitoring feature added in Windows Server 2012. You can find those blogs here and here. Below is a guest blog describing Symantec ApplicationHA for Hyper-V, which leverages the functionality provided by VM Monitoring.

 

Hi, my name is Lorenzo Galelli, Senior Technical Product Manager at Symantec Corporation, and I'm here to talk about some exciting technologies from Symantec focused on Hyper-V and Failover Clustering. Before I jump into those new technologies I just wanted to say thanks to the Clustering and High Availability Windows team for the invite to write on their blog, thanks guys!

So let's talk about the new and exciting technologies that I am sure will make your job easier when deploying applications within Hyper-V. ApplicationHA is the first technology that I would like to showcase; it's actually been around for a couple of years providing application uptime on other hypervisors, and with the release of our new version we have added support for Hyper-V running within Windows Server 2012 and Windows Server 2012 R2. So what's so great about ApplicationHA, I hear you ask? Well, a number of things. First, it monitors your applications running within the virtual machines and automatically remediates any faults that occur by trying to restart them. Second, it integrates with Failover Clustering, specifically the heartbeat service, and leverages a common set of APIs that we can hook into to assist remediation tasks. We have also removed a lot of the headaches around configuring availability: ApplicationHA will auto-discover the majority of the application configuration, so all the admin needs to do is decide what needs monitoring, and with a couple of clicks through the configuration wizard you're all set. We also provide management and operations through a web interface and plan to have SCVMM extensibility in the coming release. So if you're virtualizing SharePoint, Exchange, SQL, IIS, SAP, or Oracle, we have a wizard to support that app along with many others, as well as support for custom or in-house applications.

Below is a diagram that explains how ApplicationHA for Hyper-V leverages the Microsoft Failover Cluster heartbeat service which Microsoft added in Windows Server 2012. ApplicationHA uses this heartbeat function to signal to Failover Clustering that a heartbeat fault has occurred if ApplicationHA is unable to restart the application within the virtual machine; ApplicationHA will attempt to remediate the fault a number of times before it communicates with the heartbeat service.
 
 
 

  1. Microsoft Failover Clustering detects issues with virtual machines if faults occur and moves the affected VM.
  2. ApplicationHA detects issues with the application under its control and attempts to restart the faulted application.
  3. In the event that ApplicationHA is unable to start the application, it signals a heartbeat fault to Failover Clustering.
  4. Failover Clustering reboots the VM, or moves the VM to another host if the application still has issues starting.

For more information on ApplicationHA 6.1 for Hyper-V be sure to check out the new whitepaper which describes in detail how ApplicationHA works with Failover Cluster.  http://www.symantec.com/connect/sites/default/files/White_Paper_Confidently_Virtualize_Business-critical_Applications_in_Microsoft_Hyper-V_with_Symantec_ApplicationHA.pdf 

For more information on Symantec ApplicationHA be sure to check out the Symantec ApplicationHA website http://www.symantec.com/application-ha

Next up is Virtual Business Service, a multi-tier application availability orchestration tool which provides the ability to link applications together and control them as a single entity. Applications can be hosted on physical as well as virtual machines, and as long as the application is protected by a Symantec availability solution like ApplicationHA or by Microsoft Failover Clustering, then you're good to go.

If you want to review this capability in more detail, I have posted a number of videos on the Symantec user group forum, Symantec Connect, which walk through the installation and configuration from start to finish.

http://www.symantec.com/connect/videos/applicationha-61-hyper-v-install-configure-and-manage-part-1

http://www.symantec.com/connect/videos/applicationha-61-hyper-v-install-configure-and-manage-part-2

http://www.symantec.com/connect/videos/applicationha-61-hyper-v-install-configure-and-manage-part-3

http://www.symantec.com/connect/videos/applicationha-61-hyper-v-install-configure-and-manage-part-4

Cluster Shared Volume Failure Handling


This is the fourth blog post in a series about Cluster Shared Volumes (CSV). In this post we will explain how CSV handles storage failures and how it hides failures from applications. This blog will build on prior knowledge with the assumption that the reader is familiar with the previous blog posts: 

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx

Which explains CSV components and different CSV IO modes.

Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Which explains tools that help you to understand why CSV volume uses one or another mode for IO.

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Which is a reference and guide to the CSV related performance counters.

Failure Handling

CSV is designed to increase availability by abstracting failures away from applications and making them resilient to failures of the network, storage, and nodes. CSV accomplishes this by virtualizing file opens. When an application opens a file on CSVFS, this open is claimed by CSVFS. CSVFS then in turn opens another handle on NTFS. When handling failures, CSVFS can reestablish its file open on NTFS while keeping the virtual handle that the application holds on CSVFS valid. To better understand that, let's walk through how a hypothetical failure might be handled. We will do that with the help of a diagram, from which we have removed many CSV components to keep the picture simple.

Let's assume that we start in the state where the disk is mounted on Node 2, and there are applications running on both nodes using files on this CSV volume.

 

Let’s take a failure scenario where Node 2 loses connectivity to the Disk.

 

For instance this might be caused by an HBA on that node going bad, or by someone unintentionally misconfiguring the LUN masking while making another change. In this example, there are many different IOs in flight at the moment of failure. For instance there might be File System Redirected or Block Redirected IO from Node 1, any IO from Node 2, or any metadata IO. Because connectivity to the storage was lost, NTFS will start failing these IOs with a status code indicating that the device object has been removed. Once CSVFS observes a failed IO it will switch the volume to the Draining state.

When CSVFS switches itself to the Draining state because it has observed a failure from Disk or SMB, we refer to that as a CSVFS "Autopause". This indicates that the volume has automatically put itself in a recovery state. By contrast, when a user invokes an action to move a physical disk resource from one cluster node to another, the CSV volume is also put in the Draining state, but because that is an explicit administrative action the volume is not considered to be in an Autopause.

In the Draining state the volume pends all new IOs and any failed IOs. Cluster will first put CSVFS for that volume into the 'Draining' state on all the nodes in the cluster. Once the state transition to Draining is complete, cluster will then tell CSVFS on all the nodes to move to the 'Paused' state. During the transition to the Paused state CSVFS will wait for all ongoing IOs to complete, and once all IOs have completed so that there are no longer any IOs in flight, it will close the underlying file opens to NTFS. Meanwhile cluster will discover that the path to the disk is gone and will dismount NTFS on Node 2.

 

Clustering has a component called the Storage Topology Manager (STM) which has a view of every node's disk connectivity. It will discover that Node 1 can see the disk, and cluster will mount NTFS on Node 1.

 

Once the mount is done, cluster will tell CSVFS to transition to the 'Set Down Level' state. During that transition CSVFS re-opens files on NTFS. Once all nodes are in the Set Down Level state, cluster tells CSVFS on all nodes to go to the 'Active' state. While transitioning to the Active state CSVFS will resume all paused IOs and will stop pending any new IOs. From this point on CSV has fully recovered from the disk failure and is back to a fully operational state.

 

Applications running on CSVFS will perceive this failure as if IOs simply took longer than usual; they will not observe the failure itself.

On the nodes where CSVFS observed the failure due to the disk disconnect and automatically put itself into the 'Draining' state (a.k.a. Autopaused) before cluster told it to do so, you will see System Event log message 5120, which looks like this:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5120
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer available on this node because of 'STATUS_VOLUME_DISMOUNTED(C000026E)'. All I/O will temporarily be queued until a path to the volume is reestablished.

If the cluster was not able to recover from the failure and had to take CSVFS down, you will also see System Event log message 5142:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5142
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
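
If you would rather check for these events from PowerShell than from Event Viewer, a query along these lines works (a minimal sketch; adjust MaxEvents, or add a StartTime key to the hashtable, as needed):

# Pull recent CSV auto-pause (5120) and CSV failure (5142) events from the System log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5120, 5142
} -MaxEvents 50 | Format-Table TimeCreated, Id, Message -Wrap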
 

Summary

CSV is a clustered file system which also helps increase availability by being resilient to underlying failures.  In this blog post we went into detail on how CSV abstracts storage failures from applications.

Thanks!
Vladimir Petter
Principal Software Development Engineer
Clustering & High-Availability
Microsoft

 

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

vNext Failover Clustering in Windows Server Technical Preview


Interested in the new features coming for Failover Clustering?  I recently stopped over at Channel 9 and did an interview where we discussed some of the big clustering and availability features in Windows Server Technical Preview.

Here's the link:
http://channel9.msdn.com/Shows/Edge/Edge-Show-125

 

 

Thanks!
Elden Christensen
Principal Program Manager Lead
Clustering & High-Availability
Microsoft 


Introducing Cloud Witness in Windows Server 2016


Available in release: Windows Server 2016

Author:
Amitabh P Tamhane
Senior Program Manager
Microsoft Corp.


Cloud Witness is a new type of Failover Cluster quorum witness being introduced in Windows Server 2016. In this blog, I intend to give a quick overview of Cloud Witness and the steps required to configure it.

Consider an example multi-site stretched Failover Cluster quorum configuration with Windows Server 2012 R2:

 

In this example configuration, there are 2 nodes in each of 2 datacenters (referred to as sites). Note that it is possible for a cluster to span more than 2 datacenters, and each datacenter can have many more than 2 nodes. A typical cluster quorum configuration in this setup (for an automatic failover SLA) gives each node a vote, plus one extra vote for the quorum witness so that the cluster can keep running even if either one of the datacenters experiences a power outage. The math is simple: there are 5 total votes, and you need 3 votes for the cluster to keep running.
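
To see how votes are actually distributed on a running cluster, you can query the node weights with the standard cluster cmdlets (a quick sketch; the DynamicWeight and WitnessDynamicWeight properties are available on Windows Server 2012 R2 and later):

# Show each node's assigned vote and its current dynamic vote, plus the witness vote.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight -AutoSize
Get-Cluster | Format-List Name, DynamicQuorum, WitnessDynamicWeight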

In case of a power outage in one datacenter, to give the cluster in the other datacenter an equal opportunity to keep running, it is recommended to host the quorum witness in a location other than the two datacenters. This typically means a 3rd separate datacenter (site) is required to host the File Server backing the File Share that is used as the quorum witness (File Share Witness).

We received feedback from our customers that most don't have a 3rd separate datacenter to host the File Server backing the File Share Witness. This means customers host the File Server in one of the two datacenters, by extension making that datacenter the primary datacenter. In a scenario where there is a power outage in the primary datacenter, the cluster would go down, as the other datacenter would only have 2 votes, which is below the quorum majority of 3 votes. For the customers that do have a 3rd separate datacenter to host the File Server, it is an overhead to maintain the highly available File Server backing the File Share Witness. Hosting VMs in a public cloud with the File Server for the File Share Witness running in the guest OS is a significant overhead in terms of both setup and maintenance.

Introducing Cloud Witness

Cloud Witness is a new type of Failover Cluster quorum witness that leverages Microsoft Azure as the arbitration point. It uses Microsoft Azure Blob Storage to read/write a blob file which is then used as an arbitration point in case of split-brain resolution.

There are significant benefits with this approach:

  1. Leverages Microsoft Azure (no need for 3rd separate datacenter)
  2. Uses standard publicly available Microsoft Azure Blob Storage (no extra maintenance overhead of VMs hosted in a public cloud)
  3. Same Microsoft Azure Storage Account can be used for multiple clusters (one blob file per cluster; cluster unique id used as blob file name)
  4. Very low on-going cost to the Storage Account (very small data written per blob file; the blob file is updated only once when the cluster nodes' state changes)
  5. Built-in Cloud Witness resource type

Multi-site stretched clusters with Cloud Witness:

 

Notice there is no 3rd separate site that is required. Cloud Witness, like any other quorum witness, gets a vote and can participate in quorum calculations.

Cloud Witness: Single Witness Type for most scenarios

If you have a Failover Cluster deployment, where all nodes can reach the internet (by extension Microsoft Azure), it is recommended to configure Cloud Witness as your quorum witness resource.

Scenarios include (there may be more scenarios beyond the list below):

  • Disaster recovery stretched multi-site clusters (example above)
  • Failover clusters without shared storage (SQL Always On, Exchange DAGs, etc.)
  • Failover clusters running inside Guest OS hosted in Microsoft Azure VM Role (or any other public cloud)
  • Failover clusters running inside Guest OS of VMs hosted in Private Clouds
  • Storage clusters with or without shared storage (Scale-out File Server clusters, etc.)
  • Small branch-office clusters (even 2-node clusters)

Starting with Windows Server 2012 R2, we recommend that you "Always configure a Witness", as the cluster automatically manages the witness vote and the node votes with Dynamic Quorum. Cloud Witness extends this recommendation to "Always configure a Cloud Witness".

For Failover Clusters that do not have access to the internet, we recommend you continue to configure a File Share Witness or Disk Witness as appropriate for your deployment configuration.

Creating Microsoft Azure Storage Account

To configure Cloud Witness, you need a valid Microsoft Azure Storage Account which will be used to store the blob file (used for arbitration). Cloud Witness creates a well-known container "msft-cloud-witness" under the Microsoft Azure Storage Account. Cloud Witness writes a single blob file, with the corresponding cluster's unique id used as the file name of the blob file, under this "msft-cloud-witness" container. This means the same Microsoft Azure Storage Account can be used to configure a Cloud Witness for multiple different clusters.

When using the same Microsoft Azure Storage Account to configure Cloud Witness for multiple different clusters, you will notice there is a single "msft-cloud-witness" container that gets automatically created, and this container will contain one blob file per cluster.
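
If you want to see what Cloud Witness actually wrote, you can list the contents of that container with the Azure PowerShell storage cmdlets (an illustrative sketch; substitute your own storage account name and key, and note that newer Az module versions use the equivalent Get-AzStorageBlob cmdlet):

# List the per-cluster blob(s) in the well-known "msft-cloud-witness" container.
$ctx = New-AzureStorageContext -StorageAccountName '<StorageAccountName>' -StorageAccountKey '<StorageAccountAccessKey>'
Get-AzureStorageBlob -Container 'msft-cloud-witness' -Context $ctx | Format-Table Name, Length, LastModified -AutoSize

# The blob name should match your cluster's unique id:
(Get-Cluster).Id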

Microsoft Azure Management Portal: Creating Storage Account

When creating the Microsoft Azure Storage Account, it is very important to select "Locally Redundant" for the Replication Type. Failover Cluster uses the blob file as the arbitration point, which requires some consistency guarantees when reading the data.

Microsoft Azure Management Portal: Managing Access Keys

When you create a Microsoft Azure Storage Account, it is associated with two Access Keys that are automatically generated. For the first-time creation of Cloud Witness, use the "Primary Access Key". There is no restriction on which key you use for Cloud Witness.

Microsoft Azure Management Portal: URL Links

When you create a Storage Account, the following URLs are generated using the format: https://<Storage Account Name>.<Storage Type>.<Endpoint>

For Cloud Witness, the storage type is always "Blob". Microsoft Azure uses ".core.windows.net" as the Endpoint. When configuring Cloud Witness, it is possible to configure a different Endpoint as required by your scenario (for example, Microsoft Azure in China has a different Endpoint).

Note: This URL is generated automatically by the Cloud Witness resource, and no extra configuration step is necessary for the URL. I added this section to the blog to give some context on how Cloud Witness reaches Microsoft Azure.

 

Microsoft Azure Management Portal: Container view

If you look under the “msft-cloud-witness” Container that gets created as part of the Storage Account, you would notice a blob getting created corresponding to the cluster GUID:

 

 

Configuring Cloud Witness with Failover Cluster Manager GUI

Cloud Witness configuration is well-integrated within the existing Quorum Configuration Wizard built into the Failover Cluster Manager GUI.

Select “Configure Cluster Quorum Settings”:

 

 

Then, select “Select the quorum witness”:

 

 

Then, select “Configure a cloud witness”:

 

 

On the Cloud Witness configuration page, enter the following information:

  1. (Required parameter) Microsoft Storage Account Name
  2. (Required parameter) Access Key corresponding to the Storage Account
    1. When creating for the first time, use Primary Access Key (see above)
    2. When rotating the Primary Access Key, use Secondary Access Key (see above)
  3. (Optional parameter) If you intend to use a different Azure service endpoint (for example the Microsoft Azure service in China), then update the endpoint server name.

Configuring Cloud Witness with PowerShell

The existing Set-ClusterQuorum PowerShell command has new additional parameters corresponding to Cloud Witness.

You can configure Cloud Witness using PowerShell command:

Set-ClusterQuorum -CloudWitness -AccountName <StorageAccountName> -AccessKey <StorageAccountAccessKey>


In case you need to use a different endpoint (rare):

Set-ClusterQuorum -CloudWitness -AccountName <StorageAccountName> -AccessKey <StorageAccountAccessKey> -Endpoint <servername>
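
Once the command completes, you can quickly confirm that the quorum configuration took effect (a minimal check; the witness shows up as a regular cluster resource, typically named "Cloud Witness"):

# Verify the quorum settings and the new witness resource.
Get-ClusterQuorum | Format-List *
Get-ClusterResource | Where-Object { $_.ResourceType -like '*Cloud Witness*' } | Format-Table Name, State, ResourceType -AutoSize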

Failover Cluster Manager GUI: Cloud Witness

Upon successful configuration of Cloud Witness, you can view the newly created witness resource in the Failover Cluster Manager snap-in:

Microsoft Azure Storage Account considerations with Cloud Witness

  1. Failover Cluster does not store the Access Key itself; rather, it generates a Shared Access Signature (SAS) token from the Access Key you provide and stores that SAS token securely.
  2. The generated SAS token is valid as long as the Access Key remains valid. When rotating the Primary Access Key, it is important to first update the Cloud Witness (on all your clusters that are using that Storage Account) with the Secondary Access Key before regenerating the Primary Access Key (see the sketch after this list).
  3. Cloud Witness uses the HTTPS REST interface of the Microsoft Azure Storage Account service. This means it requires the HTTPS port to be open on all cluster nodes.
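
Here is a hedged sketch of the key-rotation sequence from point 2 above (the account name and key values are placeholders for your own):

# Step 1: re-point every cluster that uses this Storage Account at the Secondary Access Key.
Set-ClusterQuorum -CloudWitness -AccountName '<StorageAccountName>' -AccessKey '<SecondaryAccessKey>'

# Step 2: regenerate the Primary Access Key in the Azure management portal.

# Step 3: when convenient, re-point the clusters at the new Primary Access Key.
Set-ClusterQuorum -CloudWitness -AccountName '<StorageAccountName>' -AccessKey '<NewPrimaryAccessKey>'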

Thank you

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120


In the previous post http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx we discussed how CSVFS abstracts failures from applications by going through the pause/resume state machine, and we also explained what an auto-pause is. The focus of this blog post will be auto-pauses.

CSV auto-pauses when it receives any failure from Direct IO or Block Redirected IO, with a few exceptions like STATUS_INVALID_USER_BUFFER, STATUS_CANCELLED, STATUS_DEVICE_DATA_ERROR or STATUS_VOLMGR_PACK_CONFIG_OFFLINE, which indicate either a user error or that storage is misconfigured. In both cases there is no value in trying to abstract the failure in CSV, because as soon as the IO is retried it will get the same error.

When File System Redirected IO fails (including any metadata IO), CSV auto-pauses only when the error is one of a set of well-known status codes. Here is the list that we have as of the Windows Server Technical Preview for vNext:

STATUS_BAD_NETWORK_PATH
STATUS_BAD_NETWORK_NAME
STATUS_CONNECTION_DISCONNECTED
STATUS_UNEXPECTED_NETWORK_ERROR
STATUS_NETWORK_UNREACHABLE
STATUS_IO_TIMEOUT
STATUS_CONNECTION_RESET
STATUS_CONNECTION_ABORTED
STATUS_NO_SUCH_DEVICE
STATUS_DEVICE_DOES_NOT_EXIST
STATUS_VOLUME_DISMOUNTED
STATUS_NETWORK_NAME_DELETED
STATUS_VOLMGR_VOLUME_LENGTH_INVALID
STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR
STATUS_LOGON_FAILURE
STATUS_NETWORK_SESSION_EXPIRED
STATUS_CLUSTER_CSV_VOLUME_DRAINING
STATUS_CLUSTER_CSV_VOLUME_DRAINING_SUCCEEDED_DOWNLEVEL
STATUS_DEVICE_BUSY
STATUS_DEVICE_NOT_CONNECTED
STATUS_CLUSTER_CSV_NO_SNAPSHOTS
STATUS_FT_WRITE_FAILURE
STATUS_USER_SESSION_DELETED

This list is based on our experience and many years of testing, and includes status codes that you would see when a communication channel fails or when the storage stack is failing. Please note that this list evolves and changes as we discover new scenarios that we can help to make more resilient using auto-pause. The list contains status codes that indicate communication/authentication/configuration failures, status codes that indicate that NTFS or the disk on the coordinating node is failing, and a few CSV-specific status codes.

There are also a few cases when CSV might auto-pause itself to handle some inconsistency that it observes in its state, or when it cannot get to the desired state without compromising data correctness. An example would be when a file is opened from multiple computers, and on a write from one cluster node CSV needs to purge the cache on another node; if that purge fails because someone has locked the pages, then we would auto-pause to see if a retry would avoid the problem. In these cases you might see auto-pauses with status codes like STATUS_UNSUCCESSFUL, STATUS_PURGE_FAILED, or STATUS_CACHE_PAGE_LOCKED.

 

When CSV conducts an auto pause, an event 5120 is written to the System event log.  The description field will contain the specific status code that resulted in the auto pause.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5120
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer available on this node because of 'STATUS_VOLUME_DISMOUNTED(C000026E)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Additional information is available in the CSV operational log channel Microsoft-Windows-FailoverClustering-CsvFs/Operational.  This can be found in Event Viewer under ‘Applications and Services Logs \ Microsoft \ Windows \ FailoverClustering-CsvFs \ Operational’.  Here is an Event 9296 logged to that channel:

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9296
Task Category: Volume Autopause
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} is autopaused. Status 0xC000026E. Source: Tunneled metadata IO

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <EventData>
    <Data Name="Volume">0xffffe000badfb1b0</Data>
    <Data Name="VolumeId">{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name="CountersName">Volume1me3</Data>
    <Data Name="FromDirectIo">false</Data>
    <Data Name="Irp">0xffffcf800fb72990</Data>
    <Data Name="Status">0xc000026e</Data>
    <Data Name="Source">11</Data>
    <Data Name="Parameter1">0x0</Data>
    <Data Name="Parameter2">0x0</Data>
  </EventData>
</Event>

In addition to the status code, Event 9296 will contain the source of the auto-pause, and in some cases may contain additional parameters helping to further narrow down the scenario. Here is the complete list of sources.

  1. Unknown
  2. Tunneled metadata IO
  3. Apply byte range lock on down-level file system
  4. Remove all byte range locks
  5. Remove byte range lock
  6. Continuous availability resume complete
  7. Continuous availability resume complete for paging file object
  8. Continuous availability set bypass
  9. Continuous availability suspend handle on close
  10. Stop buffering on file close
  11. Remove all byte range locks on file close
  12. User requested
  13. Purge on oplock break
  14. Advance VDL on oplock break
  15. Flush on oplock break
  16. Memory allocation to stop buffering
  17. Stopping buffering
  18. Setting maximum oplock level
  19. Oplock break acknowledge to CSV filter
  20. Oplock break acknowledge
  21. Downgrade buffering asynchronous
  22. Oplock upgrade
  23. Query oplock status
  24. Single client notification complete
  25. Single client notification stop oplock

Auto Pause due to STATUS_IO_TIMEOUT

One of the common auto-pause reasons is STATUS_IO_TIMEOUT, caused by intra-cluster communication over the network.  This happens when the SMB client observes that an IO is taking over 1-4 minutes (depending on IO type). If an IO times out, the SMB client will attempt to fail the IO over to another channel in a multichannel configuration, or, if all channels are exhausted, it will fail the IO back to the caller.
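
Since these retries ride on SMB Multichannel, it can be worth checking what channels SMB actually has available between the cluster nodes. A quick sketch with the built-in SMB cmdlets, run on a non-coordinating node:

# List the SMB multichannel connections; these are the alternate channels SMB can
# retry an IO on before failing it back to CSVFS with STATUS_IO_TIMEOUT.
Get-SmbMultichannelConnection

# Show any constraints that limit which NICs SMB Multichannel may use.
Get-SmbMultichannelConstraint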

You can learn more about cluster networking and SMB Multichannel in the following blog posts:

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters
http://blogs.msdn.com/b/clustering/archive/2011/01/05/10112055.aspx

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters - Part II
http://blogs.msdn.com/b/clustering/archive/2011/01/19/10117423.aspx

Configuring IP Addresses and Dependencies for Multi-Subnet Clusters - Part III
http://blogs.msdn.com/b/clustering/archive/2011/08/31/10204142.aspx

Force network traffic through a specific NIC with SMB multichannel
http://blogs.msdn.com/b/emberger/archive/2014/09/15/force-network-traffic-through-a-specific-nic-with-smb-multichannel.aspx


In the diagram above you can see a two-node cluster where Node 2 is the coordinator node. Let's say an application running on Node 1 issues an IO or metadata operation that CSVFS forwards to NTFS over SMB (follow the red path on the diagram above). Any of the components along the red path (network, file system drivers attached to NTFS, volume and disk drivers, software and hardware on the storage box, firmware on the disk) can take a long time. Once the SMB client sends the IO it starts a timer. If the IO does not complete in 1-4 minutes, the SMB client will suspect that there might be something wrong with the network. It will disconnect the socket and retry all IOs using another socket on another channel. If all channels have been tried, the IO fails with STATUS_IO_TIMEOUT. In the case of CSV there are some internal controls (for example oplock requests) that cannot simply be retried on another channel, so the SMB client fails them back to CSVFS, which triggers an auto-pause with STATUS_IO_TIMEOUT.

Please note that CSVFS on the coordinating node does not use SMB to communicate with NTFS, so those IOs will not complete with STATUS_IO_TIMEOUT from the SMB client.

The next question is how we can find what operation is taking time, and why?

First, please note that an auto-pause with STATUS_IO_TIMEOUT is reported on a non-coordinating node (Node 1 in the diagram above) while the IO is stuck on the coordinating node (Node 2 in the diagram above).

Second, please note that the nature of the issue we are dealing with is a hang, and in this case traces are not particularly helpful because in a trace it is hard to tell which activity took the time and where it was stuck. We have found two approaches to be helpful when troubleshooting these sorts of issues:

  1. Collect a dump file on the coordinating node while the hanging IO is in flight. There are a number of options for creating a dump file, starting from the most brutal:
    1. Bugchecking your machine using Sysinternals NotMyFault (http://technet.microsoft.com/en-us/sysinternals/bb963901)
    2. Configuring KD and using Sysinternals LiveKd (http://technet.microsoft.com/en-us/sysinternals/bb897415.aspx)
    3. Windbg. In fact this approach was so productive that starting with Windows Server Technical Preview, when the cluster observes an auto-pause due to STATUS_IO_TIMEOUT on a non-coordinating node, it will collect a kernel live dump on the coordinating node. You can open the dump file using WinDbg (http://msdn.microsoft.com/en-us/library/windows/hardware/ff551063(v=vs.85).aspx) and try to find out which IO is taking a long time and why.
  2. On the coordinating node keep running a Windows Performance Toolkit (http://msdn.microsoft.com/en-us/library/windows/apps/dn391696.aspx) session with wait analysis enabled (http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-43-WPT-Wait-Analysis). When the non-coordinating node auto-pauses with STATUS_IO_TIMEOUT, stop the WPT session and collect the etl file. Open the etl using WPA and try to locate the IO that is taking a long time, the thread that is executing this IO, and what this thread has been blocked on. In some cases it might be helpful to also keep the WPT sampling profiler enabled, in case the thread handling the IO is not stuck forever but periodically makes some forward progress.

The reason for a STATUS_IO_TIMEOUT might vary from software and configuration issues to hardware issues. Always check your system event log for events indicating HBA or disk failures. Make sure you have all the latest updates.

Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters
http://support.microsoft.com/kb/2920151

Recommended hotfixes and updates for Windows Server 2012-based failover clusters
http://support.microsoft.com/kb/2784261

Make sure your storage and disks have the latest firmware supported for your environment. If the issue is not going away, troubleshoot it using one of the approaches described above and analyze the dump or trace.

It is expected that you may at times see Event 5120s in the System event log. I would suggest not worrying about infrequent 5120s as long as they happen only once in a while (once a month or once a week), the cluster recovers, and you do not see workload failures. But I would suggest monitoring them and doing some data mining on the frequency and type (source and status code) of auto-pauses.  In some scenarios, an Event 5120 may be expected.  This blog is an example of when an Event 5120 is expected during snapshot deletion:  http://blogs.msdn.com/b/clustering/archive/2014/02/26/10503497.aspx 
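
Here is a minimal data-mining sketch (it assumes the EventData layout shown in the sample 9296 event above): pull the auto-pause events from the CsvFs operational channel and group them by status code and source, so you can spot which failure dominates and when it started.

# Group CSV auto-pause events (9296) by status code and source.
$events = Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-FailoverClustering-CsvFs/Operational'
    Id      = 9296
}
$events | ForEach-Object {
    $xml = [xml]$_.ToXml()
    [pscustomobject]@{
        Time   = $_.TimeCreated
        Status = ($xml.Event.EventData.Data | Where-Object Name -eq 'Status').'#text'
        Source = ($xml.Event.EventData.Data | Where-Object Name -eq 'Source').'#text'
    }
} | Group-Object Status, Source | Sort-Object Count -Descending | Format-Table Count, Name -AutoSize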

For instance, if you see that the frequency of auto-pauses increased after a certain date, check whether you installed or enabled a feature that was not on or was not used before.

You might be able to correlate an auto-pause with some other activity that was happening on one of the cluster nodes around the same time, for example a backup or an antivirus scan.

Or perhaps you see that an auto-pause happens only when a certain node is coordinating; then there might be an issue with the hardware on that node.

Or perhaps a physical disk is going bad and causing the failure; in that case look for storage errors in the System event log and query the disk reliability counters using PowerShell:

Get-PhysicalDisk | Get-StorageReliabilityCounter |  ft DeviceId,ReadErrorsTotal,ReadLatencyMax,WriteErrorsTotal,WriteLatencyMax -AutoSize

The list above is not exhaustive, but might give you some idea on how to approach the problem.

Summary

In this blog post we went over possible causes for the event 5120, what they might mean, and how to approach troubleshooting. Windows Server has plenty of tools that will help you with troubleshooting a 5120. Keep in mind that a 5120 does not mean that your workload failed; most likely the cluster will successfully recover from that failure and your workload will keep running. If recovery was not successful you will see event 5142, and that will be the subject of the next post.

Thanks!
Vladimir Petter
Principal Software Engineer
Clustering & High-Availability
Microsoft

 

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142


In the last post http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx we discussed event 5120, which indicates that Cluster Shared Volumes (CSV) observed an error and attempted to recover. In this post we will discuss the cases when recovery does not succeed. When CSV recovery does not succeed, an Event 5142 is logged to the System event log.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Event ID:      5142
Task Category: Cluster Shared Volume
Level:         Error
Description:
Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

In this post we will go over several possible root causes which may result in a 5142, and how to identify which issue you are hitting.

Cluster Service Failed

When the Cluster Service fails on a node, the Cluster Shared Volumes file system (CSVFS) will invalidate all file objects on all the volumes on that node. You may not see an event 5142 in this case, because the cluster may not have an opportunity to log it since the service failed. You can find these cases by scanning Microsoft-Windows-FailoverClustering-CsvFs/Operational for the following sequence of events:

 

 

The first event, 8960, indicates that CSVFS is moving the volume to the Init state and that DcmSequenceId is empty, which means this command did not come from the cluster service; CSVFS initiated this activity on its own.

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      8960
Task Category: Volume State Change Started
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} transitioning from Init to Init.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="Volume">0xffffe000badfb1b0</Data>
    <Data Name="VolumeId">{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name="CurrentState">0</Data>
    <Data Name="NewState">0</Data>
    <Data Name="DcmSequenceId">
    </Data>
  </EventData>
</Event>

The next event, 9216, tells us that CSVFS successfully finished the transition to the Init state.

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9216
Task Category: Volume State Change Completed
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} moved to state Init. Reason Transition to Init; Status 0x0.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="Volume">0xffffe000badfb1b0</Data>
    <Data Name="VolumeId">{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name="State">0</Data>
    <Data Name="Source">8</Data>
    <Data Name="Status">0x0</Data>
    <Data Name="DcmSequenceId">
    </Data>
  </EventData>
</Event>

And finally an event 49152 is logged which provides the details of why this transition was done, in this case because CSVFS observed that the Cluster Service is terminating.

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      49152
Task Category: ClusterDisconnected
Level:         Information
Keywords:      ClusterServiceState
Description:
Cluster service disconnected.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="FileObject">0xffffe000bab597c0</Data>
    <Data Name="ProcessId">0x1070</Data>
  </EventData>
</Event>
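
A minimal sketch for scanning the CsvFs operational channel for this 8960 / 9216 / 49152 sequence from PowerShell (rather than paging through Event Viewer):

# Pull the volume state change and cluster-disconnect events in time order.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-FailoverClustering-CsvFs/Operational'
    Id      = 8960, 9216, 49152
} | Sort-Object TimeCreated | Format-Table TimeCreated, Id, TaskDisplayName, Message -Wrap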

In the cluster logs, every time the cluster starts you will see a log statement which contains "--------------+", so you can look for the last statement from the previous cluster instance to see what the cluster service had been doing right before terminating.

Here is what was logged in the cluster log when I terminated Cluster Service:
00001070.000015e0::2014/10/23-00:12:29.885 DBG   [API] s_ApiOpenKey: "ServerForNFS\ReadConfig" failed with error 2

And then the ClusSvc was started again:
00000f10.00000fa4::2014/10/23-00:13:03.287 INFO  -----------------------------+ LOG BEGIN +-----------------------------

If it is unknown why the cluster service terminated, you can start reading the cluster logs from this point backwards, trying to understand why the cluster service went down.
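
A quick sketch for generating the cluster logs and finding that marker (the destination path here is just an example):

# Generate the cluster log from every node into a local folder, then find the LOG BEGIN markers.
New-Item -ItemType Directory -Path C:\ClusterLogs -Force | Out-Null
Get-ClusterLog -Destination C:\ClusterLogs -UseLocalTime | Out-Null
Select-String -Path C:\ClusterLogs\*.log -Pattern 'LOG BEGIN' | Format-Table Filename, LineNumber, Line -AutoSize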

When the cluster service fails on one of the nodes, the CSV volumes on that node go down. The CSV volumes will stay up on the other nodes. If the node with the failed clussvc was the coordinator, then CSV on the other nodes will be paused until the cluster fails over and brings the disk online on a surviving node.

Disk Failure or Offline

When the cluster exhausts all restart attempts to bring a disk online after too many failures, or when a user manually takes the disk offline, the cluster will move the CSV volumes corresponding to this disk to the Init state.

For instance, if you take a disk offline using Failover Cluster Manager, then in the Microsoft-Windows-FailoverClustering-CsvFs/Operational channel we would see the following events:

  

 Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      8960
Task Category: Volume State Change Started
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} transitioning from Active to Init.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="Volume">0xffffe000885581b0</Data>
    <Data Name="VolumeId">{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name="CurrentState">4</Data>
    <Data Name="NewState">0</Data>
    <Data Name="DcmSequenceId">&lt;1:60129542151&gt;&lt;60129542147&gt;</Data>
  </EventData>
</Event>

 

Log Name:      Microsoft-Windows-FailoverClustering-CsvFs/Operational
Source:        Microsoft-Windows-FailoverClustering-CsvFs-Diagnostic
Event ID:      9216
Task Category: Volume State Change Completed
Level:         Information
Keywords:      Volume State
Description:
Volume {ca4ce06f-6b06-4405-b058-fd9d1cf869b3} moved to state Init. Reason Transition to Init; Status 0x0.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="Volume">0xffffe000885581b0</Data>
    <Data Name="VolumeId">{CA4CE06F-6B06-4405-B058-FD9D1CF869B3}</Data>
    <Data Name="State">0</Data>
    <Data Name="Source">8</Data>
    <Data Name="Status">0x0</Data>
    <Data Name="DcmSequenceId">&lt;1:60129542151&gt;&lt;60129542147&gt;</Data>
  </EventData>
</Event>

The DcmSequenceId is not empty, which means that the command came to CSVFS from the cluster. Using DcmSequenceId <1:60129542151><60129542147> you can correlate this to the place in the cluster log where the cluster service initiated that state transition:
[Verbose] 000004dc.00001668::2014/10/23-00:57:00.587 INFO  [DCM] FilterAgent: ChangeCsvFsState: uniqueId ca4ce06f-6b06-4405-b058-fd9d1cf869b3, state CsvFsVolumeStateInit, sequence <1:60129542151><60129542147>

And working from this point backwards I see that there was a manual offline of the disk
[Verbose] 000004dc.000012dc::2014/10/23-00:57:00.527 INFO  [RCM] rcm::RcmApi::OfflineResource: (Cluster Disk 3, 0)

If the disk is failing, you will find records about that in the cluster logs, along with the reason for the failure.
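
Once you have the DcmSequenceId from the 8960/9216 events, a plain text search over the generated cluster log gets you to the initiating action (an illustrative sketch; the path and the sequence value below come from the example above and will differ in your logs):

# Search the cluster log for the sequence number and for resource offline calls.
Select-String -Path 'C:\ClusterLogs\*.log' -Pattern '60129542151', 'OfflineResource' | Format-Table Filename, LineNumber, Line -Wrap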

Volume Is Failing Too Often

If the cluster observes that a CSV volume on one of the nodes is failing too often and is unable to stay in a good state for 5 minutes without running into an auto-pause, then the volume will be taken down on that node. On the other nodes the volume will remain active. After several minutes the cluster will attempt to revive the volume back to the Active state. In this case, in the cluster logs you would see statements similar to this:
00000ab8.0000109c::2014/10/21-02:36:23.388 INFO  [DCM] UnmapPolicy::enter_CountingToBad(aec3c2e8-a7eb-45e9-9509-f63190659ba4): goodTimer P0...75, badTimer R0...150, badCounter 1 state CountingToBad

00000ab8.0000109c::2014/10/21-02:36:23.544 INFO  [DCM] CsvFs Listener: state [volume aec3c2e8-a7eb-45e9-9509-f63190659ba4, sequence <><145>, state CsvFsVolumeStateChangeFromInit->CsvFsVolumeStateInit, status 0x0]

And in the Microsoft-Windows-FailoverClustering-CsvFs/Operational channel you will see correlating events 8960 and 9216 with the matching DcmSequenceId - <><145>. Note that the first part of the sequence id is empty. This is because the action is not global for all nodes, but only for the current node. In general the sequence format is:

<Id of the cluster node that initiated this action:Sequence number of action on the node that initiated this action><Sequence number of the action on the node where the action was executed >

Recovery Is Taking Too Long

Disk Is Timing Out During Online

Once CSVFS completes a state transition it waits for the cluster to start the next one, but CSVFS will not wait indefinitely. Depending on the volume type and state, CSVFS will wait from 1 to 10 minutes. For example, with a snapshot volume CSVFS will wait for only 1 minute. For a volume which has some dirty pages in the file system cache CSVFS will wait for 3 minutes. For a volume which has no dirty pages CSVFS will wait up to 10 minutes. If the cluster does not start the next state transition in that time, then CSVFS will move the volume to the Init state.

In the Microsoft-Windows-FailoverClustering-CsvFs/Operational channel you will see events 8960 and 9216 similar to the case when the cluster service was terminated. DcmSequenceId will be empty because the state change was initiated by CSVFS. You will NOT see event 49152 saying that the cluster service has disconnected. In the cluster logs you will see a log record that the volume went to the Init state, with an empty sequence number:
00000af0.00000f88::2014/10/17-06:37:58.325 INFO  [DCM] CsvFs Listener: state [volume aec3c2e8-a7eb-45e9-9509-f63190659ba4, sequence, state CsvFsVolumeStateChangeFromInit->CsvFsVolumeStateInit, status 0x0]

The next step is to find the node which owned the cluster Physical Disk resource at the moment of the failure and use the cluster logs to try to identify why the disk online is taking so much time.

CSV State Transition Is Taking Too Long

This case is similar to the case above: CSVFS transitioned the volume to the Init state because it timed out waiting for the cluster to start the next state transition. The reason why this happens may vary. In this case the disk may be staying online and healthy the whole time, but CSVFS on another node might take too long to finish its state transitions. The result will be similar, and the events that you find in the CSVFS logs and cluster logs will be similar.

Summary

In this blog post we went over common reasons for the event 5142, where the Cluster Service fails to recover CSV and you see your workloads failing. It explained the sequences of events you will see in this case in the CSVFS operational channel, and how to correlate them with the cluster logs.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft
 

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx

Failover Clustering Sessions @ Ignite 2015 in Chicago


If you are going to Ignite 2015 in Chicago next week, here are the cluster related sessions you might want to check out.  We'll be talking about some exciting new enhancements coming in vNext.  Weren't able to make it this year?  Don't worry, all the sessions are streamed live and also recorded so that you can watch them any time at http://channel9.msdn.com/

BRK3474 - Enabling Private Cloud Storage Using Servers With Local Disks
Have you ever wanted to build a Scale-Out File Server using shared nothing Direct Attached Storage (DAS) hardware like SATA or NVMe disks? We cover advances in Microsoft Software Defined Storage that enables service providers to build Scale-Out File Servers using Storage Spaces with shared nothing DAS hardware.

BRK3484 - Upgrading Your Private Cloud to Windows Server 2012 R2 and Beyond!
We are moving fast, and want to help you to keep on top of the latest technology! This session covers the features and capabilities that will enable you to upgrade to Windows Server 2012 R2 and to Windows Server vNext with the least disruption. Understand cluster role migration, cross version live migration, rolling upgrades, and more.

BRK3487 - Stretching Failover Clusters and Using Storage Replica in Windows Server vNext
In this session we discuss the deployment considerations of taking a Windows Server Failover Cluster and stretching across sites to achieve disaster recovery. This session discusses the networking, storage, and quorum model considerations. This session also discusses new enhancements coming in vNext to enable multi-site clusters.

BRK3489 - Exploring Storage Replica in Windows Server vNext
Delivering business continuity involves more than just high availability, it means disaster preparedness. In this session, we discuss the new Storage Replica feature, including scenarios, architecture, requirements, and demos. Along with our new stretch cluster option, it also covers use of Storage Replica in cluster-to-cluster and non-clustered scenarios. And we have swag!

BRK3558 - Microsoft SQL Server End-to-End High Availability and Disaster Recovery
In this session we look at options which are available to the administrator of a Microsoft SQL Server 2014 database server so that the system can provide the 99.99% or higher uptime that customers demand. These options include Failover Cluster Instances, as well as AlwaysOn Availability Groups within a single site, stretching across multiple sites, as well as stretching into the Microsoft Azure public cloud. Learn when to use each technique, how to decide which option to implement, and how to implement these solutions.

BRK4105 - Under the Hood with DAGs
Join this session to learn from the DAG master Tim McMichael. The session examines how a Microsoft Exchange 2013 Database Availability Group leverages the Windows Failover Clustering service. As a bonus, it provides a sneak peek at how this will evolve with Exchange Server vNext and Windows 10. It also explores registry replication, cluster networking, and cluster features such as dynamic quorum and dynamic witness. After this session, administrators should have an understanding of cluster integration and basic support knowledge.

BRK3496 - Deploying Private Cloud Storage with Dell Servers and Windows Server vNext
The storage industry is going through strategic tectonic shifts. In this session, we’ll walk through Dell’s participation in the Microsoft Software Defined Storage journey and how cloud scale scenarios are shaping solutions. We will provide technical guidance for building Storage Spaces in Windows Server vNext clusters on the Dell PowerEdge R730xd platform.

Invitation: Provide feedback, comments, and vote on Cluster UserVoice page


The clustering team has a new UserVoice page here: http://windowsserver.uservoice.com/forums/295074-clustering that is part of the Windows Server UserVoice page: http://windowsserver.uservoice.com/forums/295047-general-feedback.

We welcome your feedback, comments, and votes – we would like to make Windows Server 2016 the best operating system for you and your customers.

PS – You can find Windows Server 2016 Technical Preview 2 (TP2) here: http://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-technical-preview

-RH.
