
Windows Server 2016 Failover Cluster Troubleshooting Enhancements - Cluster Log


Cluster Log Enhancements

This is the first in a series of blog posts that will provide details about the improvements we have made in the tools and methods for troubleshooting Failover Clusters in Windows Server 2016.

Failover Clustering has a diagnostic log running on each server that allows in-depth troubleshooting of problems without having to reproduce the issue. This log is valuable to Microsoft support as well as to anyone who has expertise in troubleshooting failover clusters.

Tip: Always go to the System event log first when troubleshooting an issue. Failover Clustering posts events to the System event log that are often enough to understand the nature and scope of the problem. It also gives you the specific date/time of the problem, which is useful if you then look at other event logs or dig into the cluster.log.

Generating the Cluster.log

This is not new, but will be helpful information for those that aren’t familiar with generating the cluster log.

Get-ClusterLog is the Windows PowerShell cmdlet that generates the cluster.log on each server that is a member of the cluster and is currently running; on a 3-node cluster it produces three logs, one per node.

The Cluster.log files can be found in the <systemroot>\cluster\reports directory (usually c:\windows\cluster\Reports) on each node.

You can use the -Destination parameter to have the files copied to a specified directory with each server's name appended to the log name, which makes it much easier to gather and analyze logs from multiple servers.
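For example, a minimal sketch (the destination folder is an example path; create it first if it does not exist):

Get-ClusterLog
Get-ClusterLog -Destination C:\Temp\ClusterLogs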

 

Other useful parameters are discussed in the rest of this blog.

What’s New

I’m going to highlight the enhancements to the information in the Windows Server 2016 cluster.log that will be the most interesting and useful to the general audience interested in troubleshooting failover clusters, and leave detailing every item in the log to a future blog(s).  I’m including references and links to resources related to troubleshooting clusters and using the cluster log at the end of this blog. 

TimeZone Information

The cluster.log is a dump of the diagnostic information from the system, captured in a text file. The time stamps default to UTC (which some people call GMT).  Therefore, if you are in a time zone where UTC is local time + 8 hours, you need to take the time stamp in the cluster log and subtract 8 hours to get local time. For instance, if a problem occurred at 1:38pm (13:38) local time in that time zone, the UTC time stamp in the cluster log would be 21:38.

We offer two enhancements in the cluster.log that make this time zone and UTC offset easier to discover and work with:

  1. UTC offset of the server: The top of the cluster.log notes the UTC offset of the originating server.  In the example below, it notes that the server's time zone offset is 7 hours (420 minutes).  Specifically noting this offset in the log removes the guesswork related to the system's time zone setting.
  2. Cluster log uses UTC or local time: The top of the cluster.log notes whether the log was created using UTC or local time for the timestamps.  The -UseLocalTime parameter for Get-ClusterLog causes the cluster.log to write timestamps that are already adjusted for the server's time zone instead of using UTC.  This is not new, but it became obvious that it's helpful to know whether that parameter was used, so it's noted in the log.

[===Cluster ===]

UTC= localtime + time zone offset; with daylight savings, the time zone offset of this machine is 420 minutes, or 7 hours

The logs were generated using Coordinated Universal Time (UTC). 'Get-ClusterLog -UseLocalTime' will generate in local time.

Tip: The sections of the cluster.log are encased in [===   ===], which makes it easy to navigate down the log to each section by doing a find on “[===”.  As a bit of trivia, this format was chosen because it kind of looks like a Tie Fighter and we thought it looked cool.

Cluster Objects

The cluster has objects that are part of its configuration.  Getting the details of these objects can be useful in diagnosing problems.  These objects include resources, groups, resource types, nodes, networks, network interfaces, and volumes.  The cluster.log now dumps these objects in a Comma Separated Values list with headers. 
Here is an example:

[===Networks ===]

Name,Id,description,role,transport,ignore,AssociatedInterfaces,PrefixList,address,addressMask,ipV6Address,state,linkSpeed,rdmaCapable,rssCapable,autoMetric,metric,

Cluster Network 1,27f2d19b-7e23-4ee3-a226-287d4ebe9113,,1,TCP/IP,false,82e5107c-5375-473a-ab9f-5b6450bf5c7f30ff5ff6-00a3-494b-84b6-62a27ef99bb3 187c582d-f23c-48f4-8c37-6a452b2a238b,10.10.1.0/24 ,10.10.1.0,255.255.255.0,,3,1000000000,false,false,true,39984,

Cluster Network 2,e6efd1f6-474b-410a-bd7b-5ece99476cd8,,1,TCP/IP,false,57d9b74d-8d9e-4afe-8667-e91e0bd23412617bb075-3803-4e5e-a039-db513d60603d 51c4fd42-9cb4-4f2e-a65c-01fea9bfa582,10.10.3.0/24 ,10.10.3.0,255.255.255.0,,3,1000000000,false,false,true,39985,

Cluster Network 3,1a5029c7-7961-40bb-b6b9-dcbbe4187034,,3,TCP/IP,false,d3cdef35-82bc-4a60-8ed4-5c2b278f7c0e83c7c4b8-b588-425c-bfae-0c69d7a45bcd c1fb12d2-071b-4cb2-8ca7-fa04e972cd1c,157.59.132.0/22 2001:4898:28:4::/64,157.59.132.0,255.255.252.0,2001:4898:28:4::,3,100000000,false,false,true,80000,

These sections can be consumed by any application that can parse CSV text.  Or, you can copy/paste into an Excel spreadsheet, which makes it easier to read and also provides filter/sort/search.  For the example below, I pasted the above section into a spreadsheet and then used the “Text to Columns” action on the “DATA” tab of Microsoft Excel.
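If you prefer PowerShell over Excel, a minimal sketch along these lines can pull a section out of the log and convert it to objects (the log path, section name, and displayed columns are examples):

$log   = Get-Content 'C:\Windows\Cluster\Reports\Cluster.log'
$start = ($log | Select-String -Pattern '\[===\s*Networks\s*===\]' | Select-Object -First 1).LineNumber
$rest  = $log[$start..($log.Count - 1)]                          # lines after the section header
$next  = ($rest | Select-String -Pattern '\[===' | Select-Object -First 1).LineNumber
$rows  = $rest[0..($next - 2)] | Where-Object { $_ -match ',' } | ForEach-Object { $_.TrimEnd(',') }
$rows | ConvertFrom-Csv | Format-Table Name, role, address, state, linkSpeed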

 

New Verbose Log

New for Windows Server 2016 is the DiagnosticVerbose event channel.  This is a new channel that is in addition to the Diagnostic channel for FailoverClustering. 

In most cases the Diagnostic channel, with the log level set to the default of 3, captures enough information that an expert troubleshooter or Microsoft support engineer can understand a problem.  However, there are occasions where more verbose logging is needed and the cluster log level must be set to 5, which causes the Diagnostic channel to start adding verbose events to the log.  After changing the log level you have to reproduce the problem and analyze the logs again.
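As a sketch, you can check the current level and temporarily raise it to verbose with the cmdlets below (remember to set it back to the default of 3 when you are done; ClusterLogLevel is the cluster common property that holds the setting):

(Get-Cluster).ClusterLogLevel
Set-ClusterLog -Level 5
# reproduce the problem, run Get-ClusterLog, then return to the default
Set-ClusterLog -Level 3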

The question arises, why don’t we suggest keeping the log level at 5?  The answer is that it causes the logs to have more events and therefore wrap faster.  Being able to go back for hours or days in the logs is also desirable so the quicker wrapping poses its own troubleshooting problem.

To accommodate wanting verbose logging for the most recent time frame, while also having logging that provides adequate history, we implemented a parallel diagnostic channel called DiagnosticVerbose.  The DiagnosticVerbose log is always set to the equivalent of cluster log level 5 (verbose) and runs in parallel to the Diagnostic channel for FailoverClustering.

You can find the DiagnosticVerbose section in the cluster.log by doing a find on “DiagnosticVerbose”.  It will take you to the section header:

[=== Microsoft-Windows-FailoverClustering/DiagnosticVerbose ===]

[Verbose] 00000244.00001644::2015/04/22-01:04:29.623 DBG  
[RCM] rcm::PreemptionTracker::GetPreemptedGroups()

[Verbose] 00000244.00001644::2015/04/22-01:04:29.623 DBG  
[RCM] got asked for preempted groups, returning 0 records

 

The Diagnostic channel (default log level of 3) can be found by doing a find on “Cluster Logs”:

[=== Cluster Logs ===]

00000e68.00000cfc::2015/03/23-22:12:24.682 DBG   [NETFTAPI] received NsiInitialNotification

00000e68.00000cfc::2015/03/23-22:12:24.684 DBG   [NETFTAPI] received NsiInitialNotification

Events From Other Channels

There is a “Tip” above that notes the recommendation to start in the system event log first.  However, it’s not uncommon for someone to generate the cluster logs and send them to their internal 3rd tier support or to other experts.  Going back and getting the system or other event logs that may be useful in diagnosing the problem can take time, and sometimes the logs have already wrapped or have been cleared. 

New in the Windows Server 2016 cluster log, the following event channels are also dumped into the cluster.log for each node.  Since they are all in one file, you no longer need to go to each node and pull the logs individually.

 

[=== System ===]

[=== Microsoft-Windows-FailoverClustering/Operational logs ===]

[=== Microsoft-Windows-ClusterAwareUpdating-Management/Admin logs ===]

[=== Microsoft-Windows-ClusterAwareUpdating/Admin logs ===]

 

Here is an example:

[=== System ===]

[System]
00000244.00001b3c::2015/03/24-19:46:34.671 ERR   Cluster resource 'Virtual Machine <name>' of type 'Virtual Machine' in clustered role '<name>' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

[System] 00000244.000016dc::2015/04/14-23:43:09.458 INFO The Cluster service has changed the password of account 'CLIUSR' on node '<node name>'.

 

Tip: If the size of the cluster.log file is bigger than you desire, the -TimeSpan parameter for Get-ClusterLog limits how far back in time (in minutes) events are collected.  For instance, Get-ClusterLog -TimeSpan 10 will cause the cluster.log on each node to be created with only events from the last 10 minutes.  That includes the Diagnostic, DiagnosticVerbose, and other channels that are included in the report.
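For example (the destination folder is an example path):

Get-ClusterLog -TimeSpan 10 -Destination C:\Temp\ClusterLogs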

Cluster.log References:

Troubleshooting Windows Server 2012 Failover Clusters, How to get to the root of the problem:  http://windowsitpro.com/windows-server-2012/troubleshooting-windows-server-2012-failover-clusters

Get-ClusterLog: https://technet.microsoft.com/en-us/library/hh847315.aspx

Set-ClusterLog: https://technet.microsoft.com/en-us/library/ee461043.aspx


Windows Server 2016 Failover Cluster Troubleshooting Enhancements - Active Dump


Active Dump

The following enhancement is not specific to Failover Cluster or even Windows Server.  However, it has significant advantages when you are troubleshooting and getting memory.dmp files from servers running Hyper-V.

Memory Dump Enhancement – Active memory dump

Servers used as Hyper-V hosts tend to have a significant amount of RAM, and a complete memory dump includes processor state as well as everything in RAM, which makes the .dmp file for a complete dump extremely large.  On these Hyper-V hosts, the parent partition usually accounts for a small percentage of the overall RAM of the system, with the majority of the RAM allocated to virtual machines (VMs).  It's the parent partition memory that is interesting when debugging a bugcheck or other bluescreen; the VM memory pages are not important for diagnosing most problems.

Windows Server 2016 introduces a dump type of “Active memory dump”, which filters out most memory pages allocated to VMs and therefore makes the memory.dmp much smaller and easier to save/copy. 

As an example, I have a system with 16GB of RAM running Hyper-V and I initiated bluescreens with different crash dump settings to see what the resulting memory.dmp file size would be.  I also tried “Active memory dump” with no VMs running and with two VMs taking up 8 of the 16GB of memory to see how effective it would be:

 

Dump type                                  Memory.dmp size (KB)    % compared to Complete
Complete Dump                              16,683,673              100%
Active Dump (no VMs)                       1,586,493               10%
Active Dump (VMs with 8GB RAM total)       1,629,497               10%
Kernel Dump (VMs with 8GB RAM total)       582,261                 3%
Automatic Dump (VMs with 8GB RAM total)    587,941                 4%

*The size of the Active Dump as compared to a complete dump will vary depending on the total host memory and what is running on the system. 

In looking at the numbers in the table above, keep in mind that the Active dump is larger than the Kernel dump but includes the user-mode space of the parent partition, while being only about 10% of the size of the Complete dump that would otherwise have been required to capture that user-mode space.

Configuration

The new dump type can be chosen through the Startup and Recovery dialog as shown here:  
  


 

The memory.dmp type can also be set through the registry under the following key (if you change it directly in the registry, the change will not take effect until the system is restarted):  HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\

Note: Information on setting memory dump types directly in the registry for previous versions can be found in a blog here.

To configure the Active memory dump there are two values that need to be set; both are REG_DWORD values.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\CrashDumpEnabled

The CrashDumpEnabled value needs to be 1, which is the same as a complete dump.

And

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\FilterPages

The FilterPages value needs to be set to 1.

Note: The FilterPages value will not be found under the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl\ key unless the GUI “Startup and Recovery” dialog is used to set the dump type to “Active memory dump”, or you manually create and set the value.

 

If you would like to set this via Windows PowerShell, here is the flow and an example:

  1. Gets the value of CrashDumpEnabled
  2. Sets the value of CrashDumpEnabled to 1 (so effectively this is now set to Complete dump).
  3. Gets the value of FilterPages (note that there is an error because this value doesn’t exist yet).
  4. Sets the value of FilterPages to 1 (this changes it from Complete dump to Active dump)
  5. Gets the value of FilterPages to verify it was set correctly and exists now.

 

Here is a text version of what is shown above, to make it easier to copy/paste:

# 1. Get the current value of CrashDumpEnabled
Get-ItemProperty -Path HKLM:\System\CurrentControlSet\Control\CrashControl -Name CrashDumpEnabled
# 2. Set CrashDumpEnabled to 1 (effectively a Complete dump at this point)
Set-ItemProperty -Path HKLM:\System\CurrentControlSet\Control\CrashControl -Name CrashDumpEnabled -Value 1
# 3. Get FilterPages (an error is expected if the value does not exist yet)
Get-ItemProperty -Path HKLM:\System\CurrentControlSet\Control\CrashControl -Name FilterPages
# 4. Set FilterPages to 1 (this changes the Complete dump to an Active dump)
Set-ItemProperty -Path HKLM:\System\CurrentControlSet\Control\CrashControl -Name FilterPages -Value 1
# 5. Get FilterPages again to verify it now exists and is set correctly
Get-ItemProperty -Path HKLM:\System\CurrentControlSet\Control\CrashControl -Name FilterPages

Testing Storage Spaces Direct using Windows Server 2016 virtual machines


Windows Server Technical Preview 2 introduces Storage Spaces Direct (S2D), which enables building highly available (HA) storage systems with local storage. This is a significant step forward in Microsoft Windows Server software-defined storage (SDS), as it simplifies the deployment and management of SDS systems and also unlocks the use of new classes of disk devices, such as SATA disk devices, that were previously unavailable to clustered Storage Spaces with shared disks. The following document has more details about the technology, the functionality, and how to deploy it on physical hardware.

Storage Spaces Direct Experience and Installation Guide

That experience and install guide notes that to be reliable and perform well in production, you need specific hardware (see the document for details).  However, we recognize that you may want to experiment and kick the tires a bit in a test environment, before you go and purchase hardware. Therefore, as long as you understand it’s for basic testing and getting to know the feature, we are OK with you configuring it inside of Virtual Machines. 

If you want to verify specific capabilities, performance, and reliability, you will need to work with your hardware vendor to acquire approved servers and configuration requirements.

Assumptions for this Blog

-         You have a working knowledge of how to configure and manage Virtual Machines (VMs).

-         You have a basic knowledge of Windows Server Failover Clustering (cluster).

Pre-requisites

-        Windows Server 2012 R2 or Windows Server 2016 with the Hyper-V role installed and configured to host VMs. 

-        Enough capacity to host four VMs with the configuration requirements noted below.

-        Hyper-V servers can be part of a host failover cluster, or stand-alone. 

-        VMs can be located on the same server, or distributed across servers (as long as the networking connectivity allows for traffic to be routed to all VMs with as much throughput and lowest latency possible.)

Note:  These instructions and guidance focus on using our latest Windows Server releases as the hypervisor; Windows Server 2012 R2 and Windows Server 2016 (pre-release) is what I use.  There is nothing that restricts you from trying this with other private or public clouds.  However, this blog post does not cover those scenarios, and whether or not they work will depend on the environment providing the necessary storage/network and other resources.  We will update our documentation as we verify other private or public clouds.

Overview of Storage Spaces Direct

S2D uses disks that are exclusively connected to one node of a Windows Server 2016 failover cluster and allows Storage Spaces to create pools using those disks. Virtual Disks (Spaces) that are configured on the pool will have their redundant data (mirrors or parity) spread across the nodes of the cluster.  This allows access to data even when a node fails, or is shutdown for maintenance.

You can implement S2D in VMs, with each VM configured with two or more virtual disks connected to the VM's SCSI controller.  Each node of the cluster running inside a VM will be able to connect only to its own disks, but S2D allows all the disks to be used in storage pools that span the cluster nodes.

S2D uses SMB as the transport to send redundant data, for the mirror or parity spaces, to be distributed across the nodes.

Effectively, this emulates the configuration in the following diagram:

Configuration 1: Single Hyper-V Server (or Client)

The simplest configuration is one machine hosting all of the VMs used for the S2D system.  In my case, a Windows Server 2016 Technical Preview 2 (TP2) system running on a desktop-class machine with 16GB of RAM and a modern 4-core processor.

The VMs are configured identically. I have one virtual switch connected to the host's network, which goes out to the world for clients to connect, and I created a second virtual switch set to Internal network to provide another network path for S2D to utilize between the VMs.

The configuration looks like the following diagram:

Hyper-V Host Configuration

-         Configure the virtual switches: Configure a virtual switch connected to the machine’s physical NIC, and another virtual switch configured for internal only.

Example: Two virtual switches. One configured to allow network traffic out to the world, which I labeled “Public”.  The other is configured to only allow network traffic between VMs configured on the same host, which I labeled “InternalOnly”.
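Here is a minimal sketch of creating those two switches with PowerShell (the switch names and the physical adapter name are examples from my setup):

New-VMSwitch -Name 'Public' -NetAdapterName 'Ethernet' -AllowManagementOS $true
New-VMSwitch -Name 'InternalOnly' -SwitchType Internal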

 

VM Configuration

-         Create four or more Virtual Machines

  • Memory: If using Dynamic Memory, the default of 1024 Startup RAM will be sufficient.  If using Fixed Memory you should configure 4GB or more.
  • Network:  Configure each VM with two network adapters.  One connected to the virtual switch with the external connection, the other connected to the virtual switch that is configured for internal only.
    • It’s always recommended to have more than one network, each connected to separate virtual switches so that if one stops flowing network traffic, the other(s) can be used and allow the cluster and Storage Spaces Direct system to remain running. 
  • Virtual Disks: Each VM needs a virtual disk that is used as a boot/system disk, and two or more virtual disks to be used for Storage Spaces Direct.
    • Disks used for Storage Spaces Direct must be connected to the VMs virtual SCSI Controller.
    • Like all other systems, each boot/system disk needs a unique SID, meaning the OS needs to be installed from ISO or another install method; if using a duplicated VHDx, it needs to have been generalized (for example using Sysprep.exe) before the copy was made.
    • VHDx type and size:  You need at least eight VHDx files (four VMs with two data VHDx each).  The data disks can be either “dynamically expanding” or “fixed size”.  If you use fixed size, set the size to 8GB or more and calculate the combined size of the VHDx files so that you don't exceed the storage available on your system.

Example:  The following is the Settings dialog for a VM that is configured to be part of an S2D system on one of my Hyper-V hosts.  It's booting from the Windows Server TP2 VHD that I downloaded from Microsoft's external download site, which is connected to IDE Controller 0 (this had to be a Gen1 VM since the TP2 file that I downloaded is a VHD and not a VHDx). I created two VHDx files to be used by S2D, and they are connected to the SCSI Controller.  Also note the VM is connected to the Public and InternalOnly virtual switches.
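If you prefer to script the VM creation, here is a rough sketch for one VM under the assumptions above (Gen1, boot VHD on IDE, two dynamically expanding data VHDx files on the SCSI controller, two network adapters); the names, paths, and sizes are examples:

New-VM -Name 'S2D-VM1' -Generation 1 -MemoryStartupBytes 4GB -VHDPath 'C:\VMs\S2D-VM1\Boot.vhd' -SwitchName 'Public'
Add-VMNetworkAdapter -VMName 'S2D-VM1' -SwitchName 'InternalOnly'
1..2 | ForEach-Object {
    $vhd = "C:\VMs\S2D-VM1\Data$_.vhdx"
    New-VHD -Path $vhd -SizeBytes 8GB -Dynamic | Out-Null
    Add-VMHardDiskDrive -VMName 'S2D-VM1' -ControllerType SCSI -Path $vhd
}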

 

 

Note: Do not enable the virtual machine’s Processor Compatibility setting.  This setting disables certain processor capabilities that S2D requires inside the VM. This option is unchecked by default, and needs to stay that way.  You can see this setting here:

 

Guest Cluster Configuration

Once the VMs are configured, creating and managing the S2D system inside the VMs is almost identical to the steps for supported physical hardware:

  1. Start the VMs
  2. Configure the Storage Spaces Direct system, using the “Installation and Configuration” section of the guide linked here: Storage Spaces Direct Experience and Installation Guide
    1. Since this is in VMs using only VHDx files as storage, there is no SSD or other faster media to allow tiers.  Therefore, skip the steps that enable or configure tiers.  A sketch of the guest-cluster steps is shown below.
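As a sketch, the guest-cluster side looks roughly like this when run inside the VMs (the VM and cluster names are examples; the S2D enablement cmdlet shown, Enable-ClusterStorageSpacesDirect, is the released Windows Server 2016 name, so follow the linked guide for the exact steps on Technical Preview builds):

Test-Cluster -Node S2D-VM1, S2D-VM2, S2D-VM3, S2D-VM4
New-Cluster -Name S2DGuest -Node S2D-VM1, S2D-VM2, S2D-VM3, S2D-VM4 -NoStorage
Enable-ClusterStorageSpacesDirect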

Configuration 2: Two or more Hyper-V Servers

You may not have a single machine with enough resources to host all four VMs, or you may already have a Hyper-V host cluster to deploy on, or more than one Hyper-V server that you want to spread the VMs across.  Here is a diagram showing a configuration spread across two nodes, as an example:

 

This configuration is very similar to the single host configuration.  The differences are:

 

Hyper-V Host Configuration

-         Virtual Switches:  Each host is recommended to have a minimum of two virtual switches for the VMs to use.  They need to be connected externally to different NICs on the systems.  One can be on a network that is routed to the world for client access and the other can be on a network that is not externally routed, or they can both be on externally routed networks.  You can choose to use a single network, but then all the client traffic and S2D traffic share the same bandwidth, and there is no redundancy to keep the S2D VMs connected if that single network goes down. However, since this is for testing and verification of S2D, you don't have the network-loss resiliency requirements that we strongly suggest for production deployments.

Example:  On this system I have an internal 10/100 Intel NIC and a dual-port Pro/1000 1Gb card. All three NICs have virtual switches. I labeled one Public and connected it to the 10/100 NIC, since my connection to the rest of the world is through a 100Mb infrastructure.  I then have the 1Gb NICs connected to two different 1Gb desktop switches, which provides my hosts with two network paths between each other for S2D to use. As noted, three networks is not a requirement, but I have this available on my hosts so I use them all.

VM Configuration

-         Network: If you choose to have a single network, then each VM will only have one network adapter in its configuration.

Example: Below is a snip of a VM configuration on my two host configuration. You will note the following:

-         Memory:  I have this configured with 4GB of RAM instead of dynamic memory.  It was a choice since I have enough memory resources on my nodes to dedicate memory.

-         Boot Disk:  The boot disk is a VHDx, so I was able to use a Gen2 VM. 

-         Data Disks: I chose to configure four data disks per VM.  The minimum is two; I wanted to try four. All VHDx files are configured on the SCSI Controller (which is the only option in Gen2 VMs).

-         Network Adapters:  I have three adapters, each connected to one of the three virtual switches on the host to utilize the available network bandwidth that my hosts provide.

General Suggestions:

-         Network.  Since the network between the VMs transports the redundant data for mirror and parity spaces, the bandwidth and latency of the network will be a significant factor in the performance of the system.  Keep this in mind as you experience the system in the test configurations.

-         VHDx location optimization.  If you have a Storage Space that is configured for a three-way mirror, then the writes will be going to three separate disks (implemented as VHDx files on the hosts), each on a different node of the cluster. Distributing the VHDx files across disks on the Hyper-V hosts will provide better response to the I/Os.  For instance, if you have four disks or CSV volumes available on the Hyper-V hosts, and four VMs, then put the VHDx files for each VM on a separate disk (VM1 using CSV Volume 1, VM2 using CSV Volume 2, etc.). 

FAQ:

How does this differ from what I can do in VMs with Shared VHDx?

Shared VHDx remains a valid and recommended solution to provide shared storage to a guest cluster (cluster running inside of VMs).  It allows a VHDx to be accessed by multiple VMs at the same time in order to provide clustered shared storage.  If any nodes (VMs) fail, the others have access to the VHDx and the clustered roles using the storage in the VMs can continue to access their data.

S2D allows clustered roles access to clustered storage spaces inside of the VMs without provisioning shared VHDx on the host.  With S2D, you can provision VMs with a boot/system disk and then two or more extra VHDx files configured for each VM.  You then create a cluster inside of the VMs, configure S2D and have resilient clustered Storage Spaces to use for your clustered roles inside the VMs. 

References

Storage Spaces Direct Experience and Installation Guide

 

Microsoft Virtual Academy – Learn Failover Clustering & Hyper-V


Would you like to learn how to deploy, manage, and optimize a Windows Server 2012 R2 failover cluster?  The Microsoft Virtual Academy is a free training website for IT Pros with over 2.7 million students.  This technical course can teach you everything you want to know about Failover Clustering and Hyper-V high-availability and disaster recovery, and you don’t even need prior clustering experience!  Start today: http://www.microsoftvirtualacademy.com/training-courses/failover-clustering-in-windows-server-2012-r2.

Join clustering experts Symon Perriman (VP at 5nine Software and former Microsoft Technical Evangelist) and Elden Christensen (Principal Program Manager Lead for Microsoft’s high-availability team) to explore the basic requirements for a failover cluster and how to deploy and validate it. Learn how to optimize the networking and storage configuration, and create a Scale-Out File Server. Hear the best practices for configuring and optimizing highly available Hyper-V virtual machines (VMs), and explore disaster recovery solutions with both Hyper-V Replica and multi-site clustering. Next look at advanced administration and troubleshooting techniques, then learn how System Center 2012 R2 can be used for large-scale failover cluster management and optimization.

This full day of training includes the following modules:

  1. Introduction to Failover Clustering
  2. Cluster Deployment and Upgrades
  3. Cluster Networking
  4. Cluster Storage & Scale-Out File Server
  5. Hyper-V Clustering
  6. Multi-Site Clustering & Scale-Out File Server
  7. Advanced Cluster Administration & Troubleshooting
  8. Managing Clusters with System Center 2012 R2

Learn everything you need to know about Failover Clustering on the Microsoft Virtual Academy: http://www.microsoftvirtualacademy.com/training-courses/failover-clustering-in-windows-server-2012-r2

Virtual Machine Compute Resiliency in Windows Server 2016


In today’s cloud-scale environments, commonly comprised of commodity hardware, transient failures have become more common than hard failures. In these circumstances, reacting aggressively to handle transient failures can cause more downtime than it prevents. Windows Server 2016 therefore introduces increased Virtual Machine (VM) resiliency to intra-cluster communication failures in your compute cluster.

Interesting Transient Failure Scenarios

The following are some potentially transient scenarios where it would be beneficial for your VM to be more resilient to intra-cluster communication failures:  

  • Node disconnected: The cluster service attempts to connect to all active nodes. The disconnected (Isolated) node cannot talk to any node in an active cluster membership.
  • Cluster Service crash: The Cluster Service on a node is down. The node is not communicating with any other node.
  • Asymmetric disconnect: The Cluster Service is attempting to connect to all active nodes. The isolated node can talk to at least one node in active cluster membership.

New Failover Clustering States

In Windows Server 2016, three new states have been introduced to reflect the new Failover Cluster workflow in the event of transient failures:

  • A new VM state, Unmonitored, has been introduced in Failover Cluster Manager to reflect a VM that is no longer monitored by the cluster service.

  • Two new cluster node states have been introduced to reflect nodes which are not in active membership but were host to VM role(s) before being removed from active membership: 
    • Isolated:
      • The node is no longer in an active membership
      • The node continues to host the VM role

    • Quarantine:
      • The node is no longer allowed to join the cluster for a fixed time period (default: 2 hours)
      • This action prevents flapping nodes from negatively impacting other nodes and the overall cluster health
      • By default, a node is quarantined if it ungracefully leaves the cluster three times within an hour
      • VMs hosted by the node are gracefully drained once it is quarantined
      • No more than 20% of nodes can be quarantined at any given time

    • The node can be brought out of quarantine by running the Failover Clustering PowerShell cmdlet Start-ClusterNode with the -CQ or -ClearQuarantine flag.
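For example (the node name is an example):

Start-ClusterNode -Name Node3 -ClearQuarantine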

VM Compute Resiliency Workflow in Windows Server 2016

The VM resiliency workflow in a compute cluster is as follows:

  • In the event of a “transient” intra-cluster communication failure, on a node hosting VMs, the node is placed into an Isolated state and removed from its active cluster membership. The VM on the node is now considered to be in an Unmonitored state by the cluster service.
    • File Storage backed (SMB): The VM continues to run in the Online state.
    • Block Storage backed (FC / FCoE / iSCSI / SAS): The VM is placed in the Paused Critical state. This is because the isolated node no longer has access to the Cluster Shared Volumes in the cluster.
    • Note that you can monitor the “true” state of the VM using the same tools as you would for a stand-alone VM (such as Hyper-V Manager).

  • If the isolated node continues to experience intra-cluster communication failures, after a certain period (default of 4 minutes), the VM is failed over to a suitable node in the cluster, and the node is now moved to a Down state.
  • If a node is isolated 3 times within an hour, it is placed into a Quarantine state for a certain period (default two hours) and all the VMs from the node are drained to a suitable node in the cluster.

Configuring Node Isolation and Quarantine settings

To achieve the desired Service Level Agreement guarantees for your environment, you can configure the following cluster settings, controlling how your node is placed in isolation or quarantine:

 

Setting: ResiliencyLevel
Description: Defines how unknown failures are handled.
Default: 2
Values:
  1 – Allow the node to be placed in the Isolated state only if the node gave a notification and it went away for a known reason; otherwise fail over immediately. Known reasons include a Cluster Service crash or asymmetric connectivity between nodes.
  2 – Always let a node go to the Isolated state and give it time before taking over ownership of the VMs.
PowerShell:
  (Get-Cluster).ResiliencyLevel = <value>

Setting: ResiliencyPeriod
Description: Duration to allow the VM to run isolated (in seconds).
Default: 240
Values:
  0 – Reverts to pre-Windows Server 2016 behavior.
PowerShell:
  Cluster property: (Get-Cluster).ResiliencyDefaultPeriod = <value>
  Resource property for granular control: (Get-ClusterGroup “My VM”).ResiliencyPeriod = <value>
  A value of -1 for the resource property causes the cluster property to be used.

Setting: QuarantineDuration
Description: Duration to disallow a cluster node from joining (in seconds).
Default: 7200
Values:
  0xFFFFFFFF – Never allow the node to join.
PowerShell:
  (Get-Cluster).QuarantineDuration = <value>
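For example, a sketch that applies custom values using the properties above (the values and VM name are arbitrary examples):

(Get-Cluster).ResiliencyLevel = 2
(Get-Cluster).ResiliencyDefaultPeriod = 120
(Get-Cluster).QuarantineDuration = 3600
(Get-ClusterGroup "My VM").ResiliencyPeriod = 300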

Cluster Shared Volume - A Systematic Approach to Finding Bottlenecks


In this post we will discuss how to find if performance that you observe on a Cluster Shared Volume (CSV) is what you expect and how to find which layer in your solution may be the bottleneck. This blog assumes you have read the previous blogs in the CSV series (see the bottom of this blog for links to all the blogs in the series). 

Sometimes someone asks why the CSV performance they observe does not match their expectations and how to investigate. The answer is that CSV consists of multiple layers, and the most straightforward troubleshooting approach is a process of elimination: first remove all the layers and test the speed of the disk itself, then add the layers back one by one until you find the one causing the issue.

You might be tempted to use file copy as a quick way to test performance. While file copy is an important workload, it is not the best way to test your storage performance. Review this blog, which goes into more detail on why it does not work well: http://blogs.technet.com/b/josebda/archive/2014/08/18/using-file-copy-to-measure-storage-performance-why-it-s-not-a-good-idea-and-what-you-should-do-instead.aspx. It is still important to understand the file copy performance you can expect from your storage, so I would suggest running file copy tests after you are done with the micro-benchmarks, as part of workload testing.

To test performance you can use DiskSpd that is described in this blog post http://blogs.technet.com/b/josebda/archive/2014/10/13/diskspd-powershell-and-storage-performance-measuring-iops-throughput-and-latency-for-both-local-disks-and-smb-file-shares.aspx.

When selecting the size of the file you will run the tests on, be aware of the caches and tiers in your storage. For instance, a storage array might have a cache on NVRAM or NVMe. All writes that go to the fast tier might be very fast, but once you have used up all the space in the cache you will drop to the speed of the next, slower tier. If your intention is to test the cache, then create a file that fits into the cache; otherwise, create a file that is larger than the cache.

Some LUNs might have some offsets mapped to SSDs while others map to HDDs. An example would be tiered space. When creating a file be aware what tier the blocks of the files are located on.

Additionally, when measuring performance, do not assume that if you have created two LUNs with similar characteristics you will get identical performance. If the LUNs are laid out on the physical spindles in a different way, that might be enough to cause completely different performance behavior. To avoid surprises, as you run tests through the different layers (described below), ALWAYS use the same LUN. Several times we have seen cases where someone ran tests against one LUN, then ran tests over CSVFS with another LUN believed to be similar, observed worse results in the CSVFS case, and incorrectly concluded that CSVFS was the problem. In the end, removing the disk from CSV and running the test directly on the LUN showed that the two LUNs had different performance.

The sample numbers you will see in this post were collected on a 2-node cluster:

CPU: Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, Intel64 Family 6 Model 45 Stepping 7, GenuineIntel, 2 NUMA nodes with 8 cores each, Hyperthreading disabled.
RAM: 32 GB DDR3.
Network: one RDMA Mellanox ConnectX-3 IPoIB adapter (54 Gbps) and one Intel(R) I350 Gigabit network adapter.
Disk: the shared disk is a single HDD connected using SAS, model HP EG0300FBLSE, firmware version HPD6. Disk cache is disabled.

 
 
  
With this hardware my expectation is that the disk should be the bottleneck, and going over the network should not have any impact on throughput.

In the samples you will see below I was running a single-threaded test application, which at any time was keeping eight 8K outstanding IOs on the disk; a sketch of a matching DiskSpd invocation is shown below. In your tests you might want to add more variations with different queue depths, different IO sizes, and different numbers of threads/CPU cores utilized. To help, I have provided the table below, which outlines some tests to run and data to capture to get a more exhaustive picture of your disk performance. Running all these variations may take several hours. If you know the IO patterns of your workloads, you can significantly reduce the test matrix.
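As a sketch, DiskSpd command lines along these lines approximate that workload (8K IO, one thread, 8 outstanding IOs, unbuffered with write-through); the target path, file size, duration, and the exact caching switch are assumptions to adjust for your DiskSpd version and storage:

.\diskspd.exe -b8K -t1 -o8 -h -w0 -d60 -L -c20G T:\testfile.dat      # sequential read
.\diskspd.exe -b8K -t1 -o8 -h -w100 -d60 -L T:\testfile.dat          # sequential write
.\diskspd.exe -b8K -t1 -o8 -h -r -w0 -d60 -L T:\testfile.dat         # random read
.\diskspd.exe -b8K -t1 -o8 -h -r -w100 -d60 -L T:\testfile.dat       # random write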

 

 

 

The full test matrix (all runs unbuffered with write-through; record IOPS and MB/sec for each combination):

IO sizes:       4K, 8K, 16K, 64K, 128K, 256K, 512K, 1MB
IO patterns:    sequential read, sequential write, random read, random write, random 70% reads / 30% writes
Queue depths:   1, 4, 16, 32, 64, 128, 256

If you have Storage Spaces then it might be useful to first collect performance numbers for the individual disks the Space will be created from. This will help set expectations for the best-case and worst-case performance you should expect from the Space.

As you are testing the individual spindles that will be used to build Storage Spaces, pay attention to the different MPIO (Multi-Path IO) modes. For instance, you might expect that round robin over multiple paths would be faster than failover, but for some HDDs you might find that they give you better throughput with failover than with round robin. When it comes to a SAN, the MPIO considerations are different: in that case MPIO is between the computer and a controller in the SAN storage box. In the case of Storage Spaces, MPIO is between the computer and the HDD, so it comes down to how efficiently the HDD's firmware handles IO arriving from different paths. In production, for a JBOD connected to multiple computers, IO will be coming from different computers, so in any case the HDD firmware needs to handle IOs coming from multiple computers/paths efficiently. As with any kind of performance testing, you should not jump to the conclusion that a particular MPIO mode is good or bad; always test first.

Another commonly discussed topic is what the file system allocation unit size (a.k.a. cluster size) should be. There are a variety of options between 4K and 64K.

 

As with anything, there is no silver bullet. Ideally for any particular storage you need to run tests with all block sizes and pick the one that gives you the best throughput. You also can have a discussion with your storage vendor for their recommendations. Here we will go over several considerations.

  1. File system fragmentation. If for the moment we set aside the storage underneath the file system and look only at the file system layer by itself, then:
    1. Smaller blocks mean better space utilization on the disk, because if your file is only 1K then with a 64K cluster size this file will consume 64K on the disk, while with a 4K cluster size it will consume only 4K, and you can fit (64/4) sixteen 1K files in 64K. If you have lots of small files then a small cluster size might be a good choice.
    2. On the other hand, if you have large files that are growing, then a smaller cluster size means more fragmentation. For instance, in the worst case a 1 GB file with 4K clusters might have up to (1024x1024/4) 262,144 fragments (a.k.a. runs), while with 64K clusters it will have only (1024x1024/64) 16,384 fragments. So why does fragmentation matter?
      1. If you are constrained on RAM you may care more, as more fragments means more RAM is needed to track all this metadata.
      2. If your workload generates IO larger than the cluster size, you do not run defrag frequently enough, and you consequently have lots of fragments, then the workload's IO might need to be split more often when the cluster size is smaller. For instance, if on average the workload generates 32K IOs, then in the worst case with a 4K cluster size an IO might need to be split into (32/4) eight 4K IOs to the volume, while with a 64K cluster size it would never get split. Why does splitting matter? A production workload is usually close to random IO, but the larger the blocks are, the larger the throughput you will see on average, so ideally we should avoid splitting IO when it is not necessary.
      3. If you are using storage copy offload, some storage boxes support it only at 64K granularity and it will fail if the cluster size is smaller. You need to check with your storage vendor.
      4. If you anticipate lots of large file-level trim commands (trim is the file system counterpart of the storage block UNMAP). You might care about trim if you are using a thinly provisioned LUN or if you have SSDs; SSD garbage collection logic in the firmware benefits from knowing that certain blocks are no longer used by the workload. For example, let's assume we have a VHDX with NTFS inside, and this VHDX file itself is very fragmented. When you run defrag on the NTFS inside the VHDX (most likely inside the VM), then among other steps defrag will do free space consolidation and then issue a file-level trim to re-trim the free blocks. If there is lots of free space this might be a trim for a very large range. This trim will come to the NTFS that hosts the VHDX, which then needs to translate this large file trim into a block unmap for each fragment of the file. If the file is highly fragmented this may take a significant amount of time. A similar scenario might happen when you delete a large file or lots of files at once.
      5. The list above is not exhaustive by any means; I am focusing on what I view as the most relevant considerations.
      6. From the file system perspective, the rule of thumb would be to prefer a larger cluster size unless you are planning to have lots of tiny files and the disk space saving from the smaller cluster size is important. No matter what cluster size you choose, you will be better off periodically running defrag. You can monitor how much fragmentation is affecting your workload by looking at the CSV File System Split IO and PhysicalDisk Split IO performance counters.
  2. File system block alignment and storage block alignment. When you create a LUN on a SAN, or a Storage Space, it may be created out of multiple disks with different performance characteristics. For instance, a mirrored space (http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-storage-for-scale-resiliency-and-efficiency.aspx) would contain slabs on many disks; some slabs act as mirrors, and then the entire space address range is subdivided into 64K blocks and round-robined across these slabs on different disks in a RAID0 fashion to give you the better aggregated throughput of multiple spindles.

 

This means that if you have a 128K IO it will have to be split into two 64K IOs that go to different spindles. What if your file system is formatted with a cluster size smaller than 64K? That means a contiguous block in the file system might not be 64K aligned. For example, if the file system is formatted with 4K clusters and we have a file that is 128K, then the file can start at any 4K boundary. If my application performs a 128K read, it is possible this 128K block will map to up to three 64K blocks on the storage space.

 

If you format your file system with a 64K cluster size, then file allocations are always 64K aligned and on average you will see fewer IOPS on the spindles.  The performance difference will be even larger when it comes to writes to parity, RAID5, or RAID6-like LUNs. When you are overwriting part of a block, the storage has to do a read-modify-write, multiplying the number of IOPS hitting your spindles; if you overwrite the entire block, it will be exactly one IO.  If you want to be accurate, you need to evaluate the average block size you expect your workload to produce. If it is larger than 4K, then you want the file system cluster size to be at least as large as your average IO size so that on average it does not get split at the storage layer.  A rule of thumb might be to simply use the same cluster size as the block size used by the storage layer.  My recommendation is to always consult your storage vendor for advice.  Modern storage arrays have very sophisticated tiering and load balancing logic, and unless you understand everything about how your storage box works you might end up with unexpected results. Alternatively, you can run a variety of performance tests with different cluster sizes and see which one gives you better results. If you do not have time to do that, then opting for a larger cluster size is a safer bet.

Performance of HDD/SSD might change after updating disk or storage box firmware so it might save you time if you rerun performance tests after update.

As you are running the tests you can use the performance counters described here http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx to get further insight into the behavior of each layer by monitoring average queue depth, latency, throughput, and IOPS at the CSV, SMB, and physical disk layers. For instance, if your disk is the bottleneck, then latency and queue depth at all of these layers will be the same. Once you see that queue depth and latency at a higher layer are above what you see on the disk, that layer might be the bottleneck.

 

Run performance tests only on the hardware that is currently not used by any other workloads/tests otherwise your results may not be valid because of too much variability. You also might want to rerun each variation several times to make sure there is no variability.

Baseline 1 – No CSV; Measure Performance of NTFS

In this case IO has to traverse the NTFS file system and disk stack in the OS, so conceptually we can represent it this way:

 

For most disks, expectations are that sequential read >= sequential write >= random read >= random write. For an SSD you may observe no difference between random and sequential while for HDD the difference may be significant. Differences between read and write will vary from disk to disk.

As you are running this test keep an eye out if you are saturating CPU. This might happen when your disk is very fast. For instance if you are using Simple Space backed by 40 SSDs.

Run the baseline tests multiple times. If you see variance at this level then most likely it is coming from the disk, and it will affect the other tests as well.  Below you can see the numbers I've collected on my hardware; the results match expectations.

 

Baseline 1: NTFS, 8K IO, queue depth 8, unbuffered write-through

                     IOPS     MB/sec
sequential read      19906    155
sequential write     17311    135
random read          359      2
random write         273      2

 

Baseline 2 - No CSV; Measure SMB Performance between Cluster Nodes

To run this test, online the clustered disk on one cluster node and assign it a drive letter, for example K:. Run the test from another node over SMB using an admin share; for instance, your path might look like \\Node1\K$. In this case IO has to go over the following layers:

 

You need to be aware of SMB Multichannel and make sure that you are using only the NICs that you expect the cluster to use for intra-node traffic. You can read more about SMB Multichannel in a clustered environment in this blog post: http://blogs.msdn.com/b/emberger/archive/2014/09/15/force-network-traffic-through-a-specific-nic-with-smb-multichannel.aspx

If you have an RDMA network, or your disk is slower than what SMB can pump through all channels and you have a sufficiently large queue depth, then you might see Baseline 2 come close or even equal to Baseline 1. That means your bottleneck is the disk, and not the network.

Run the baseline test several times. If you see variance at this level then most likely it is coming from the disk or the network, and it will affect the other tests as well. Assuming you've already sorted out the variance coming from the disk while you were collecting Baseline 1, you should now focus on variance caused by the network.

Here are the numbers I’ve collected on my hardware. To make it easier for you to compare I am repeating Baseline 1 numbers here.

 

Baseline 2: SMB to remote NTFS, 8K IO, queue depth 8, unbuffered write-through (Baseline 1 repeated for comparison)

                     Baseline 2          Baseline 1
                     IOPS     MB/sec     IOPS     MB/sec
sequential read      19821    154        19906    155
sequential write     810      6          17311    135
random read          353      2          359      2
random write         272      2          273      2

 

In my case I have verified that IO is going over RDMA, and the network indeed adds almost no latency, but there is a difference in sequential write IOPS compared to Baseline 1, which seems odd. First, I looked at the performance counters:

Physical disk performance counters for Baseline 1

 

Physical disk and SMB Server Share performance counters for Baseline 2

 

SMB Client Share and SMB Direct Connection performance counters for Baseline 2

 

Observe that in both cases PhysicalDisk\Avg. Disk Queue Length is the same. That tells us SMB does not queue IO, and the disk has all the pending IOs all the time. Second, observe that PhysicalDisk\Avg. Disk sec/Transfer in Baseline 1 is 0 while in Baseline 2 it is 10 milliseconds. Huh! This tells me that the disk got slower because the requests came over SMB!?

The next step was to record a trace using the Windows Performance Toolkit (http://msdn.microsoft.com/en-us/library/windows/hardware/hh162962.aspx) with Disk IO for both Baseline 1 and Baseline 2. Looking at the traces, I noticed that the disk service time for some reason got longer for Baseline 2! I also noticed that when requests were coming from SMB they hit the disk from two threads, while with my test utility all requests were issued from a single thread. Remember that we are investigating sequential write. Even though the test running over SMB issues all writes from one thread in sequential order, SMB on the server was dispatching these writes to the disk using two threads, and sometimes the writes would get reordered. Consequently, the IOPS I am getting for sequential write are close to random write. To verify that, I reran the Baseline 1 test with two threads, and bingo! I got matching numbers.

Here is what you would see in WPA for IO over SMB.

 

Average disk service time is about 8.1 milliseconds, and IO time is about 9.6 milliseconds. The green and violet colors correspond to IOs issued by different threads. If you look closely (expand the table, remove Thread Id from the grouping, and sort by Init Time) you can see how the IOs are interleaving and Min Offset is not strictly sequential:

 

Without SMB, all IOs came from one thread; disk service time is about 600 microseconds, and IO time is about 4 milliseconds.

 

If you expand and sort by Init Time you will see Min Offset is strictly increasing

 

In production in most of the cases you will have workload that is close to random IO, and sequential IO is only giving you a theoretical best case scenario.

The next interesting question is why we do not see a similar degradation for sequential read. The theory is that for reads the disk might be reading the entire track and keeping it in its cache, so even when reads are rearranged the track is already in the cache and reads on average are not affected. Since the disk cache is disabled for writes, they always have to hit the spindle and more often pay the seek cost.

Baseline 3 - No CSV; Measure SMB Performance between Compute Nodes and Cluster Nodes

If you are planning to run workload and storage on the same set of nodes then you can skip this step. If you are planning to disaggregate workload and storage and access storage using a Scale Out File Server (SoFS) then you should run the same test as Baseline 2, just in this case select a compute node as a client, and make sure that over network you are using the NICs that will be used to handle compute to storage traffic once you create the cluster.

Remember that for reliability reasons files opened over SOFS are always opened with write-through, so we suggest always adding write-through to your tests. As an option, you can create a classic singleton (non-SOFS) file server over a clustered disk, create a Continuously Available share on that file server, and run your test there. That makes sure traffic goes only over networks marked in the cluster as public, and because this is a CA share all opens will be write-through.

Layers diagram and performance considerations in this case is exactly the same as in case of Baseline 2.

CSVFS Case 1 - CSV Direct IO

Now add the disk to CSV.
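A sketch of that step in PowerShell (the resource name is an example); Get-ClusterSharedVolumeState lets you confirm which IO mode the volume is using on each node before you run the tests:

Add-ClusterSharedVolume -Name 'Cluster Disk 1'
Get-ClusterSharedVolumeState -Name 'Cluster Disk 1' | Format-Table Node, StateInfo, FileSystemRedirectedIOReason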

You can run the same test on the coordinating node and on a non-coordinating node and you should see the same results; the numbers should match Baseline 1. The length of the code path is the same, just with CSVFS instead of NTFS. The following diagram represents the layers IO will go through:

 

Here are the numbers I’ve collected on my hardware. To make it easier to compare, I am repeating the Baseline 1 numbers here.

On the coordinating node:

 

 

CSV Direct IO, coordinating node: 8K IO, queue depth 8, unbuffered write-through

                     CSVFS               Baseline 1
                     IOPS     MB/sec     IOPS     MB/sec
sequential read      19808    154        19906    155
sequential write     17590    137        17311    135
random read          356      2          359      2
random write         273      2          273      2

 

On a non-coordinating node:

CSV Direct IO, non-coordinating node: 8K IO, queue depth 8, unbuffered write-through

                     CSVFS               Baseline 1
                     IOPS     MB/sec     IOPS     MB/sec
sequential read      19793    154        19906    155
sequential write     17788    138        17311    135
random read          359      2          359      2
random write         273      2          273      2

 

CSVFS Case 2 - CSV File System Redirected IO on Coordinating Node

In this case we are not traversing network, but we do traverse 2 file systems.  If you are disk bound you should see numbers matching Baseline 1.  If you have very fast storage and you are CPU bound then you will saturate CPU a bit faster and will be about 5-10% below Baseline 1.

 

Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

 

CSV File System Redirected IO, coordinating node: 8K IO, queue depth 8, unbuffered write-through

                     CSVFS               Baseline 1          Baseline 2
                     IOPS     MB/sec     IOPS     MB/sec     IOPS     MB/sec
sequential read      19807    154        19906    155        19821    154
sequential write     5670     44         17311    135        810      6
random read          354      2          359      2          353      2
random write         271      2          273      2          272      2

 

It looks like some IO reordering is happening in this case too, so the sequential write numbers fall somewhere between Baseline 1 and Baseline 2. All other numbers line up perfectly with expectations.

CSVFS Case 3 - CSV File System Redirected IO on Non-Coordinating Node

You can put a CSV into file system redirected mode using the Failover Cluster Manager UI

 

Or by using the PowerShell cmdlet Suspend-ClusterResource with the –RedirectedAccess parameter, as sketched below.
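
A minimal sketch, assuming the CSV resource is named "Cluster Disk 1" (substitute your own name); remember to resume the resource to return to direct IO when the test is done:

# Put the CSV into file system redirected mode
Suspend-ClusterResource -Name "Cluster Disk 1" -RedirectedAccess -Force

# ...run the tests...

# Return the CSV to normal (direct) IO
Resume-ClusterResource -Name "Cluster Disk 1"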

This is the longest IO path: we are not only traversing two file systems but also going over SMB and the network. If you are network bound, your numbers should be close to Baseline 2. If your network is very fast and your bottleneck is storage, the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound, the numbers should be 10-15% below Baseline 1.

 

Here are the numbers I got on my hardware. To make it easier to compare, I am repeating the Baseline 1 and Baseline 2 numbers here.

 

 

 

 

Unbuffered Write-Through, 8K, Queue Depth 8 | CSV File System Redirected IO (non-coordinating node) | Baseline 1 | Baseline 2
sequential read, IOPS                       | 19793                                                 | 19906      | 19821
sequential read, MB/Sec                     | 154                                                   | 155        | 154
sequential write, IOPS                      | 835                                                   | 17311      | 810
sequential write, MB/Sec                    | 6                                                     | 135        | 6
random read, IOPS                           | 352                                                   | 359        | 353
random read, MB/Sec                         | 2                                                     | 2          | 2
random write, IOPS                          | 273                                                   | 273        | 272
random write, MB/Sec                        | 2                                                     | 2          | 2

 

In my case the numbers match Baseline 2 and, in all cases except sequential write, are close to Baseline 1.

CSVFS Case 4 - CSV Block Redirected IO on Non-Coordinating Node

If you have a SAN, you can use LUN masking to hide the LUN from the node where you will run this test. If you are using Storage Spaces, a Mirrored Space is always attached only on the coordinator node, and any non-coordinator node will be in block redirected mode as long as the tiering heatmap is not enabled on this volume. See this blog post for more details on how Storage Spaces tiering affects the CSV IO mode: http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Please note that CSV never uses block redirected IO on the coordinator node: since the disk is always attached there, CSV always uses direct IO. So remember to run this test on a non-coordinating node. If you are network bound, your numbers should be close to Baseline 2. If your network is very fast and your bottleneck is storage, the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound, the numbers should be about 10-15% below Baseline 1. You can confirm which IO mode each node is using with the sketch below.
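
A quick way to verify the IO mode per node is the Get-ClusterSharedVolumeState cmdlet (available in Windows Server 2012 R2 and later); the volume name is a placeholder:

# Shows, per node, whether the volume is in Direct, File System Redirected, or Block Redirected mode and why
Get-ClusterSharedVolumeState -Name "Cluster Disk 1" | Format-Table Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason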

 

Here are the numbers I got on my hardware. To make it easier to compare, I am repeating the Baseline 1 and Baseline 2 numbers here.

 

 

 

 

Unbuffered Write-Through, 8K, Queue Depth 8 | CSV Block Redirected IO (non-coordinating node) | Baseline 1 | Baseline 2
sequential read, IOPS                       | 19773                                           | 19906      | 19821
sequential read, MB/Sec                     | 154                                             | 155        | 154
sequential write, IOPS                      | 820                                             | 17311      | 810
sequential write, MB/Sec                    | 6                                               | 135        | 6
random read, IOPS                           | 352                                             | 359        | 353
random read, MB/Sec                         | 2                                               | 2          | 2
random write, IOPS                          | 274                                             | 273        | 272
random write, MB/Sec                        | 2                                               | 2          | 2

 

In my case the numbers match Baseline 2 and are very close to Baseline 1.

Scale-out File Server (SoFS)

To test Scale-Out File Server, create the SoFS resource using Failover Cluster Manager or PowerShell, and add a share that maps to the same CSV volume you have been using for the tests so far. Now your baselines are the CSVFS cases. With SoFS, SMB delivers IO to CSVFS on the coordinating or a non-coordinating node (depending on where the client is connected; use the PowerShell cmdlet Get-SmbWitnessClient to learn client connectivity), and it is then up to CSVFS to deliver the IO to the disk. The path CSVFS takes is predictable, but it depends on the nature of your storage and current connectivity, so select your baseline from CSV Cases 1 - 4. A scripted sketch follows.
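
For illustration, a minimal sketch of creating the SoFS role, adding a share on the CSV volume, and checking where clients are connected; the role name, share name, folder path, and account are placeholders:

# Create the Scale-Out File Server role (run on a storage cluster node)
Add-ClusterScaleOutFileServerRole -Name "MySoFS"

# Create a folder on the CSV volume and share it through the SoFS name
New-Item -ItemType Directory -Path "C:\ClusterStorage\Volume1\Shares\Perf"
New-SmbShare -Name "PerfShare" -Path "C:\ClusterStorage\Volume1\Shares\Perf" -ScopeName "MySoFS" -FullAccess "CONTOSO\PerfUser"

# See which cluster node each SMB client is connected to
Get-SmbWitnessClient | Format-Table ClientName, FileServerNodeName, ShareName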

If the numbers are similar to the CSV baseline, you know that SMB above CSV is not adding overhead, and you can look at the numbers collected for the CSV baseline to determine where the bottleneck is. If the numbers are lower than the CSV baseline, your client network is the bottleneck, and you should validate that the difference matches the difference between Baseline 3 and Baseline 1.

 

Summary

In this blog post we looked at how to tell whether CSVFS read and write performance is at expected levels. You do that by running performance tests before and after adding the disk to CSV: use the 'before' numbers as your baselines, then add the disk to CSV, test the different IO dispatch modes, and compare the observed numbers to the baselines to learn which layer is your bottleneck.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx
 
Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
http://blogs.msdn.com/b/clustering/archive/2015/03/26/10603160.aspx

Workgroup and Multi-domain clusters in Windows Server 2016


In Windows Server 2012 R2 and previous versions, a cluster could only be created between member nodes joined to the same domain. Windows Server 2016 breaks down these barriers and introduces the ability to create a Failover Cluster without Active Directory dependencies. Failover Clusters can now therefore be created in the following configurations:

  • Single-domain Clusters: Clusters with all nodes joined to the same domain
  • Multi-domain Clusters: Clusters with nodes which are members of different domains
  • Workgroup Clusters: Clusters with nodes which are member servers / workgroup (not domain joined)

Pre-requisites

The prerequisites for Single-domain clusters are unchanged from previous versions of Windows Server.

In addition to the pre-requisites of Single-domain clusters, the following are the pre-requisites for Multi-domain or Workgroup clusters in the Windows Server 2016 Technical Preview 3 (TP3) release:

  • Management operations may only be performed using Microsoft PowerShell©. The Failover Cluster Manager snap-in tool is not supported in these configurations.
  • To create a new cluster (using the New-Cluster cmdlet) or to add nodes to the cluster (using the Add-ClusterNode cmdlet), a local account needs to be provisioned on all nodes of the cluster (as well as the node from which the operation is invoked) with the following requirements:
    1. Create a local ‘User’ account on each node in the cluster
    2. The username and password of the account must be the same on all nodes
    3. The account is a member of the local ‘Administrators’ group on each node
  • The Failover Cluster needs to be created as an Active Directory-Detached Cluster without any associated computer objects. Therefore, the cluster needs to have a Cluster Network Name (also known as administrative access point) of type DNS.
  • Each cluster node needs to have a primary DNS suffix.

Deployment

Workgroup and Multi-domain clusters may be deployed using the following steps:

  1. Create consistent local user accounts on all nodes of the cluster. Ensure that the username and password of these accounts are the same on all nodes, and add the account to the local Administrators group (a scripted sketch follows the New-Cluster example in step 3).

 

 

  2. Ensure that each node to be joined to the cluster has a primary DNS suffix.

 

  3. Create a cluster with the workgroup nodes or nodes joined to different domains. When creating the cluster, use the -AdministrativeAccessPoint parameter to specify a type of DNS so that the cluster does not attempt to create computer objects.

 

New-Cluster –Name <Cluster Name> -Node <Nodes to Cluster> -AdministrativeAccessPoint DNS
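
Putting the steps together, here is a minimal sketch to run on each node before creating the cluster. The account name, password, DNS suffix, cluster name, and node names are placeholders; setting the primary DNS suffix through the registry is just one option (it can also be set in System Properties) and requires a reboot to take effect:

# Step 1: create a consistent local administrator account on every node
net user clusteradmin "P@ssw0rd123!" /add
net localgroup Administrators clusteradmin /add

# Step 2: set a primary DNS suffix (registry approach shown here; reboot required)
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name "NV Domain" -Value "contoso.local"

# Step 3: create the cluster with a DNS administrative access point (run once, from any node)
New-Cluster -Name WGCluster -Node Node1,Node2 -AdministrativeAccessPoint DNS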

Workload

The following table summarizes workload support for Workgroup and Multi-domain clusters.

 

Cluster Workload       | Supported / Not Supported      | More Information
SQL Server             | Supported                      | We recommend that you use SQL Server Authentication.
File Server            | Supported, but not recommended | Kerberos authentication (which is not available) is the preferred authentication protocol for Server Message Block (SMB) traffic.
Hyper-V                | Supported, but not recommended | Live migration is not supported. Quick migration is supported.
Message Queuing (MSMQ) | Not supported                  | Message Queuing stores properties in AD DS.

 

Quorum Configuration

The witness type recommended for Workgroup clusters and Multi-domain clusters is a Cloud Witness or Disk Witness.  File Share Witness (FSW) is not supported with a Workgroup or Multi-domain cluster.
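
For illustration, a sketch of configuring each recommended witness type with Set-ClusterQuorum; the storage account name, access key, and disk resource name are placeholders:

# Cloud Witness (Windows Server 2016; requires an Azure storage account)
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" -AccessKey "<storage-account-access-key>"

# Disk Witness (uses a small clustered disk)
Set-ClusterQuorum -DiskWitness "Cluster Disk 2"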

Cluster Validation

Cluster Validation for Workgroup and Multi-domain clusters can be run using the Test-Cluster PowerShell cmdlet (an example follows the list below). Note the following for the Windows Server 2016 TP3 release:

  • The following tests will incorrectly generate an Error and can safely be ignored:
    • Cluster Configuration – Validate Resource Status
    • System Configuration – Validate Active Directory Configuration
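
For example, a validation run against two nodes might look like this (the node names are placeholders; use the fully qualified names that include each node's primary DNS suffix):

Test-Cluster -Node Node1.contoso.local, Node2.fabrikam.com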

Cluster Diagnostics

The Get-ClusterDiagnostics cmdlet is not supported on Workgroup and Multi-domain clusters in the Windows Server 2016 TP3 release. 

Servicing

It is recommended that nodes in a cluster have a consistent configuration. Multi-domain and Workgroup clusters introduce a higher risk of configuration drift; when deploying, ensure that:

  • The same set of Windows patches is applied to all nodes in the cluster
  • If group policies are rolled out to the cluster nodes, they do not conflict

DNS Replication

Ensure that the cluster node and network names for Workgroup and Multi-domain clusters are replicated to the DNS servers that are authoritative for the cluster nodes.

 

 

Site-aware Failover Clusters in Windows Server 2016


Windows Server 2016 debuts site-aware clusters. Nodes in stretched clusters can now be grouped based on their physical location (site). Cluster site-awareness enhances key operations during the cluster lifecycle, such as failover behavior, placement policies, heartbeating between the nodes, and quorum behavior. In the remainder of this blog I will explain how you can configure sites for your cluster, the notion of a “preferred site”, and how site awareness manifests itself in your cluster operations.

Configuring Sites

A node’s site membership can be configured by setting the Site node property to a numerical value identifying its site.

For example, in a four-node cluster with nodes Node1, Node2, Node3, and Node4, to assign the nodes to Site 1 and Site 2, do the following:

  • Launch Microsoft PowerShell© as an Administrator and type:

(Get-ClusterNode Node1).Site = 1

(Get-ClusterNode Node2).Site = 1

(Get-ClusterNode Node3).Site = 2

(Get-ClusterNode Node4).Site = 2
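
You can verify the assignments afterwards; this assumes the Site value is surfaced as a node property as configured above:

Get-ClusterNode | Format-Table Name, Site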

Configuring sites enhances the operation of your cluster in the following ways:

Failover Affinity

  • Groups fail over to a node within the same site before failing over to a node in a different site
  • During node drain, VMs are moved first to a node within the same site before being moved cross-site
  • The CSV load balancer will distribute volume ownership within the same site

Storage Affinity

Virtual Machines (VMs) follow their storage and are placed in the same site where their associated storage resides. VMs will begin live migrating to the same site as their associated CSV one minute after the storage has moved.

Cross-Site Heartbeating

You now have the ability to configure the thresholds for heartbeating between sites. These thresholds are controlled by the following new cluster properties:

 

Property           | Default Value | Description
CrossSiteDelay     | 1             | Frequency at which heartbeats are sent to nodes in different sites
CrossSiteThreshold | 20            | Number of missed heartbeats before an interface to a node in a different site is considered down

 

To configure the above properties launch PowerShell© as an Administrator and type:

(Get-Cluster).CrossSiteDelay = <value>

(Get-Cluster).CrossSiteThreshold = <value>

You can find more information on other properties controlling failover clustering heartbeating here.

The following rules define the applicability of the thresholds controlling heartbeating between two cluster nodes:

  • If the two cluster nodes are in two different sites and two different subnets, the Cross-Site thresholds override the Cross-Subnet thresholds.
  • If the two cluster nodes are in two different sites and the same subnet, the Cross-Site thresholds override the Same-Subnet thresholds.
  • If the two cluster nodes are in the same site and two different subnets, the Cross-Subnet thresholds apply.
  • If the two cluster nodes are in the same site and the same subnet, the Same-Subnet thresholds apply.

Configuring Preferred Site

In addition to configuring the site a cluster node belongs to, a “Preferred Site” can be configured for the cluster. The Preferred Site is a placement preference and will typically be your primary datacenter site.

Before the Preferred Site can be configured, the site being chosen as the preferred site needs to be assigned to a set of cluster nodes. To configure the Preferred Site for a cluster, launch PowerShell© as an Administrator and type:

(Get-Cluster).PreferredSite = <Site assigned to a set of cluster nodes>

Configuring a Preferred Site for your cluster enhances operation in the following ways:

Cold Start

During a cold start, VMs are placed in the preferred site.

Quorum

  • Dynamic Quorum drops weights from the Disaster Recovery site (the DR site, i.e. the site not designated as the Preferred Site) first, to ensure that the Preferred Site survives if all other things are equal. In addition, nodes are pruned from the DR site first during regroup after events such as asymmetric network connectivity failures.
  • During a quorum split, i.e. an even split of votes between two datacenters with no witness, the Preferred Site is automatically elected to win
    • The nodes in the DR site drop out of cluster membership
    • This allows the cluster to survive a simultaneous 50% loss of votes
    • Note that the LowerQuorumPriorityNodeID property that previously controlled this behavior is deprecated in Windows Server 2016

 

 

Preferred Site and Multi-master Datacenters

The Preferred Site can also be configured at the granularity of a cluster group, i.e. a different preferred site can be configured for each group. This enables each datacenter to be active and preferred for specific groups/VMs.

To configure the Preferred Site for a cluster group, launch PowerShell© as an Administrator and type:

(Get-ClusterGroup <Group Name>).PreferredSite = <Site assigned to a set of cluster nodes>

 

Placement Priority

Groups in a cluster are placed based on the following site priority:

  1. Storage affinity site
  2. Group preferred site
  3. Cluster preferred site

Hyper-converged with Windows Server 2016


Windows Server 2016 Technical Preview 3 was just released, and one of the big new features that has me really excited is Storage Spaces Direct (S2D).  With S2D you will be able to create a hyper-converged private cloud.  A hyper-converged infrastructure (HCI) consolidates compute and storage onto a common set of servers.  By leveraging internal storage that is replicated across nodes, you can create a true Software-Defined Storage (SDS) solution.

This is available in the Windows Server 2016 Technical Preview today!  I encourage you to go try it out and give us some feedback.  Here's where you can learn more:

Presentation from Ignite 2015:

Storage Spaces Direct in Windows Server 2016 Technical Preview
https://channel9.msdn.com/events/Ignite/2015/BRK3474

Deployment guide:

Enabling Private Cloud Storage Using Servers with Local Disks

https://technet.microsoft.com/en-us/library/mt126109.aspx

Claus Joergensen's blog:

Storage Spaces Direct
http://blogs.technet.com/b/clausjor/archive/2015/05/14/storage-spaces-direct.aspx

Thanks!
Elden Christensen
Principal PM Manager
High-Availability & Storage
Microsoft

Configuring site awareness for your multi-active disaggregated datacenters


In a previous blog, I discussed the introduction of site-aware Failover Clusters in Windows Server 2016. In this blog, I will walk through how you can configure site-awareness for your multi-active disaggregated datacenters. You can learn more about Software-Defined Storage and the advantages of a disaggregated datacenter here.

Consider the following multi-active deployment, with a compute cluster and a storage cluster stretched across two datacenters. Each cluster has two nodes in each datacenter.

To configure site-awareness for the stretched compute and storage clusters proceed as follows: 

Compute Stretch Cluster

 1)     Assign the nodes in the cluster to one of the two datacenters (sites).

  • Open PowerShell© as an Administrator and type:

(Get-ClusterNode Node1).Site = 1

(Get-ClusterNode Node2).Site = 1

(Get-ClusterNode Node3).Site = 2

(Get-ClusterNode Node4).Site = 2

2)     Configure the site for your primary datacenter.

(Get-Cluster).PreferredSite = 1

Storage Stretch Cluster

In multi-active disaggregated datacenters, the storage stretch cluster hosts a Scale-Out File Server (SoFS). For optimal performance, ensure that the site hosting the Cluster Shared Volumes backing the SoFS follows the site hosting the compute workload. This avoids the cost of cross-datacenter network traffic.

1)     As in the case of the compute cluster assign the nodes in the storage cluster to one of the two datacenters (sites).

(Get-ClusterNode Node5).Site = 1

(Get-ClusterNode Node6).Site = 1

(Get-ClusterNode Node7).Site = 2

(Get-ClusterNode Node8).Site = 2

2)     For each Cluster Shared Volume (CSV) in the cluster, configure the preferred site for the CSV group to be the same as the preferred site for the Compute Cluster.

$csv1 = Get-ClusterSharedVolume "Cluster Disk 1" | Get-ClusterGroup

($csv1).PreferredSite = 1 

3)  Set each CSV group in the cluster to automatically fail back to the preferred site when it becomes available again after a datacenter outage.

($csv1).AutoFailbackType = 1
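
To apply steps 2 and 3 to every CSV group in the cluster rather than one volume at a time, a loop along these lines could be used (a sketch; it assumes every CSV group should get the same preferred site as the compute cluster):

foreach ($csvGroup in (Get-ClusterSharedVolume | Get-ClusterGroup)) {
    $csvGroup.PreferredSite = 1     # match the compute cluster's preferred site
    $csvGroup.AutoFailbackType = 1  # fail back automatically once the preferred site is available
}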

Note: Steps 2 and 3 can also be used to configure the Preferred Site for a CSV group in a hyper-converged datacenter deployment. You can learn more about hyper-converged deployments in Windows Server 2016 here.

 

How can we improve the installation and patching of Windows Server? (Survey Request)


Do you want your server OS deployment and servicing to move faster? We're a team of Microsoft engineers who want to hear your experiences and ideas around solving real problems of deploying and servicing your server OS infrastructure. We actually prefer that you don't already love server OS deployment, and we're interested even if you don't use Windows Server. We need to learn it and earn it.

Click the link below if you wish to fill out a brief survey and perhaps participate in a short phone call.

https://aka.ms/deployland

Many Thanks!!!

-Rob.
