Availability Sets in Azure
As part of a resiliency effort with one of my customers, we invested time in developing an Availability Set deployment model for their VMs. The goal was to ensure that, regardless of deployment size, a given workload would continue to function when the hypervisor hosting one of its VMs was in a fault state or being updated. The challenge we encountered was determining whether a given region supported two or three fault domains. In this blog post, I describe why knowing the number of fault domains was necessary, and how I built an authoritative record of that number per region.
An Availability Set is a “logical grouping of VMs that allows Azure to understand how your application is built to provide for redundancy and availability.” When you create an Availability Set in a given region, you’re effectively telling Azure that the VMs inside that grouping need to be orchestrated with respect to two kinds of events: a fault in the underlying compute, network, or storage (at the hypervisor level), or an update to the hypervisor itself. When either event occurs, Azure uses the VM’s membership in the Availability Set to decide how that VM is handled relative to its workload peer VMs. To handle both cases, Azure lets you define how many fault and update domains an Availability Set uses, up to the maximum that the region supports. A fault domain is a grouping of virtual machines that share a common power source and network switch. An update domain is a grouping of virtual machines that can be rebooted at the same time during planned maintenance. In addition, VMs are aligned with disk fault domains, which ensures that the managed disks attached to those VMs are located (effectively) on separate storage.
It’s this association that determines whether a given region can have an availability set with two or three fault domains when it’s instantiated.
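To make the difference between two and three fault domains concrete, here is a minimal Bash sketch. It assumes VMs are spread round-robin across fault domains, which is a simplification for illustration and not a statement of Azure's actual placement logic. The function names fault_domain_for_vm and max_vms_lost are hypothetical.

```shell
#!/usr/bin/env bash
# Simplified model (an assumption, not Azure's real placement algorithm):
# VM number N lands in fault domain N modulo the fault domain count.
fault_domain_for_vm() {
  local vm_index=$1 fd_count=$2
  echo $(( vm_index % fd_count ))
}

# Worst-case number of VMs lost when a single fault domain fails,
# under the round-robin assumption above: ceil(vm_count / fd_count).
max_vms_lost() {
  local vm_count=$1 fd_count=$2
  echo $(( (vm_count + fd_count - 1) / fd_count ))
}
```

Under this model, max_vms_lost 6 2 prints 3 and max_vms_lost 6 3 prints 2: with six VMs, a region limited to two fault domains can lose half of the workload to a single fault, while a region with three loses a third.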
As a Bash shell enthusiast, I figured I could whip up a script to dynamically generate the list by brute forcing the creation of an availability set against every region that my Azure account has access to with the fault domain variable set to 3, and so I did.
for i in $(az account list-locations -o tsv --query "[].name"); do
  # Create a throwaway resource group in the region; skip the region if that fails
  if az group create -o none -l "$i" -n "rg-as-$i" 2>/dev/null; then
    # Try to create an availability set with three fault domains
    if az vm availability-set create -o none -n "as-$i" -g "rg-as-$i" -l "$i" \
        --platform-fault-domain-count 3 --platform-update-domain-count 3 2>/dev/null; then
      echo "$i,supported"
    else
      echo "$i,unsupported"
    fi
  fi
done
The script is relatively straightforward. Assuming that the az CLI is configured correctly, the loop starts by listing all of the regions that the account has access to. For each one, it creates a resource group and, if that succeeds, attempts to create an availability set in that region with the fault domain count set to 3. If that succeeds, it prints the region along with "supported"; if not, "unsupported". Regions where the resource group itself can't be created are skipped silently.
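The loop above leaves a resource group named rg-as-<region> behind in every region it prints. As a rough sketch of how the output could be post-processed, the two helpers below filter the supported regions out of the region,status lines and delete the probe resource groups. The function names supported_regions and cleanup_probe_groups, and the results.csv file in the usage note, are my own placeholders; the az group delete flags (--name, --yes, --no-wait) are standard az CLI options.

```shell
#!/usr/bin/env bash
# Print only the regions marked "supported".
# Expects "region,status" lines on stdin, as printed by the loop above.
supported_regions() {
  awk -F, '$2 == "supported" { print $1 }'
}

# Delete the probe resource groups (rg-as-<region>) created by the loop.
# Reads region names from stdin, one per line; blank lines are skipped.
cleanup_probe_groups() {
  local region
  while IFS= read -r region; do
    [ -n "$region" ] || continue
    # --yes skips the confirmation prompt; --no-wait returns immediately.
    # stdin is redirected so az cannot consume the remaining region list.
    az group delete --name "rg-as-$region" --yes --no-wait </dev/null
  done
}
```

Assuming the loop's output was saved to results.csv, running supported_regions < results.csv lists the regions that accepted three fault domains, and cut -d, -f1 results.csv | cleanup_probe_groups removes every resource group the probe created. Deleting the resource group also removes the availability set inside it.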
At this time, only 16 regions supported three fault domains. They are:
I’m curious how this list will change over time. I’m glad that my customer had this requirement because it provided an opportunity for me to learn more about Availability Sets and how their implementation can vary between regions.