Windows Azure includes the concepts of fault domains and upgrade domains. Fault domains concern the physical deployment of roles, whereas upgrade domains relate to the logical deployment of roles.

Fault Domain

Fault domains define a physical unit of deployment for an application. The fault domain concept was introduced so that Windows Azure can provide highly available services and reduce single points of failure (servers, racks of servers, switches) in a datacenter. In Windows Azure, a rack of computers is identified as a fault domain.

Windows Azure decides at deployment time which fault domain each service instance is allocated to, and the service owner cannot control this. Because fault domains correspond to separate racks of computers, service instances are spread across hardware well enough that it is unlikely all of them would fail at the same time.

Note that to get the guaranteed SLA of 99.95% you must have two or more role instances, which will then be deployed to different fault domains. You can find more on the Cloud Service SLA on the Service Level Agreements page.
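To put the 99.95% figure in perspective, the short sketch below converts an availability percentage into the downtime it still permits. The function name and the 30-day month are assumptions made for illustration; the actual SLA terms define how availability is measured.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLA over `days` days.

    The 30-day default is an illustrative assumption, not an SLA term.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# Downtime budget for a 99.95% SLA over an assumed 30-day month.
print(f"{allowed_downtime_minutes(99.95):.1f} minutes")
```

Under these assumptions the budget comes to roughly 21.6 minutes per month.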

Upgrade Domain

Upgrade domains define a logical unit of deployment for an application. The upgrade domain concept was introduced so that Windows Azure can keep a service highly available while the application is being upgraded.

The number of upgrade domains can be configured in the service definition file (.csdef). The default number of upgrade domains is 5 and the maximum is 20.

<ServiceDefinition
    name="<service-name>" 
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
    schemaVersion="<version>"
    upgradeDomainCount="<number-of-upgrade-domains>">

    <!-- .... -->

</ServiceDefinition>

Windows Azure distributes the instances of a role evenly (when possible) across the configured number of upgrade domains. For example, if the default number of upgrade domains is used and a service has 5 instances, each instance is assigned to its own upgrade domain. If the service has 10 instances, each upgrade domain has 2 instances. If the service has 14 instances, the first four upgrade domains have 3 instances each and the last one has 2.
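The distribution described above can be sketched in a few lines of Python. This is only an illustration: it assumes a simple round-robin assignment (instance index modulo the upgrade domain count), which reproduces the numbers in the examples but is not confirmed to be the allocator Windows Azure actually uses. The function names are hypothetical.

```python
from collections import Counter


def assign_upgrade_domains(instance_count: int, upgrade_domain_count: int = 5) -> list:
    """Return the upgrade domain index of each instance.

    Assumes round-robin placement; the real allocation is decided by the platform.
    """
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    if not 1 <= upgrade_domain_count <= 20:
        raise ValueError("upgrade_domain_count must be between 1 and 20")
    return [i % upgrade_domain_count for i in range(instance_count)]


def instances_per_domain(instance_count: int, upgrade_domain_count: int = 5) -> list:
    """Count how many instances land in each upgrade domain."""
    counts = Counter(assign_upgrade_domains(instance_count, upgrade_domain_count))
    return [counts[d] for d in range(upgrade_domain_count)]


for n in (5, 10, 14):
    print(n, "instances ->", instances_per_domain(n))
```

For 14 instances and 5 upgrade domains the sketch yields 3, 3, 3, 3 and 2 instances, matching the example in the text.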

[Figure: Cloud Service Upgrade Domains]

Note that Windows Azure decides at deployment time which upgrade domain each service instance is allocated to, and the service owner cannot control this.

Note also that the number of upgrade domains does not have to equal the number of fault domains, so a single application could easily span several upgrade domains while being deployed to only two separate fault domains.

How a deployment proceeds

During deployment of an upgrade, all instances of the upgraded role that belong to the first upgrade domain are stopped, upgraded, and brought back online. Once they are back online, the process is repeated for the second upgrade domain (all of its instances are stopped, upgraded, and brought back online), then the third upgrade domain, and so on until all instances in all upgrade domains have been upgraded.

The screenshots below show the cloud service instances screen of the Windows Azure portal during an upgrade of a service with 3 instances and 5 (the default value) upgrade domains.
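The walk through the upgrade domains can be simulated with a small Python sketch. It is a simplified model, not the platform's implementation: it assumes round-robin placement of instances, skips upgrade domains that hold no instances, and uses the hypothetical function name `rolling_upgrade`.

```python
def rolling_upgrade(instance_count: int, upgrade_domain_count: int = 5):
    """Yield (domain_index, offline_instances, online_count) for each upgrade step.

    Placement is assumed to be round-robin; empty upgrade domains are skipped.
    """
    domains = {}
    for i in range(instance_count):
        domains.setdefault(i % upgrade_domain_count, []).append(i)

    # One upgrade domain at a time, in index order: its instances go offline
    # while the instances in every other domain keep serving.
    for domain in sorted(domains):
        offline = domains[domain]
        yield domain, offline, instance_count - len(offline)


for domain, offline, online in rolling_upgrade(3):
    print(f"upgrade domain {domain}: instances {offline} offline, {online} still online")
```

With 3 instances and the default 5 upgrade domains, the model takes one instance offline per step (upgrade domains 0, 1 and 2), leaving the other two online throughout.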

Upgrade of instances in the first upgrade domain (index 0)

Upgrade of instances in the second upgrade domain (index 1)

Upgrade of instances in the third upgrade domain (index 2)

Note that during deployment you can choose whether to update all of the roles in your service or only a single role.

[Figure: Windows Azure Role Selection]