Orchestration is a somewhat overloaded term in the context of automation. Generally, it implies a central controller that tries to bring a complicated system to a desired state. There are usually a large number of subsystems that the controller manages. Changing the state of the system involves communicating with the subsystems in order to get them to change their state. The communication usually happens over a network.
As a simple example, consider a home automation controller that is trying to get the home ready to receive its occupants by:
- Setting the indoor temperature by setting the thermostats
- Opening the garage door
- Turning on lights
- Turning on the tea kettle
The network however is unreliable. There are several failure modes to consider:
- The message from the controller may never reach the subsystem. Usually the subsystem acknowledges the control messages from the controller. The controller may implement a timeout so that if the subsystem never gets the message, the controller times out waiting for the acknowledgement and executes some kind of recovery
- The message may reach the subsystem but the subsystem is not ready or not in a state to process it. The controller will get a negative acknowledgement in this case and needs to execute another kind of fault recovery procedure.
- The message reaches the subsystem and the subsystem executes the requested control, but fails to complete the requested task. For example, it may request a downstream subsystem to execute a task, but that downstream subsystem fails (again, perhaps due to the network). The controller may or may not get a different negative acknowledgement in this case. The subsystem may even fail midway through the task.
- The subsystem gets the message, executes the task perfectly, but the acknowledgement never reaches the controller. The controller usually times out and executes some kind of fault recovery procedure.
Distinguishing between these kinds of failures at the controller is a little hard. If there is a timeout, it can’t determine if the subsystem performed the requested task or not. A common recovery procedure is to re-try the command to the subsystem. Within this recovery mode, the controller has to decide:
- how many retries
- how long to retry
- when to alert a human
Depending on the semantics of the task, there are different answers. Consider an orchestration flow where the controller has to set up a virtual machine. The tasks involved could be to allocate storage, program network elements such as switches, routers and DHCP servers, choose hypervisor hosts and so on. Any of these tasks could fail. Retrying indefinitely to allocate storage when there is not enough storage available doesn’t make sense. Retrying because there was a timeout might make sense. Alerting a human when there are hundreds or thousands of subsystems being modified doesn’t scale – it is better to design recoverability into the system.
When the controller re-tries the command to the subsystem, it is possible to have an unexpected effect. Let’s say the storage subsystem in the virtual machine example did allocate the storage as requested the first time, but the controller didn’t receive the acknowledgement. The controller retries the command, resulting in double allocation at the storage system.
The solution in this case is to ensure that the commands from the controller to the subsystem are idempotent. That is, executing the same command multiple times produces the same result. The trick is uniquely identify the change that is being requested. The subsystem stores/remembers the identifier so that if the change is re-requested, it doesn’t re-do the change. The identifier can be opaque (i.e., the structure or contents of the id have no semantics, like a uuid ) or be derived from the state description sent to the subsystem (e.g., a file name). Opaque identifiers help avoid leaky abstractions between the controller and the subsystem. In many cases the subsystem cannot be modified to be idempotent (e.g., proprietary systems, different admin space), so a non-opaque identifier has to be used. Examples include fully-qualified domain names, filesystem paths and IP addresses.
The idempotency trick helps in another corner case: where the subsystem reboots / re-initializes or gets recreated due to a failure: it may not know the last command / desired state sent by the controller. For example, consider the case of the home automation system where a defective thermostat is replaced. The new thermostat contacts the home automation controller. The controller re-sends the last control command. Since the new thermostat doesn’t have a record of the unique identifier in the command, it applies the change requested by the command.
A complex system with many subsystems and resources is constantly changing state independent of the controller. For example, hosts reboot, network switches go down, disks fail, and so on. The controller has to detect when the system has drifted from the desired state and then execute compensating commands to the subsystems to bring them back to the desired state. Having idempotent commands with unique identifiers is crucial to this recovery.
Architects of orchestration controllers often discover the need for idempotent operations well after implementation is in production. Since the controller usually in turn offers an API, the system architect has to ensure that this “northbound” API also supports idempotent commands / operations. Even Amazon Web Services (AWS) introduced idempotent run-instances quite late in the game (2010).