This post continues from my last one, where I hinted at the big distributed systems problem involved in managing a CloudStack Basic Zone.
It helps to understand how CloudStack is architected at a high level. CloudStack is typically operated as a cluster of identical Java applications (called the “Management Server” or “MS”). There is a MySQL database that holds the desired state of the cloud. API calls arrive at a management server (through a load balancer). The management server uses the current state as stored in the MySQL database, computes/stores a new state and communicates any changes to the cloud infrastructure.
In response to an API call, the management server(s) usually have to communicate with one or more hypervisors. For example, adding a rule to a security group (a single API call) could involve communicating changes to dozens or hundreds of hypervisors. The job of communicating with the hypervisors is split (“sharded”) among the cluster members. For example, if there are 3000 hypervisors and 3 management servers, then each MS handles communications with 1000 hypervisors. If the API call arrives at MS ‘A’ and needs to update a hypervisor managed by MS ‘B’, then the communication is brokered through B.
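The sharding and brokering described above can be sketched roughly as follows. This is only an illustration: the round-robin split and the names `ManagementServer`, `Cluster`, `send_direct` and `send` are my assumptions for the example, not CloudStack's actual classes or assignment policy.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementServer:
    name: str
    owned: set = field(default_factory=set)       # hypervisors this MS talks to
    delivered: list = field(default_factory=list)

    def send_direct(self, hypervisor_id, message):
        # Stand-in for the real MS-to-hypervisor channel (hypothetical).
        self.delivered.append((hypervisor_id, message))


class Cluster:
    def __init__(self, servers, hypervisor_ids):
        self.servers = servers
        self.owner = {}
        # Assumed policy: split the hypervisors round-robin across the cluster.
        for i, hv in enumerate(hypervisor_ids):
            ms = servers[i % len(servers)]
            ms.owned.add(hv)
            self.owner[hv] = ms

    def send(self, receiving_ms, hypervisor_id, message):
        # The MS that received the API call hands the message to the owner,
        # which is the only MS that talks to that hypervisor.
        owner = self.owner[hypervisor_id]
        owner.send_direct(hypervisor_id, message)
        if owner is receiving_ms:
            return "direct"
        return f"brokered via {owner.name}"
```

With 3000 hypervisors and 3 management servers, each `owned` set ends up with 1000 entries, and a call that lands on A for a hypervisor owned by B is delivered by B.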
Now updating a thousand firewalls (remember, the firewalls are local to the hypervisor) in response to a single API call requires us to think about the API call semantics. Waiting for all 1000 firewalls to respond could take a very long time. The better approach is to return success to the API and work in the background to update the 1000 firewalls. It is also likely that the update is going to fail on a small percentage of the firewalls. The update could fail due to any number of problems: (transient) network problems between the MS and the hypervisor, a problem with the hypervisor hardware, etc.
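A minimal sketch of "return success, then update in the background", using a thread pool. The function names, the `unreachable` set used to simulate failures, and the response shape are all assumptions made for the illustration; the real system's transport and retry logic are not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def push_rule(hypervisor_id, rule, unreachable):
    # Hypothetical transport: fails for hypervisors we pretend are unreachable.
    if hypervisor_id in unreachable:
        raise ConnectionError(f"hypervisor {hypervisor_id} unreachable")
    return hypervisor_id


def handle_api_call(pool, hypervisor_ids, rule, unreachable=frozenset()):
    # (The desired state would be written to the database here, before returning.)
    futures = {
        hv: pool.submit(push_rule, hv, rule, unreachable)
        for hv in hypervisor_ids
    }
    # Respond immediately, without waiting for any firewall to acknowledge.
    return {"status": "success"}, futures


def failed_hypervisors(futures):
    # Future.exception() blocks until that particular push has finished.
    return sorted(hv for hv, f in futures.items() if f.exception() is not None)
```

The caller gets its response as soon as the work is queued; the list returned by `failed_hypervisors` is what a background task would later retry.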
This problem can be described in terms of the CAP theorem as well. A piece of state (the state of the security group) is being stored on a number of distributed machines (the hypervisors in this case). When there is a network partition (P), do we want the update to the state to be Consistent (every copy of the state is the same), or do we want the API to be Available (every call gets a response, even during the partition)? Choosing Availability ensures that the API call never fails, regardless of the state of the infrastructure. But it also means that the state is potentially inconsistent across the infrastructure when there is a partition.
A lot of the problems with an inconsistent state can be hand-waved away[1] since the default behavior of the firewall is to drop traffic. If the firewall doesn’t get the new rule or the new IP address, the inconsistency is safe: we are not letting in traffic that we didn’t want to.
A common strategy in AP systems is to be eventually consistent. That is, at some undefined point in the future, every node in the distributed system will agree on the state. So, for example, the API call needs to update a hundred hypervisors, but only 95 of them are available. At some point in the future, the remaining 5 do become available and are updated to the correct state.
When a previously disconnected hypervisor reconnects to the MS cluster, it is easy to bring it up to date, since the authoritative state is stored in the MySQL database associated with the CloudStack MS cluster.
A different distributed systems problem is dealing with concurrent writes. Let’s say you send a hundred API calls in quick succession to the MS cluster to start a hundred VMs. Each VM creation leads to changes in many different VM firewalls. Not every API call lands on the same MS: the load balancer in front of the cluster will distribute them across the machines in the cluster. Visualizing the timeline:
A design goal is to push the updates to the VM firewalls as soon as possible (this is to minimize the window of inconsistency). So, as the API calls arrive, the MySQL database is updated and the new firewall states are computed and pushed to the hypervisors.
While MySQL concurrency primitives allow us to safely modify the database (effectively serializing the updates to the security groups), the order of updates to the database may not be the order of updates that flow to the hypervisor. For example, in the table above, the firewall state computed as a result of the API call at T=0 might arrive at the firewall for VM A after the firewall state computed at T=2. We cannot accept the “older” update.
The obvious[2] solution is to insert the order of computation in the message (update) sent to the firewall. Every time an API call results in a change to the state of a VM firewall, we update a persistent sequence number associated with that VM. That sequence number is transmitted to the firewall along with the new state. If the firewall notices that the latest update received is “older” than the one it has already processed, it just ignores it. In the figure above, the “red” update gets ignored.
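On the management server side, bumping the persistent per-VM sequence number might look something like the sketch below. It uses an in-memory SQLite database purely as a stand-in for MySQL, and the table name, column names and message shape are hypothetical.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    db.execute(
        "CREATE TABLE vm_firewall ("
        "vm_id TEXT PRIMARY KEY, seq INTEGER NOT NULL, rules TEXT NOT NULL)"
    )
    return db


def record_change(db, vm_id, rules):
    """Store the new rule set, bump the VM's sequence number, build the update."""
    rules_text = ",".join(sorted(rules))
    with db:  # one transaction: commits on success, rolls back on error
        db.execute("INSERT OR IGNORE INTO vm_firewall VALUES (?, 0, '')", (vm_id,))
        db.execute(
            "UPDATE vm_firewall SET seq = seq + 1, rules = ? WHERE vm_id = ?",
            (rules_text, vm_id),
        )
        seq = db.execute(
            "SELECT seq FROM vm_firewall WHERE vm_id = ?", (vm_id,)
        ).fetchone()[0]
    # The message carries the sequence number alongside the new state.
    return {"vm_id": vm_id, "seq": seq, "rules": sorted(rules)}
```

Each call for the same VM produces a message with a strictly larger `seq`, which is what lets the receiving firewall order the updates regardless of arrival order.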
A crucial point is that every update to the firewall has to contain the complete state: it cannot just be the delta from the previous state[3].
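On the receiving end, the check is tiny. Here is a rough sketch; the `VmFirewall` class and the string form of the rules are made up for the example, and treating a duplicate (equal) sequence number the same as a stale one is my assumption.

```python
class VmFirewall:
    """Hypothetical per-VM firewall agent running on the hypervisor."""

    def __init__(self):
        self.seq = 0              # last sequence number applied
        self.rules = frozenset()  # empty rule set: drop all traffic

    def apply(self, seq, rules):
        # Ignore anything that is not newer than what we already hold.
        if seq <= self.seq:
            return False
        # Replace the whole rule set: the update is complete state, not a delta.
        self.seq = seq
        self.rules = frozenset(rules)
        return True


fw = VmFirewall()
fw.apply(2, {"tcp/22", "tcp/80"})  # the newer state happens to arrive first
fw.apply(1, {"tcp/22"})            # the older update arrives late and is ignored
```

This also shows why deltas would not work: update 1 is never applied, so update 2 has to stand on its own.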
The sequence number has to be stored on the hypervisor so that the firewall can compare it with the sequence number of each incoming update. The sequence number also optimizes the updates to hypervisors that reconnect after a network partition has healed: if the sequence number matches, then no updates are necessary.
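The reconnect optimization could be sketched like this. The function name and the dictionary shapes are assumptions for illustration: `authoritative` stands for what the database holds, and `reported` for the sequence numbers the reconnecting hypervisor says it has.

```python
def plan_resync(authoritative, reported):
    """Return the full-state updates a reconnecting hypervisor still needs.

    authoritative: vm_id -> (seq, rules), as stored in the database
    reported:      vm_id -> seq, as stored on the hypervisor
    """
    updates = {}
    for vm_id, (seq, rules) in authoritative.items():
        # Matching sequence numbers mean the firewall is already up to date.
        if reported.get(vm_id) != seq:
            updates[vm_id] = (seq, rules)
    return updates
```

Only VMs whose sequence number differs (or that the hypervisor has never heard of) get a fresh copy of their complete state.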
Well, I’ve tried to keep this part under a thousand words. The architecture discussed here did not converge easily: there were a lot of mistakes and learning along the way. There is no way for other cloud / orchestration systems to re-use this code. However, I hope the reader will learn from my experience!
1. The only case to worry about is when rules are deleted: an inconsistent state potentially means we are allowing traffic when we didn’t intend to. In practice, rule deletes are a very small portion of the changes to security groups. Besides, if the rule exists because it was intentionally created, it is probably OK to take a little time to delete it.
2. Other (not-so-good) solutions involve locks per VM, and queues per VM.
3. This is a common pattern in orchestrating distributed infrastructure.