Tag Archives: Cloud Computing

How HP Labs nearly invented the cloud

On the heels of HP’s news of not-quite abandoning the Cloud, there is coverage of how AWS stole a march on Sun’s plans to provide compute-on-demand. The timeline for AWS starts late 2003 when an internal team in Amazon hatched a plan that among other things could offer virtual servers as a retail offering. Sun’s offering involved bare metal and running jobs, not virtual machines.

In a paper published in 2004 a group of researchers at HP Labs proposed what they called “SoftUDC” – a software-based utility data center. The project involved:

  • API access to virtual resources
  • Virtualization using the Xen Hypervisor
  • Network virtualization using UDP overlays almost identical to VXLAN
  • Virtual Volumes accessible over the network from any virtual machine (like EBS)
  • “Gatekeeper” software in the hypervisor that provides the software and network virtualization
  • Multi-tier networking using subnetting and edge appliances (“VPC”)
  • Automated OS and application upgrades using the “cattle” technique (just replace instead of upgrade).
  • Control at the edge: firewalls, encryption and QoS guarantees provided at the hypervisor

Many of these ideas now seem “obvious”, but remember this was 2004. Many of these ideas were even implemented. For example, VNET is the name of the network virtualization stack / protocol. This was implemented as a driver in Xen dom0 that would take Ethernet frames exiting the hypervisor and encapsulate them in UDP datagrams.

Does this mean HP could have been the dominant IaaS player instead of AWS if only it had acted on its Labs innovation? Of course not. But, let’s say in 2008, when AWS was a clear danger, it could’ve dug a little deeper inside its own technological inventory to produce a viable competitor early on. Instead we got OpenStack.

Many of AWS’s core components are based on similar concepts: the Xen hypervisor, network virtualization, virtual volumes, security groups, and so on. No doubt they came up with these concepts on their own; more importantly, they implemented them and had a strategy for building a business around them.

Who knows what innovations are cooking today in various big companies, only to get discarded as unviable ideas. This can be framed as the Innovator’s Dilemma as well.


How to manage a million firewalls – part 2

Continuing from my last post where I hinted about the big distributed systems problem involved in managing a CloudStack Basic Zone.

It helps to understand how CloudStack is architected at a high level. CloudStack is typically operated as a cluster of identical Java applications (called the “Management Server” or “MS”). There is a MySQL database that holds the desired state of the cloud. API calls arrive at a management server (through a load balancer). The management server uses the current state as stored in the MySQL database, computes/stores a new state and communicates any changes to the cloud infrastructure.

sg_groups_pptx8

In response to an API call, the management server(s) usually have to communicate with one or more hypervisors. For example, adding a rule to a security group (a single API call) could involve communicating changes to dozens or hundreds of hypervisors. The job of communicating with the hypervisors is split (“sharded”) among the cluster members. For example, if there are 3000 hypervisors and 3 management servers, then each MS handles communications with 1000 hypervisors. If the API call arrives at MS ‘A’ and needs to update a hypervisor managed by MS ‘B’, then the communication is brokered through B.
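The sharding and brokering described above can be sketched in a few lines of Python. This is only an illustration: the modulo assignment and the names `owner_of` and `route` are assumptions made for the example, not CloudStack's actual assignment logic.

```python
from collections import Counter

# Hypothetical cluster: three management servers, as in the example above.
MANAGEMENT_SERVERS = ["A", "B", "C"]


def owner_of(hypervisor_id: int) -> str:
    """Management server responsible for a hypervisor (assumed modulo sharding)."""
    return MANAGEMENT_SERVERS[hypervisor_id % len(MANAGEMENT_SERVERS)]


def route(receiving_ms: str, hypervisor_id: int) -> list:
    """Hops an update takes from the MS that got the API call to the hypervisor."""
    owner = owner_of(hypervisor_id)
    hops = [receiving_ms]
    if owner != receiving_ms:
        hops.append(owner)  # brokered through the MS that owns the hypervisor
    hops.append(f"hypervisor-{hypervisor_id}")
    return hops


# 3000 hypervisors spread over 3 management servers
load = Counter(owner_of(h) for h in range(3000))
```

With this assignment each management server ends up with 1000 hypervisors, and a call landing on ‘A’ for a hypervisor owned by ‘B’ takes one extra hop through ‘B’.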

Now updating a thousand firewalls (remember, the firewalls are local to the hypervisor) in response to a single API call requires us to think about the API call semantics. Waiting for all 1000 firewalls to respond could take a very long time. The better approach is to return success to the API and work in the background to update the 1000 firewalls. It is also likely that the update is going to fail on a small percentage of the firewalls. The update could fail due to any number of problems: (transient) network problems between the MS and the hypervisor, a problem with the hypervisor hardware, etc.

This problem can be described in terms of the CAP theorem as well. A piece of state (the state of the security group) is being stored on a number of distributed machines (the hypervisors in this case). When there is a network partition (P), do we want the update to the state to be Consistent (every copy of the state is the same), or do we want the API to be Available (every call gets a response)? Choosing Availability ensures that the API call never fails, regardless of the state of the infrastructure. But it also means that the state is potentially inconsistent across the infrastructure when there is a partition.

A lot of the problems with an inconsistent state can be hand-waved away [1] since the default behavior of the firewall is to drop traffic. So if the firewall doesn’t get the new rule or the new IP address, it means that inconsistency is safe: we are not letting in traffic that we didn’t want to.

A common strategy in AP systems is to be eventually consistent. That is, at some undefined point in the future, every node in the distributed system will agree on the state. So, for example, the API call needs to update a hundred hypervisors, but only 95 of them are available. At some point in the future, the remaining 5 do become available and are updated to the correct state.

When a previously disconnected hypervisor reconnects to the MS cluster, it is easy to bring it up to date, since the authoritative state is stored in the MySQL database associated with the CloudStack MS cluster.
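A toy model of this “return success now, converge later” behaviour, in Python. The `Cluster` class and its method names are hypothetical and chosen for illustration; the `desired` attribute merely stands in for the MySQL database.

```python
class Cluster:
    """Toy model: authoritative state in the 'database', copies on hypervisors."""

    def __init__(self, hypervisors):
        self.desired = None                          # stands in for the MySQL database
        self.actual = {h: None for h in hypervisors}  # state held by each hypervisor
        self.reachable = set(self.actual)

    def disconnect(self, h):
        """Simulate a network partition that cuts off one hypervisor."""
        self.reachable.discard(h)

    def api_update(self, new_state):
        """Store the desired state, push best-effort, and report success right away."""
        self.desired = new_state
        for h in self.reachable:
            self.actual[h] = new_state
        return "success"

    def reconnect(self, h):
        """A hypervisor returns: bring it up to date from the authoritative state."""
        self.reachable.add(h)
        self.actual[h] = self.desired

    def stale(self):
        """Hypervisors whose copy differs from the authoritative state."""
        return sorted(h for h, s in self.actual.items() if s != self.desired)


cloud = Cluster(hypervisors=range(100))
for h in range(95, 100):
    cloud.disconnect(h)                 # only 95 of 100 are available

result = cloud.api_update({"allow tcp/22 from 192.168.1.0/24"})
stale_before = cloud.stale()            # the 5 partitioned hypervisors

for h in stale_before:
    cloud.reconnect(h)                  # partition heals
stale_after = cloud.stale()
```

The API call succeeds even though five hypervisors missed the update; once they reconnect, every copy matches the authoritative state again.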

A different distributed systems problem is to deal with concurrent writes. Let’s say you send a hundred API calls in quick succession to the MS cluster to start a hundred VMs. Each VM creation leads to changes in many different VM firewalls. Not every API call lands on the same MS: the load balancer in front of the cluster will distribute it to all the machines in the cluster. Visualizing the timeline:

sg_groups_pptx9

A design goal is to push the updates to the VM firewalls as soon as possible (this is to minimize the window of inconsistency). So, as the API calls arrive, the MySQL database is updated and the new firewall states are computed and pushed to the hypervisors.

While MySQL concurrency primitives allow us to safely modify the database (effectively serializing the updates to the security groups), the order of updates to the database may not be the order of updates that flow to the hypervisor. For example, in the table above, the firewall state computed as a result of the API call at T=0 might arrive at the firewall for VM A after the firewall state computed at T=2. We cannot accept the “older” update.

sg_groups_pptx10

The obvious [2] solution is to insert the order of computation in the message (update) sent to the firewall. Every time an API call results in a change to the state of a VM firewall, we update a persistent sequence number associated with that VM. That sequence number is transmitted to the firewall along with the new state. If the firewall notices that the latest update received is “older” than the one it has already processed, it just ignores it. In the figure above, the “red” update gets ignored.

A crucial point is that every update to the firewall has to contain the complete state: it cannot just be the delta from the previous state [3].

The sequence number has to be stored on the hypervisor so that it can compare the received sequence number. The sequence number also optimizes the updates to hypervisors that reconnect after a network partition has healed: if the sequence number matches, then no updates are necessary.
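The hypervisor-side check can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `VmFirewall` class, its fields, and the rule strings are hypothetical, and the real agent would program iptables rather than store a set.

```python
from dataclasses import dataclass


@dataclass
class VmFirewall:
    """Hypothetical hypervisor-side record for one VM's firewall."""
    seq: int = 0                    # last sequence number applied
    rules: frozenset = frozenset()  # complete rule set, never a delta

    def apply(self, seq: int, rules) -> bool:
        """Apply a full-state update; ignore it if it is not newer."""
        if seq <= self.seq:
            return False            # stale or duplicate update
        self.seq = seq
        self.rules = frozenset(rules)
        return True

    def needs_sync(self, authoritative_seq: int) -> bool:
        """On reconnect: an update is needed only if the numbers differ."""
        return self.seq != authoritative_seq


fw = VmFirewall()
# The state computed later (sequence 3) happens to arrive first...
newer_applied = fw.apply(3, {"allow tcp/22 from B", "allow tcp/22 from C"})
# ...and the state computed earlier (sequence 1) arrives late and is ignored.
older_applied = fw.apply(1, {"allow tcp/22 from B"})
```

Because each message carries the complete state, dropping the late message loses nothing: the firewall already holds everything the newer update described.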

Well, I’ve tried to keep this part under a thousand words. The architecture discussed here did not converge easily; there were a lot of mistakes and learning along the way. There is no way for other cloud / orchestration systems to re-use this code; however, I hope the reader will learn from my experience!


1. The only case to worry about is when rules are deleted: an inconsistent state potentially means we are allowing traffic when we didn’t intend to. In practice, rule deletes are a very small portion of the changes to security groups. Besides, if the rule exists because it was intentionally created, it is probably OK to take a little time to delete it.
2. Other (not-so-good) solutions involve locks per VM, and queues per VM.
3. This is a common pattern in orchestrating distributed infrastructure.

How to manage a million firewalls – part 1

In my last post I argued that security groups eliminate the need for network security devices in certain parts of the datacenter. The trick that enables this is the network firewall in the hypervisor. Each hypervisor hosts dozens or hundreds of VMs — and provides a firewall per VM. The figure below shows a typical setup, with Xen as the hypervisor. Ingress network traffic flows through the hardware into the control domain (“dom0”) where it is switched in software (so called virtual switch or vswitch) to the appropriate VM.

sg_groups_pptx7

The vswitch provides filtering functions that can block or allow certain types of traffic into the VM. Traffic between the VMs on the same hypervisor goes through the vswitch as well. The vswitch used in this design is the Linux Bridge; the firewall function is provided by netfilter (“iptables”).

Security groups drop all traffic by default and only allow those configured by the rules. Suppose the red VMs in the figure (“Guest 1” and “Guest 4”) are in a security group “management”. We want to allow access to them from the subnet 192.168.1.0/24 on port 22 (ssh). The iptables rules might look like this:

iptables -A FORWARD -p tcp --dport 22 --src 192.168.1.0/24 -j ACCEPT 
iptables -A FORWARD -j DROP

Line 1 reads: for packets forwarded across the bridge (vswitch) that are destined for port 22, and are from source 192.168.1.0/24, allow (ACCEPT) them. Line 2 reads: DROP everything. The rules form a chain: packets traverse the chain until they match. (This is highly simplified: we want to match on the particular bridge ports that are connected to the VMs in question as well.)

Now, let’s say we want to allow members of the ‘management’ group to access each other over ssh as well. Let’s say there are 2 VMs in the group, with IPs of ‘A’ and ‘B’. We calculate the membership and, for each VM’s firewall, we write additional rules:

#for VM A
iptables -I FORWARD -p tcp --dport 22 --source B -j ACCEPT
#for VM B
iptables -I FORWARD -p tcp --dport 22 --source A -j ACCEPT

As we add more VMs to this security group, we have to add more such rules to each VM’s firewall. (A VM’s firewall is the chain of iptables rules that are specific to the VM.) If there are ‘N’ VMs in the security group, then each VM has N-1 iptables rules for just this one security group rule. Remember that a packet has to traverse the iptables rules until it matches or gets dropped at the end. Naturally each rule adds latency to a packet (at least to the connection-initiating ones). After a certain number (a few hundred) of rules, the latency tends to go up hockey-stick fashion. In a large cloud, each VM could be in several security groups and each security group could have rules that interact with other security groups, easily leading to several hundred rules.
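To make the N-1 growth concrete, here is a small Python sketch that generates the per-VM rules for one “members may ssh to each other” rule. The function name `member_rules` and the sample addresses are assumptions for illustration, and like the snippets above it leaves out matching on each VM’s bridge port.

```python
def member_rules(members, port=22):
    """For each VM in the group, one ACCEPT rule per *other* member."""
    return {
        vm: [
            f"iptables -I FORWARD -p tcp --dport {port} --source {other} -j ACCEPT"
            for other in members
            if other != vm
        ]
        for vm in members
    }


group = ["10.0.0.4", "10.0.7.19", "10.1.2.200", "10.3.0.8"]  # hypothetical member IPs
rules = member_rules(group)

rules_per_vm = {vm: len(r) for vm, r in rules.items()}  # N-1 rules for each VM
total_rules = sum(rules_per_vm.values())                # N * (N-1) across the group
```

With 4 members each VM carries 3 rules, 12 in total; at 500 members that becomes 499 rules per VM for this single security group rule.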

Aha, you might say, why not just summarize the N-1 source IPs and write a single rule like:

iptables -I FORWARD -p tcp --dport 22 --source <summary cidr> -j ACCEPT

Unfortunately, this isn’t possible since it is never guaranteed that the N-1 IPs will be in a single CIDR block. Fortunately this is a solved problem: we can use ipsets. We can add the N-1 IPs to a single named set (“ipset”). Then:

#create the set first, then add the member IPs
ipset -N mgmt iphash
ipset -A mgmt <IP1>
ipset -A mgmt <IP2>
...
iptables -I FORWARD -p tcp --dport 22 -m set --match-set mgmt src -j ACCEPT

Matching against an ipset is usually very fast and fixes the ‘scale up’ problem. In practice, I’ve seen it handle tens of thousands of IPs without significantly affecting latency or CPU load.
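To see why a single summary CIDR is rarely available, the following Python sketch tries to collapse a list of member IPs using the standard library’s `ipaddress` module. The helper name `summarize` and the sample addresses are made up for the example.

```python
import ipaddress


def summarize(ips):
    """Collapse host addresses into the smallest exact list of CIDR blocks."""
    nets = [ipaddress.ip_network(ip) for ip in ips]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# Four adjacent, aligned addresses collapse into one block...
adjacent = summarize(["10.0.0.4", "10.0.0.5", "10.0.0.6", "10.0.0.7"])

# ...but members scattered across the datacenter do not collapse at all.
scattered = summarize(["10.0.0.4", "10.0.7.19", "10.1.2.200"])
```

Only the adjacent case yields a single block (`10.0.0.4/30`); the scattered members stay as three separate /32 entries, and any wider CIDR would also admit addresses that are not in the group. A named set sidesteps the problem because it needs one rule no matter how the members are laid out.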

The second (perhaps more challenging) problem is that when the membership of a group changes, or a rule is added / deleted, a large number of VM firewalls have to be updated. Since we want to build a very large cloud, this usually means thousands or tens of thousands of hypervisors have to be updated with these changes. Let’s say in the single group/single rule example above, there are 500 VMs in the security group. Adding a VM to the group means that 501 VM firewalls have to be updated. Adding a rule to the security group means that 500 VM firewalls have to be updated. In the worst case, the VMs are on 500 different hosts, making this a very big distributed systems problem.

If we consider a typical datacenter of 40,000 hypervisor hosts, with each hypervisor hosting an average of 25 VMs, this becomes the million firewall problem.

Part 2 will examine how this is solved in CloudStack’s Basic Zone.

Is AWS S3 the CDO of the Cloud?

The answer: not really, but the question needs examination.

One of the causes of the financial crisis of 2008 was the flawed ratings of complex financial instruments by supposed experts (ratings bodies such as S&P). Instruments such as CDOs commingled mortgages of varying risk levels, yet managed to get the best (AAA) ratings. The math (simplified) was that the likelihood of all the component mortgages failing at the same time was very low, and hence the CDO (more accurately the AAA rated portion of the CDO) itself was safe. For example, if the probability of a single mortgage defaulting is 5%, then the probability of 5 mortgages defaulting at the same time is 1 in 3 million. The painful after-effects of the bad assumptions underlying the math are still being felt globally, 5 years later.
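The simplified math is easy to reproduce in Python. The variable names are mine, and the key assumption is spelled out in the comments: the figure only holds if the defaults are independent of each other.

```python
p_default = 0.05   # probability that a single mortgage defaults
n_mortgages = 5

# Assumes defaults are independent, so the probabilities simply multiply.
p_all_independent = p_default ** n_mortgages
one_in_independent = 1 / p_all_independent   # roughly 3.2 million

# Opposite extreme: perfectly correlated defaults (all fail together or not at all).
p_all_correlated = p_default
one_in_correlated = 1 / p_all_correlated     # roughly 20
```

Under independence the joint default is about a 1-in-3.2-million event; if the mortgages move together it is no rarer than a single default. Everything hinges on that one modeling assumption.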

If there is a AAA equivalent in the Cloud, it is AWS S3 which promises “11 9s of durability” (99.999999999%) or:

“if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years”.
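The two statements are the same number expressed two ways, as this short Python check shows. It assumes the durability figure is a per-object, per-year probability, which the quote above does not state explicitly; exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

durability = Fraction("0.99999999999")   # "11 9s", assumed to be per object per year
annual_loss_per_object = 1 - durability  # 1 in 10^11

objects = 10_000
expected_losses_per_year = objects * annual_loss_per_object
years_per_lost_object = 1 / expected_losses_per_year
```

Here `years_per_lost_object` comes out to exactly 10,000,000, matching the quoted claim.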

However, AWS does not give us the mathematical reasoning behind it. We have some clues:

“Amazon S3 redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region. The service is designed to sustain concurrent device failures by quickly detecting and repairing any lost redundancy”.

Let’s assume S3 stores 3 copies of each object in 3 datacenters that do not share the same failure domains. For example, the datacenters can each have different sources of power and be located in different earthquake fault zones and flood zones. So, by design, the possibility of 3 simultaneous failures is presumably low. Intuitively, if the probability of losing one copy is 10^-4, then the probability of losing all 3 is (10^-4)^3 or 10^-12. This is an oversimplification: there are other factors in the calculation, such as the time to recover a failed copy. This graph (courtesy of Mozy) shows that the probability of losing an object is far lower with 3 copies (blue) than with 2 copies (yellow).

Image

The fatal flaw with the CDO ratings was that the failures of the component mortgages did chillingly correlate when the housing bubble burst: something that the ratings agencies had not considered in their models. Perhaps the engineers and modelers at Amazon are far better at prediction science. But remember that the Fukushima nuclear disaster was also a result of a failure to account for an earthquake larger than 8.6. What we do know is that, at least in the US Standard region, AWS stores the object bi-coastally, so the odds of a natural disaster simultaneously wiping out everything must indeed be quite low. There are still manmade disasters (bugs, malicious attacks) of course. But given the uncertainty of such events (known unknowns), it is likely that the 10^-11 figure ignores such eventualities. Other AWS regions (EU, Japan, etc) do not store the objects with such wide geographic separation, but there are no caveats on the 10^-11 figure for those regions, so I’m guessing that the geographical separation in the US Standard region does not figure in the 10^-11 calculation.

Interestingly none of S3’s competitors make similar claims. I can’t find Google Storage or Azure Storage making similar claims (or any claims). This riposte from the Google Storage team says

“We don’t believe that quoting a number without hard data to back it up is meaningful to our customers … we can’t share the kind of architectural information necessary to back up a durability number”.

Windows Azure storage SLA quotes an “availability” number of 99.9%. This availability number is the same as that of Amazon S3.

Waitaminnit. What happened to the 11 9’s we saw above?

The difference between availability and durability is that while the object (at least 1 copy) might exist in AWS S3, it may not be available via the S3 APIs. So there’s a difference of 8 nines between the availability and durability of an S3 object.
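For anyone who wants to check the count, here is a small Python helper that converts a fraction into its “number of nines”. The function name `nines` is just a label for this example.

```python
import math


def nines(fraction: float) -> int:
    """Number of nines in a figure such as 99.9% (passed in as 0.999)."""
    return round(-math.log10(1 - fraction))


availability_nines = nines(0.999)           # the 99.9% availability figure
durability_nines = nines(0.99999999999)     # the 99.999999999% durability figure
gap = durability_nines - availability_nines
```

The availability figure has 3 nines and the durability figure has 11, which is where the difference of 8 comes from.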

Given the actual track record of AWS S3, it is perhaps time to revise the durability estimates of this amazing service. If I were an enterprise or government organization considering moving my applications to the AWS cloud, I would certainly want to examine the math and modeling behind the figures. During congressional testimony, the ratings agencies defended their work by saying that it was impossible to anticipate the housing downturn: but clearly lots of people had anticipated it and made a killing by shorting those gold-plated CDOs. The suckers who trusted the ratings agencies blindly failed to do their own due diligence.

I do believe that storing your data in AWS S3 is less risky than your own data center and probably cheaper than storing it across your own multiple data centers. But it behooves one to use a backup: preferably with another provider such as Google Storage. As chief cloud booster Adrian Cockcroft of Netflix says: