How HP Labs nearly invented the cloud

On the heels of HP’s news of not-quite abandoning the Cloud, there is coverage of how AWS stole a march on Sun’s plans to provide compute-on-demand. The AWS timeline starts in late 2003, when an internal team at Amazon hatched a plan that, among other things, would offer virtual servers as a retail offering. Sun’s offering involved bare metal and running jobs, not virtual machines.

In a paper published in 2004, a group of researchers at HP Labs proposed what they called “SoftUDC” – a software-based utility data center. The project involved:

  • API access to virtual resources
  • Virtualization using the Xen Hypervisor
  • Network virtualization using UDP overlays almost identical to VxLAN
  • Virtual Volumes accessible over the network from any virtual machine (like EBS)
  • “Gatekeeper” software in the hypervisor that provides the software and network virtualization
  • Multi-tier networking using subnetting and edge appliances (“VPC”)
  • Automated OS and application upgrades using the “cattle” technique (just replace instead of upgrade).
  • Control at the edge: firewalls, encryption and QoS guarantees provided at the hypervisor

Many of these ideas now seem “obvious”, but remember: this was 2004. Several of them were actually implemented. For example, VNET was the network virtualization stack and protocol, implemented as a driver in Xen dom0 that took Ethernet frames exiting the hypervisor and encapsulated them in UDP packets.

Does this mean HP could have been the dominant IaaS player instead of AWS if only it had acted on its Labs innovation? Of course not. But by 2008, when AWS was a clear danger, it could have dug a little deeper into its own technology inventory to produce a viable competitor early on. Instead we got OpenStack.

Many of AWS’s core components are based on similar concepts: the Xen hypervisor, network virtualization, virtual volumes, security groups, and so on. No doubt they came up with these concepts on their own; more importantly, they implemented them and had a strategy for building a business around them.

Who knows what innovations are cooking today in various big companies, only to be discarded as unviable ideas. This, too, can be framed as the Innovator’s Dilemma.

Back to Basics: CloudStack Basic Networking

The first choice to make when creating a zone in Apache CloudStack is the network type: basic or advanced. The blurb for “Advanced” promises “sophisticated network topologies”, while Basic promises “AWS-style networking”. Those who cut their teeth on the AWS cloud in 2008 may fondly remember what AWS now calls the “EC2-Classic” platform.

  • EC2-Classic (the original release of Amazon EC2): your instances run in a single, flat network that you share with other customers.
  • EC2-VPC (the original release of Amazon VPC): your instances run in a virtual private cloud (VPC) that is logically isolated to your AWS account.

With a few differences, these map to CloudStack’s “Basic” and “Advanced” networking.

The fundamental network isolation technique in a Basic zone is the security group. By default, all network access to a VM is denied. When you launch an instance (VM), you deploy it in one or more security groups.

Using cloudmonkey:

> deploy virtualmachine securitygroupnames=management,web displayname=web0001 templateid=7464f3a6-ec56-4893-ac51-d120a71049dd serviceofferingid=48f813b7-2061-4270-93b2-c873a0fac336 zoneid=c78c2018-7181-4c7b-ab08-57204bc2eed3

Of course you have to create the security groups first:

> create securitygroup name=web
> create securitygroup name=management

Security groups are containers for firewall rules.

> authorize securitygroupingress securitygroupname=web protocol=tcp startport=80 endport=80 cidrlist=0.0.0.0/0
> authorize securitygroupingress securitygroupname=management protocol=tcp startport=22 endport=22 cidrlist=192.168.1.0/24

In a basic zone, all network access to a VM is denied by default. These two rules allow access to our VM on the HTTP port (80) from anywhere and on the SSH port (22) only from computers in the 192.168.1.0/24 subnet.

Let’s start another web VM with these security groups:

> deploy virtualmachine securitygroupnames=management,web displayname=web0002 ...

We can log in to web0002 over ssh when our ssh client is in the 192.168.1.0/24 subnet. But when we try to log in to web0001 from web0002 over ssh, we get denied, since neither of the ingress rules we wrote above allows that. We can fix that:

> authorize securitygroupingress securitygroupname=management protocol=tcp startport=22 endport=22 securitygroupname=management

As long as the ssh client is on a VM in the management security group, ssh access is allowed to any other VM in the management security group.

Let’s create some more: an appserver group and a db group.

> create securitygroup name=appserver
> create securitygroup name=db

Let’s add these rules: allow the web VMs access to the app servers on port 8080, and allow the app servers access to the DB VMs on the MySQL port (3306).

> authorize securitygroupingress securitygroupname=appserver protocol=tcp startport=8080 endport=8080 securitygroupname=web
> authorize securitygroupingress securitygroupname=db protocol=tcp startport=3306 endport=3306 securitygroupname=appserver

Deploy some virtual machines (instances) in these groups….

> deploy virtualmachine securitygroupnames=management,appserver displayname=app0001 ...
> deploy virtualmachine securitygroupnames=management,appserver displayname=app0002 ...
> deploy virtualmachine securitygroupnames=management,db displayname=db0001 ...

The network security architecture now looks like this:

[Figure: security groups and the allowed traffic flows between them]

Pretty complicated, with just a handful of rules. The beauty of it is that it captures intent accurately. After all, as a network admin you want to say exactly “allow app VMs access to the DB VMs on tcp port 3306”.

In a traditional network, you’d create subnets and insert security devices between them. On the security device you would enter complicated ACLs, and the ACLs might have to change every time you created or destroyed a VM. In a Basic zone, once you define the groups and rules, everything is taken care of automatically. You can even edit the rules after the VMs are running. Let’s allow ICMP pings to all the VMs from our management subnet:

> authorize securitygroupingress securitygroupname=management protocol=icmp icmptype=-1 icmpcode=-1

To do the inverse:

> revoke securitygroupingress securitygroupname=management protocol=icmp ...

A significant difference from EC2-classic is that CloudStack allows you to create egress rules as well:

> authorize securitygroupegress securitygroupname=management protocol=icmp ...

This controls traffic out of the VMs. Egress filtering only takes effect once the first egress rule is added to the security group; from then on, egress is ‘deny all’ by default and only the specified egress rules allow traffic out of the VM.

Security group rules are stateful. This means that you don’t have to define a corresponding egress rule for every ingress rule. For example, when someone from the internet connects on port 80 to a web VM, the response traffic (out of the web VM) associated with that connection is automatically allowed. Stateful connection tracking also applies to ‘stateless’ protocols such as UDP and ICMP.

While a ‘Basic’ zone might seem, well, basic, it offers powerful network isolation techniques that map directly to your intent. The simple interface masks a sophisticated implementation, which I hope to describe in a future post. I hope I have convinced you that ‘Basic’ is indeed sophisticated!

Read on for a deeper dive.

How did they build that — EC2 Enhanced Networking

Among the flurry of new features introduced by AWS in 2013 is a performance enhancement known as ‘Enhanced Networking‘. According to the blurb: “enhanced networking on your instance results in higher performance (packets per second), lower latency, and lower jitter”. The requirements are that you install the Intel 10GbE virtual function driver (ixgbevf) in your instance and enable a feature called SR-IOV.
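
For concreteness, this is roughly how you would check for and enable the feature with the AWS CLI, and then verify the driver from inside the instance (the instance ID and interface name are placeholders; the instance must be stopped before the attribute can be changed):

$ aws ec2 describe-instance-attribute --instance-id i-0123456789abcdef0 --attribute sriovNetSupport
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --sriov-net-support simple
$ ethtool -i eth0        # run inside the instance; should report driver: ixgbevf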

The AWS cloud is built around virtualization technology: specifically, your instances are virtual machines running on top of a version of the open-source Xen hypervisor. The hypervisor is what guarantees the isolation between my instance and your instance when they both run on the same set of CPUs.

The hypervisor intercepts all I/O from the virtual machine so that the VM is abstracted from the hardware; this provides security as well as portability, since the VM doesn’t need to care about drivers for the I/O hardware. The VM sees a software-defined NIC, and as a result the hypervisor can inspect all traffic to and from the VM. This allows AWS to control the network traffic between the VM and the rest of the infrastructure, and it is used to deliver features such as security groups and ACLs.

The downside of processing all network traffic to and from the VM is that host CPU cycles are consumed processing this traffic. This is quite a significant overhead compared to a bare-metal instance: the hypervisor needs to apply stateful firewall rules to every packet, switch the packet, and encapsulate it. Some estimates put this overhead as high as 70% of the CPU available to the hypervisor (at 10 Gb/s). Software processing also introduces noisy-neighbor problems: variable jitter and high latency at 10 Gb/s are common.

[Figure: the hypervisor processes all network traffic to and from the VM]

Fortunately, SR-IOV (Single Root I/O Virtualization) provides a direct path for the VM to access the underlying hardware NIC, and bypassing the hypervisor leads to line-rate performance. Enhanced Networking takes advantage of this: to benefit from it, your AMI needs to have the SR-IOV (ixgbevf) driver installed.

[Figure: with SR-IOV, the VM gets a direct path to the physical NIC]

Great, but now that the hypervisor is out of the path, how does AWS provide software-defined features such as security groups and ACLs? The current generation of SR-IOV NICs (AWS uses the Intel 82599) does not have a stateful firewall or the ability to process large numbers of ACLs. Furthermore, we know that AWS must be using some kind of encapsulation / tunnelling so that VPCs are possible, and the Intel 82599 does not provide encapsulation support.

The solution, then, would be to do the extra processing elsewhere: either off the host, or in the host using a co-processor. The schematic below shows processing happening at the top-of-rack (TOR) switch. The drawback is that even intra-host traffic has to be tromboned via the TOR. Furthermore, the switch now becomes a pretty big bottleneck, and a failure in the switch could leave several hosts without network connectivity.

[Figure: filtering and encapsulation offloaded to the TOR switch]


Using a co-processor would be the best solution. Tilera is one such processor that comes to mind. Since Tilera provides general-purpose processing cores, the encap/decap, filtering and stateful firewall processing could be done in software instead of in ASICs or FPGAs.

[Figure: filtering and encapsulation offloaded to an on-host co-processor]

The software/hardware solution could allow AWS to introduce further innovations in its networking portfolio, including end-to-end encryption, IDS and IPS.

Disclaimer: I have no knowledge of AWS internals. This is just an exploration of “how did they build it?”.

Update: a confirmation of sorts on Werner Vogels’ blog: http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

Do-it-yourself CloudWatch-style alarms using Riemann

AWS CloudWatch is a web service that lets you collect, view, and analyze metrics about your AWS resources and applications. CloudWatch alarms send notifications or trigger autoscale actions based on rules defined by the user. For example, you can get an email from a CloudWatch alarm if the average latency of your web application stays over 2 ms for 3 consecutive 5-minute periods. The variables that CloudWatch lets you set are the statistic (average), the threshold (2 ms) and the duration (3 periods).
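
For reference, the alarm described above could be created with the AWS CLI roughly as follows (the alarm name, namespace, dimension and SNS topic are illustrative; ELB’s Latency metric is reported in seconds, hence the 0.002 threshold):

$ aws cloudwatch put-metric-alarm --alarm-name web-latency-high \
    --namespace AWS/ELB --metric-name Latency \
    --dimensions Name=LoadBalancerName,Value=my-web-elb \
    --statistic Average --period 300 --evaluation-periods 3 \
    --threshold 0.002 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-me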

I became interested in replicating this capability after a recent discussion / proposal about auto scaling on the CloudStack mailing list. The problem is clearly some kind of Complex Event Processing (CEP), and it appeared that there were a number of open source tools out there to do this. Among others (Storm, Esper), Riemann stood out as being built for the purpose (monitoring distributed systems) and offering the quickest way to try something out.

As the blurb says, “Riemann aggregates events from your servers and applications with a powerful stream processing language”. You use any number of client libraries to send events to a Riemann server, and you write rules in the stream processing language (actually Clojure) that the Riemann server executes on your event stream. The result of processing a rule can be another event or an email notification (or any number of other actions, such as sending to PagerDuty, Logstash, Graphite, etc.).

Installing Riemann is a breeze; the tricky part is writing the rules. If you are a Clojure noob like me, then it helps to browse through a Clojure guide first to get a feel for the syntax. You append your rules to the etc/riemann.config file. Let’s say we want to send an email whenever the average web application latency exceeds 6 ms over 3 periods of 3 seconds. Here’s one solution:

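(The rule below is a sketch rather than the original gist: it assumes a mailer pointing at a reachable SMTP relay, and that the latency events are tagged “http” as in the client script further down. fixed-time-window, combine, moving-event-window, smap, where and folds/mean are stock Riemann streams.)

(def email (mailer {:from "riemann@example.net"}))  ; assumed SMTP relay config

(streams
  (where (tagged "http")
    (fixed-time-window 3            ; batch events into 3-second windows
      (combine folds/mean           ; reduce each window to its mean latency
        (moving-event-window 3      ; sliding window of the last 3 window means
          (smap (fn [means]
                  (if (and (= 3 (count means))
                           (every? #(> (or (:metric %) 0) 6.0) means))
                    (assoc (last means)
                           :state "threshold crossed"
                           :description "mean latency above 6.0 over 3 windows of 3 seconds")
                    (last means)))
                (where (state "threshold crossed")
                  (email "itguy@onemorecoolapp.net"))))))))
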
The keywords fixed-time-window, combine, where, email, etc are Clojure functions provided out of the box by Riemann. We can write our own function tc to make the above general purpose:

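(Again a sketch rather than the original: the same pipeline as above, with the window count, window length and threshold as parameters.)

(defn tc
  "Alert when the windowed mean of :metric stays above threshold for n
  consecutive windows of dt seconds; children receive the alert event."
  [n dt threshold & children]
  (fixed-time-window dt
    (combine folds/mean
      (moving-event-window n
        (smap (fn [means]
                (if (and (= n (count means))
                         (every? #(> (or (:metric %) 0) threshold) means))
                  (assoc (last means)
                         :state "threshold crossed"
                         :description (str "service " (:service (last means))
                                           " crossed the value of " threshold
                                           " over " n " windows of " dt " seconds"))
                  (last means)))
              (apply where* #(= "threshold crossed" (:state %)) children))))))

;; used inside (streams ...), e.g.
;; (streams (where (tagged "http") (tc 3 3 6.0 (email "itguy@onemorecoolapp.net"))))
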
We can make the function even more general purpose by letting the user pass in the summarizing function and the comparison function, as in:

;; left as an exercise
(tc 3 3 6.0 folds/std-dev <
    (email "itguy@onemorecoolapp.net"))

Read that as: if the standard deviation of the metric falls below 6.0 for 3 consecutive windows of 3 seconds, send an email.

To test this functionality, here’s the client side script I used:

[~/riemann-0.2.4 ]$ irb -r riemann/client
1.9.3-p374 :001 > r = Riemann::Client.new
1.9.3-p374 :002 > t = [5.0, 5.0, 5.0, 4.0, 6.0, 5.0, 8.0, 9.0, 7.0, 7.0, 7.0, 7.0, 8.0, 6.0, 7.0, 5.0, 7.0, 6.0, 9.0, 3.0, 6.0]
1.9.3-p374 :003 > for i in (0..20)
1.9.3-p374 :004?> r << {host: "www1", service: "http req", metric: t[i], state: "ok", description: "request", tags:["http"]}
1.9.3-p374 :005?> sleep 1
1.9.3-p374 :006?> end

This generated the following event:

{:service "http req", :state "threshold crossed", :description "service  crossed the value of 6.0 over 3 windows of 3 seconds", :metric 7.0, :tags ["http"], :time 1388992717, :ttl nil}

While this is more awesome than CloudWatch alarms (you define your own summary function, and durations can be at second granularity rather than minutes), it lacks the precise semantics of CloudWatch alarms:

  • The event doesn’t contain the actual measured summary in the periods the threshold was crossed.
  • There needs to be a state for the alarm (in CloudWatch it is INSUFFICIENT_DATA, ALARM, OK). 

This is indeed fairly easy to do in Riemann; hopefully I can get to work on this and update this space. 

This does not constitute a CloudWatch alarm service though: there is no web services API to insert metrics or to CRUD alarm definitions, and there is no multi-tenancy. Perhaps one could use a Compojure-based API server and embed Riemann, but how to do that is not terribly clear to me at this point. Multi-tenancy, sharding, load balancing, high availability, etc. are topics for a different post.

To complete the autoscale solution for CloudStack would also require Riemann to send notifications to the CloudStack management server about threshold crossings. Using CloStack, this shouldn’t be too hard. More about auto scale in a separate post.

Stackmate: execute CloudFormation templates on CloudStack

AWS CloudFormation provides a simple-yet-powerful way to create ‘stacks’ of Cloud resources with a single call. The stack is described in a parameterized template file; creation of the stack is a simple matter of providing the stack parameters. The template includes descriptions of resources such as instances and security groups, and provides a language to describe the ordering dependencies between the resources.

CloudStack doesn’t have any such tool (although it has been discussed). I was interested in exploring what it takes to provide stack creation services to a CloudStack deployment. As I read through various sample templates, it was clear that the structure of the template imposes an ordering on the resources. For example, an ‘Instance’ resource might refer to a ‘SecurityGroup’ resource; this means the security group has to be created successfully before the instance can be created. Parsing the LAMP_Single_Instance.template, for example, the following dependencies emerge:

WebServer depends on ["WebServerSecurityGroup", "WaitHandle"]
WaitHandle depends on []
WaitCondition depends on ["WaitHandle", "WebServer"]
WebServerSecurityGroup depends on []

This can be expressed as a Directed Acyclic Graph (DAG); what remains is to extract an ordering by performing a topological sort of the DAG, and then we need an execution engine that can take the schedule and execute it. Fortunately for me, the Ruby ecosystem has both: the TSort module performs topological sorts, and the wonderful Ruote workflow engine by @jmettraux handles execution. Given the topological sort produced by TSort:

["WebServerSecurityGroup", "WaitHandle", "WebServer", "WaitCondition"]

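Here is a minimal sketch of how TSort can produce that ordering from the dependency map above (the class name is illustrative, not stackmate’s actual code; any valid topological order would do):

require 'tsort'

# Wrap the dependency hash so TSort can walk it.
class StackDependencies
  include TSort

  def initialize(deps)
    @deps = deps
  end

  def tsort_each_node(&block)
    @deps.each_key(&block)
  end

  def tsort_each_child(node, &block)
    @deps.fetch(node, []).each(&block)
  end
end

deps = {
  'WebServer'              => ['WebServerSecurityGroup', 'WaitHandle'],
  'WaitHandle'             => [],
  'WaitCondition'          => ['WaitHandle', 'WebServer'],
  'WebServerSecurityGroup' => []
}

p StackDependencies.new(deps).tsort
# => ["WebServerSecurityGroup", "WaitHandle", "WebServer", "WaitCondition"]
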
You can write a process definition in Ruote:

Ruote.define 'my_stack' do
  sequence do
    participant 'WebServerSecurityGroup'
    participant 'WaitHandle'
    participant 'WebServer'
    participant 'WaitCondition'
  end
end

What remains is to implement the ‘participants‘ inside the process definition. For the most part it means making API calls to CloudStack to create the security group and instance. Here, the freshly minted CloudStack Ruby client from @chipchilders came in handy.
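
To give a flavor, here is a hypothetical sketch of a participant (the class, field names and registration are illustrative, not stackmate’s actual code; the CloudStack call itself is elided):

require 'ruote'

# A participant that would create a security group, record the result in the
# workitem, and hand the workitem back so the next step in the sequence runs.
class SecurityGroupParticipant
  include Ruote::LocalParticipant

  def on_workitem
    name = workitem.params['name'] || 'WebServerSecurityGroup'
    # here stackmate would call CloudStack's createSecurityGroup API
    # (e.g. via the Ruby client mentioned above) and wait for the response
    workitem.fields['security_group'] = name
    reply
  end
end

# registered against the names used in the process definition, e.g.:
# dashboard.register_participant 'WebServerSecurityGroup', SecurityGroupParticipant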

Stackmate is the result of this investigation; satisfyingly, it is only 350-odd lines of Ruby.

Ruote gives a nice split between defining the flow and the actual work items. We can ask Ruote to roll back (cancel) a process that has launched but not finished. We can create resources concurrently instead of in sequence. There are a lot more workflow patterns here. The best part is that writing the participants is relatively trivial: just pick the right CloudStack API call to make.

While prototyping the design, I had to make a LOT of instance creation calls to my CloudStack installation. Since I don’t have a ginormous cloud in my back pocket, the excellent CloudStack simulator filled the role.

Next Steps

  • As it stands today, stackmate is executed on the command line and the workflow executes on the client side (the server being CloudStack). This mode is good for CloudStack developers performing a pre-checkin test or QA developers writing automated tests. For a production CloudStack, however, stackmate needs to be a web service and provide a user interface to launch CloudFormation templates.
  • TSort generates a topologically sorted sequence; this can be further optimized by executing some steps in parallel.
  • There are more participants to be written to implement templates with VPC resources
  • Implement rollback and timeout

Advanced

Given Ruote’s power, Ruby’s flexibility and the generality of CloudFormation templates:

  • We should be able to write CloudStack-specific templates (e.g., to take care of things like network offerings)
  • We should be able to execute AWS templates on clouds like Google Compute Engine
  • QA automation suddenly becomes a matter of writing templates rather than error-prone API call sequences
  • Templates can include custom resources such as third-party services: for example, after launching an instance, make an API call to a monitoring service to start monitoring port 80 on the instance, or, for QA automation, make a call to a testing service
  • Even more general-purpose complex workflows: we could add approval workflows, exception workflows, and so on. For example, a manager has to approve before the stack can be launched; or, if the launch fails due to resource limits, trigger an approval workflow for the manager to temporarily bump up the limits.