
Design patterns in orchestrators: transfer of desired state (part 3/N)

Most datacenter automation tools operate on the basis of desired state. Desired state describes what the end state should be, but not how to get there. To simplify a great deal, if the thing being automated is the speed of a car, the desired state may be “60mph”. How to get there (braking, accelerator, gear changes, turbo) isn’t specified. Something (an “agent”) promises to maintain that desired speed.

[Figure: desired state]

The desired state and changes to the desired state are sent from the orchestrator to various agents in a datacenter. For example, the desired state may be “two apache containers running on host X”. An agent on host X will ensure that the two containers are running. If one or more containers die, then the agent on host X will start enough containers to bring the count up to two. When the orchestrator changes the desired state to “3 apache containers running on host X”, then the agent on host X will create another container to match the desired state.
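As a sketch, the agent’s core is a reconciliation loop; the container-runtime calls here are hypothetical stand-ins:

// Sketch of an agent's reconciliation loop (illustrative only).
package main

import "time"

var desired = 2 // updated when the orchestrator sends a new desired state

func running() int { return 0 /* query the container runtime */ }
func start()       {} // start one container
func stop()        {} // stop one container

func main() {
	for {
		actual := running()
		for ; actual < desired; actual++ {
			start() // a container died, or the desired count grew
		}
		for ; actual > desired; actual-- {
			stop() // the desired count shrank
		}
		time.Sleep(5 * time.Second) // re-check to catch drift
	}
}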

Transfer of desired state is another way to achieve idempotence (a problem described here).

We can see that there are two sources of changes that the agent has to react to:

  1. changes to desired state sent from the orchestrator and
  2. drift in the actual state due to independent / random events.

Let’s examine #1 in greater detail. There are a few ways to communicate the change in desired state:

  1. Send the new desired state to the agent (a “command” pattern). This approach works most of the time, except when the size of the state is very large. For instance, consider an agent responsible for storing a million objects. Deleting a single object would involve sending the whole desired state (999,999 items). Another problem is that the command may not reach the agent (“the network is not reliable”). Finally, the agent may not be able to keep up with the rate of change of the desired state and start to drop some commands. To fix this issue, the system designer might be tempted to run more instances of the agent; however, this usually leads to race conditions and out-of-order execution problems.
  2. Send just the delta from the previous desired state. This is fraught with problems: it assumes that the controller knows for sure that the previous desired state was successfully communicated to the agent, and that the agent has successfully implemented it. For example, if the first desired state was “2 running apache containers” and the delta that was sent was “+1 apache container”, then the final actual state may or may not be “3 running apache containers”. Again, network reliability is a problem here. The rate of change is an even bigger potential problem: if the agent is unable to keep up, it may drop intermediate delta requests. The final actual state of the system may then be quite different from the desired state, but the agent may not realize it! Idempotence in the delta commands helps in this case.
  3. Send just an indication of change (“interrupt”). The agent then performs the additional step of fetching the desired state from the controller. The agent can compute the delta and change the actual state to match. This has the advantage that the agent is able to combine the effects of multiple changes (“interrupt debounce”); by coalescing the interrupts, the agent can limit the rate of change (see the sketch after this list). Of course, the network could cause some of these interrupts to get “lost” as well. Lost interrupts can cause the actual state to diverge from the desired state for long periods of time. Finally, if the desired state is very large, the agent and the orchestrator have to coordinate to efficiently determine the change to the desired state.
  4. The agent could poll the controller for the desired state. There is no problem of lost interrupts; the next polling cycle will always fetch the latest desired state. The polling rate is critical here: if it is too fast, it risks overwhelming the orchestrator even when there are no changes to the desired state; if too slow, it will not converge the actual state to the desired state quickly enough.
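To make option 3 concrete, here is a sketch of an agent that coalesces a burst of interrupts into a single fetch of the full desired state; fetchDesiredStateAndReconcile is a hypothetical call:

package main

import "time"

// agent coalesces ("debounces") a burst of interrupts, then fetches the
// full desired state once. fetchDesiredStateAndReconcile is a hypothetical
// call that GETs the desired state from the orchestrator and converges.
func agent(interrupts <-chan struct{}) {
	for range interrupts { // wake on the first interrupt of a burst
		quiet := false
		for !quiet {
			select {
			case <-interrupts: // absorb further interrupts in the burst
			case <-time.After(100 * time.Millisecond):
				quiet = true
			}
		}
		fetchDesiredStateAndReconcile() // one fetch covers the whole burst
	}
}

func fetchDesiredStateAndReconcile() { /* GET /desired-state, then converge */ }

func main() {
	interrupts := make(chan struct{}, 16)
	go agent(interrupts)
	interrupts <- struct{}{} // the orchestrator signals "something changed"
	time.Sleep(time.Second)
}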

To summarize the potential issues:

  1. The network is not reliable. Commands or interrupts can be lost, or agents can restart / disconnect: there has to be some way for the agent to recover the desired state.
  2. The desired state can be prohibitively large. There needs to be some way to efficiently but accurately communicate the delta to the agent.
  3. The rate of change of the desired state can strain the orchestrator, the network and the agent. To preserve the stability of the system, the agent and orchestrator need to coordinate to limit the rate of change, the polling rate and to execute the changes in the proper linear order.
  4. Only the latest desired state matters. There has to be some way for the agent to discard all the intermediate (“stale”) commands and interrupts that it has not been able to process.
  5. Delta computation (the difference between two consecutive sets of desired state) can sometimes be more efficiently performed at the orchestrator, in which case the agent is sent the delta. Loss of the delta message or reordering of execution can lead to irrecoverable problems.

A persistent message queue can solve some of these problems. The orchestrator sends its commands or interrupts to the queue and the agent reads from the queue. The message queue buffers commands or interrupts while the agent is busy processing a desired state request. The agent and the orchestrator are nicely decoupled: they don’t need to discover each other’s location (IP/FQDN). Message framing and transport are taken care of (no more choosing between Thrift, plain text, HTTP, gRPC, and so on).

[Figure: message queue between orchestrator and agents]

There are tradeoffs however:

  1. With the command pattern, if the desired state is large, then the message queue could reach its storage limits quickly. If the agent ends up discarding most commands, this can be quite inefficient.
  2. With the interrupt pattern, a message queue is not adding much value since the agent will talk directly to the orchestrator anyway.
  3. It is not trivial to operate / manage / monitor a persistent queue. Messages may need to be aggressively expired / purged, and the promise of persistence may not actually be realized. Depending on the scale of the automation, this overhead may not be worth the effort.
  4. An “at most once” message queue can still lose messages. With “at least once” semantics, the message queue could deliver multiple copies of the same message: the agent has to be able to determine whether a message is a duplicate (a sketch follows this list). The orchestrator and agent still have to solve some of the end-to-end reliability problems.
  5. Delta computation is not solved by the message queue.
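As a sketch of the duplicate-detection tradeoff above: the orchestrator stamps each command with a unique ID and the agent remembers what it has already applied (illustrative names; a real seen-set must be bounded and persisted across restarts):

package agent

// Command is a desired-state command delivered (at least once) via the queue.
type Command struct {
	ID   string // unique per command, assigned by the orchestrator
	Body []byte
}

// Agent tracks processed command IDs to make redelivery harmless.
type Agent struct {
	seen map[string]bool
}

func (a *Agent) handle(cmd Command) {
	if a.seen[cmd.ID] {
		return // duplicate delivery: already applied, ignore
	}
	a.apply(cmd)
	a.seen[cmd.ID] = true
}

func (a *Agent) apply(cmd Command) { /* change the actual state */ }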

OpenStack (using RabbitMQ) and CloudFoundry (using NATS) have adopted message queues to communicate desired state from the orchestrator to the agent.  Apache CloudStack doesn’t have any explicit message queues, although if one digs deeply, there are command-based message queues simulated in the database and in memory.

Others solve the problem with a combination of interrupts and polling – interrupt to execute the change quickly, poll to recover from lost interrupts.

Kubernetes is one such framework. There are no message queues, and it uses an explicit interrupt-driven mechanism to communicate desired state from the orchestrator (the “API Server”) to its agents (called “controllers”).

(Image courtesy of Heptio: https://blog.heptio.com/core-kubernetes-jazz-improv-over-orchestration-a7903ea92ca)

Developers can use (but are not forced to use) a controller framework to write new controllers. An instance of a controller embeds an “Informer” whose responsibility is to watch for changes in the desired state and execute a controller function when there is a change. The Informer takes care of caching the desired state locally and computing the delta state when there are changes. The Informer leverages the “watch” mechanism in the Kubernetes API Server (an interrupt-like system that delivers a network notification when there is a change to a stored key or value). The deltas to the desired state are queued internally in the Informer’s memory. The Informer ensures the changes are executed in the correct order.
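For illustration, a minimal controller built on the client-go informer machinery might look like this sketch (the kubeconfig path and the handler bodies are placeholders):

// Sketch: a controller that reacts to changes in desired state via an Informer.
package main

import (
	"log"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// The resync period triggers a periodic full refresh from the API
	// server, recovering from any lost watch notifications.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod)
			log.Printf("new desired state: pod %s", pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			// the Informer's local cache supplies the previous state,
			// so the delta can be computed here
		},
		DeleteFunc: func(obj interface{}) { /* converge: remove state */ },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // run forever
}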

  • Desired states are versioned, so it is easier to decide to compute a delta, or to discard an interrupt.
  • The Informer can be configured to do a periodic full resync from the orchestrator (“API Server”) – this should take care of the problem of lost interrupts.
  • Apparently, there is no problem of the desired state being too large, so Kubernetes does not explicitly handle this issue.
  • It is not clear if the Informer attempts to rate-limit itself when there are excessive watches being triggered.
  • It is also not clear if at some point the Informer “fast-forwards” through its queue of changes.
  • The watches in the API Server use Etcd watches in turn. The watch server in the API server maintains only a limited history of the watch events received from Etcd and discards the oldest ones.
  • Etcd itself is a distributed data store that is more complex to operate than say, an SQL database. It appears that the API server hides the Etcd server from the rest of the system, and therefore Etcd could be replaced with some other store.

I wrote a Network Policy Controller for Kubernetes using this framework and it was the easiest integration I’ve written.

It is clear that the Kubernetes creators put some thought into the architecture, based on their experiences at Google. The Kubernetes design should inspire other orchestrator-writers, or perhaps, should be re-used for other datacenter automation purposes. A few issues to consider:

  • The agents (“controllers”) need direct network reachability to the API Server. This may not be possible in all scenarios, and may require another level of indirection.
  • The API server is not strictly an orchestrator; it is better described as a choreographer. I hope to describe this difference in a later blog post, but note that the API server never explicitly carries out a step-by-step flow of operations.

Quick Tip: Docker Machine on Apache CloudStack and XenServer

There is now Docker Machine support for Apache CloudStack. See @atsaki’s work at https://github.com/atsaki/docker-machine-driver-cloudstack

docker-machine create -d cloudstack \
--cloudstack-api-url CLOUDSTACK_API_URL \
--cloudstack-api-key CLOUDSTACK_API_KEY \
--cloudstack-secret-key CLOUDSTACK_SECRET_KEY \
--cloudstack-template "Ubuntu Server 14.04" \
--cloudstack-zone "zone01" \
--cloudstack-service-offering "Small" \
--cloudstack-expunge \
docker-machine

Another way to do this is to launch your VM in CloudStack and then use the generic driver (assuming you have the private key from your sshkeypair):

docker-machine create -d generic \
--generic-ip-address=VM_IP \
--generic-ssh-key=SSH_PRIVATE_KEY \
--generic-ssh-user=SSH_USER

This will ALSO work for plain old VMs created on XenServer (which currently does not have a docker-machine driver).

Bonus: in either case you can use docker-machine to set up a Docker Swarm by adding the parameters:

--swarm \
--swarm-discovery token://

Farming your CloudStack cloud

A couple of years ago, I blogged about my prototype of StackMate, a tool and a service that interprets AWS CloudFormation-style templates and creates CloudStack resources. The idea was to provide an application management solution. I didn’t develop the idea beyond a working prototype. Terraform from HashiCorp is a similar idea, but with the ability to add extensions (“providers”) to drive resource creation in different clouds and service providers. Fortunately Terraform is solid and widely used. Even better, Sander van Harmelen (@_svanharmelen_) has written a well-documented CloudStack provider.

Terraform templates have a different (but still JSON-style) syntax from AWS CloudFormation, which among other things lets you add comments. Like StackMate, Terraform figures out the order of resource creation by building a dependency graph. You can also add explicit “depends_on” relationships. I played around with Terraform and created a couple of templates here:

https://github.com/chiradeep/terraform-cloudstack-examples

One template creates a VPC, 2 subnets and 2 VMs. The other template creates 2 isolated networks and a couple of VMs (one with NICs on both networks).

Pull requests accepted.

While there are awesome services and products out there that can do similar things (RightScale, Scalr, Citrix Lifecycle Management), it is great to see something open sourced and community-driven.

How to manage a million firewalls – part 2

Continuing from my last post, where I hinted at the big distributed systems problem involved in managing a CloudStack Basic Zone.

It helps to understand how CloudStack is architected at a high level. CloudStack is typically operated as a cluster of identical Java applications (called the “Management Server” or “MS”). There is a MySQL database that holds the desired state of the cloud. API calls arrive at a management server (through a load balancer). The management server uses the current state as stored in the MySQL database, computes/stores a new state and communicates any changes to the cloud infrastructure.

[Figure: CloudStack management server cluster]

In response to an API call, the management server(s) usually have to communicate with one or more hypervisors. For example, adding a rule to a security group (a single API call) could involve communicating changes to dozens or hundreds of hypervisors. The job of communicating with the hypervisors is split (“sharded”) among the cluster members. For example, if there are 3,000 hypervisors and 3 management servers, then each MS handles communications with 1,000 hypervisors. If the API call arrives at MS ‘A’ but needs to update a hypervisor managed by MS ‘B’, then the communication is brokered through B.
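As a sketch, one simple way to realize such sharding (CloudStack’s actual owner assignment is tracked in its database; the hash scheme below is only illustrative):

package sharding

import "hash/fnv"

// ownerIndex maps a hypervisor to the management server that owns it.
func ownerIndex(hypervisorID string, numServers int) int {
	h := fnv.New32a()
	h.Write([]byte(hypervisorID))
	return int(h.Sum32() % uint32(numServers))
}

// isMine reports whether this management server (myIndex) owns the
// hypervisor. A call landing on server A for a hypervisor owned by B is
// brokered: A forwards the command to B, which talks to the hypervisor.
func isMine(myIndex int, hypervisorID string, numServers int) bool {
	return ownerIndex(hypervisorID, numServers) == myIndex
}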

Now, updating a thousand firewalls (remember, the firewalls are local to the hypervisor) in response to a single API call requires us to think about the API call semantics. Waiting for all 1,000 firewalls to respond could take a very long time. The better approach is to return success to the API caller and work in the background to update the 1,000 firewalls. It is also likely that the update is going to fail on a small percentage of the firewalls. The update could fail due to any number of problems: (transient) network problems between the MS and the hypervisor, a problem with the hypervisor hardware, etc.

This problem can be described in terms of the CAP theorem as well. A piece of state (the state of the security group) is being stored on a number of distributed machines (the hypervisors in this case). When there is a network partition (P), do we want the update to the state to be Consistent (every copy of the state is the same), or do we want the API to remain Available? Choosing Availability ensures that the API call never fails, regardless of the state of the infrastructure. But it also means that the state is potentially inconsistent across the infrastructure when there is a partition.

A lot of the problems with an inconsistent state can be hand-waved away[1] since the default behavior of the firewall is to drop traffic. So if the firewall doesn’t get the new rule or the new IP address, the inconsistency is safe: we are not letting in traffic that we didn’t want to.

A common strategy in AP systems is to be eventually consistent. That is, at some undefined point in the future, every node in the distributed system will agree on the state. So, for example, the API call needs to update a hundred hypervisors, but only 95 of them are available. At some point in the future, the remaining 5 do become available and are updated to the correct state.

When a previously disconnected hypervisor reconnects to the MS cluster, it is easy to bring it up to date, since the authoritative state is stored in the MySQL database associated with the CloudStack MS cluster.

A different distributed systems problem is to deal with concurrent writes. Let’s say you send a hundred API calls in quick succession to the MS cluster to start a hundred VMs. Each VM creation leads to changes in many different VM firewalls. Not every API call lands on the same MS: the load balancer in front of the cluster will distribute them across the machines in the cluster. Visualizing the timeline:

[Figure: timeline of concurrent API calls]

A design goal is to push the updates to the VM firewalls as soon as possible (this is to minimize the window of inconsistency). So, as the API calls arrive, the MySQL database is updated and the new firewall states are computed and pushed to the hypervisors.

While MySQL concurrency primitives allow us to safely modify the database (effectively serializing the updates to the security groups), the order of updates to the database may not be the order of updates that flow to the hypervisor. For example, in the table above, the firewall state computed as a result of the API call at T=0 might arrive at the firewall for VM A after the firewall state computed at T=2. We cannot accept the “older” update.

[Figure: an out-of-order (“older”) update must be ignored]

The obvious[2] solution is to include the order of computation in the message (update) sent to the firewall. Every time an API call results in a change to the state of a VM firewall, we update a persistent sequence number associated with that VM. That sequence number is transmitted to the firewall along with the new state. If the firewall notices that the latest update received is “older” than the one it has already processed, it just ignores it. In the figure above, the “red” update gets ignored.

A crucial point is that every update to the firewall has to contain the complete state: it cannot just be the delta from the previous state[3].

The sequence number has to be stored on the hypervisor so that it can be compared with the sequence number in each received update. The sequence number also optimizes the updates to hypervisors that reconnect after a network partition has healed: if the sequence number matches, then no updates are necessary.
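A sketch of the hypervisor-side check (the names and types are illustrative, not CloudStack’s actual wire format):

package firewall

// FirewallUpdate carries the complete ruleset for one VM, plus the
// sequence number assigned when the state was computed.
type FirewallUpdate struct {
	VM    string
	Seq   uint64   // monotonically increasing per VM, persisted by the MS
	Rules []string // the complete desired ruleset, never a delta
}

// lastApplied is persisted on the hypervisor so a reboot does not forget
// which update was applied last.
var lastApplied = map[string]uint64{}

func applyUpdate(u FirewallUpdate) {
	if u.Seq <= lastApplied[u.VM] {
		return // "older" (or duplicate) update: ignore it
	}
	programRules(u.VM, u.Rules) // rewrite the VM's iptables chain / ipset
	lastApplied[u.VM] = u.Seq
}

func programRules(vm string, rules []string) { /* exec iptables / ipset */ }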

Well, I’ve tried to keep this part under a thousand words. The architecture discussed here did not converge easily — there were a lot of mistakes and learning along the way. There is no way for other cloud / orchestration systems to re-use this code; however, I hope the reader will learn from my experience!


[1] The only case to worry about is when rules are deleted: an inconsistent state potentially means we are allowing traffic when we didn’t intend to. In practice, rule deletes are a very small portion of the changes to security groups. Besides, if the rule exists because it was intentionally created, it probably is OK to take a little time to delete it.
[2] Other (not-so-good) solutions involve locks per VM, and queues per VM.
[3] This is a common pattern in orchestrating distributed infrastructure.

How to manage a million firewalls – part 1

In my last post I argued that security groups eliminate the need for network security devices in certain parts of the datacenter. The trick that enables this is the network firewall in the hypervisor. Each hypervisor hosts dozens or hundreds of VMs — and provides a firewall per VM. The figure below shows a typical setup, with Xen as the hypervisor. Ingress network traffic flows through the hardware into the control domain (“dom0”) where it is switched in software (so called virtual switch or vswitch) to the appropriate VM.

[Figure: typical setup with Xen as the hypervisor]

The vswitch provides filtering functions that can block or allow certain types of traffic into the VM. Traffic between VMs on the same hypervisor goes through the vswitch as well. The vswitch used in this design is the Linux bridge; the firewall function is provided by netfilter (“iptables”).

Security groups drop all traffic by default and only allow what the rules permit. Suppose the red VMs in the figure (“Guest 1” and “Guest 4”) are in a security group “management”. We want to allow access to them from the subnet 192.168.1.0/24 on port 22 (ssh). The iptables rules might look like this:

iptables -A FORWARD -p tcp --dport 22 --src 192.168.1.0/24 -j ACCEPT 
iptables -A FORWARD -j DROP

Line 1 reads: for packets forwarded across the bridge (vswitch) that are destined for port 22 and are from source 192.168.1.0/24, allow (ACCEPT) them. Line 2 reads: DROP everything. The rules form a chain: packets traverse the chain until they match. (This is highly simplified: we want to match on the particular bridge ports that are connected to the VMs in question as well.)

Now, let’s say we want to allow members of the ‘management’ group to access their fellow members over ssh as well. Let’s say there are 2 VMs in the group, with IPs of ‘A’ and ‘B’. We calculate the membership, and for each VM’s firewall we write additional rules:

#for VM A
iptables -I FORWARD -p tcp --dport 22 --source B -j ACCEPT
#for VM B
iptables -I FORWARD -p tcp --dport 22 --source A -j ACCEPT

As we add more VMs to this security group, we have to add more such rules to each VM’s firewall. (A VM’s firewall is the chain of iptables rules that are specific to the VM.) If there are ‘N’ VMs in the security group, then each VM has N-1 iptables rules for just this one security group rule. Remember that a packet has to traverse the iptables rules until it matches or gets dropped at the end. Naturally each rule adds latency to a packet (at least to the connection-initiating ones). After a certain number (a few hundred) of rules, the latency tends to go up in hockey-stick fashion. In a large cloud, each VM could be in several security groups and each security group could have rules that interact with other security groups — easily leading to several hundred rules.

Aha, you might say, why not just summarize the N-1 source IPs and write a single rule like:

iptables -I FORWARD -p tcp --dport 22 --source <summary cidr> -j ACCEPT

Unfortunately, this isn’t possible since it is never guaranteed that the N-1 IPs will fall in a single CIDR block. Fortunately this is a solved problem: we can use ipsets. We add the N-1 IPs to a single named set (“ipset”) and match against the set:

ipset -N mgmt iphash
ipset -A mgmt <IP1>
ipset -A mgmt <IP2>
...
iptables -I FORWARD -p tcp --dport 22 -m set --match-set mgmt src -j ACCEPT

ipset matching is usually very fast and fixes the ‘scale up’ problem. In practice, I’ve seen it handle tens of thousands of IPs without significantly affecting latency or CPU load.

The second (perhaps more challenging) problem is that when the membership of a group changes, or a rule is added / deleted, a large number of VM firewalls have to be updated. Since we want to build a very large cloud, this usually means thousands or tens of thousands of hypervisors have to be updated with these changes. Let’s say in the single group/single rule example above, there are 500 VMs in the security group. Adding a VM to the group means that 501 VM firewalls have to be updated. Adding a rule to the security group means that 500 VM firewalls have to be updated. In the worst case, the VMs are on 500 different hosts — making this a very big distributed systems problem.

If we consider a typical datacenter of 40,000 hypervisor hosts, with each hypervisor hosting an average of 25 VMs, this becomes the million firewall problem.

Part 2 will examine how this is solved in CloudStack’s Basic Zone.

CloudStack Basic Networking : deeper dive

In my last post I sang the praises of the simplicity of Basic Networking. There are a few more details which even seasoned users of CloudStack may not be aware of:

  1. Security group rules are stateful. This means active connections enabled by the rules are tracked so that traffic can flow bidirectionally. Although UDP and ICMP are connectionless protocols, their “connection” is defined by the addressing tuple (source and destination IPs plus ports, or ICMP type). Stateful tracking also has the somewhat surprising property that if you remove a rule, the existing connections enabled by that rule continue to exist until closed by either end of the connection. This is identical to the behavior of AWS security groups.
  2. Security group rules can allow access to VMs from other accounts: Suppose you have a shared monitoring service across accounts. The VMs in the monitoring service can belong to the cloud operator. Other tenants can allow access to them:
    • > authorize securitygroupingress securitygroupname=web account=operator usersecuritygrouplist=nagios,cacti protocol=tcp startport=12489 ...
  3. There is always a default security group: Just like EC2-classic, if you don’t place a VM in a security group, it gets placed in the default security group. Each account has its own default security group.
  4. Security group rules work between availability zones: Security groups in an account are common across a region (multiple availability zones). Therefore, if the availability zones are routable (without NAT) to each other, then the security groups work just as well between zones. This is similar to AWS EC2-classic security groups.
  5. Subnets are shared between accounts; VMs in a security group may not share a subnet. Although tenants cannot create or choose subnets in Basic networking, their VMs are placed in subnets (“pods”) predefined by the cloud operator. The table below shows a sample of VMs belonging to two accounts spread between two subnets.
    • [Table: sample VMs from two accounts spread across two subnets]
  6. BUM traffic is silently dropped. Broadcast and multicast traffic is dropped at the VM egress to avoid attacks on other tenants in the same subnet. VMs cannot spoof their MAC address either: unicast traffic with the wrong source MAC is dropped as well.
  7. Anti-spoofing protection. VMs cannot send ARP responses for IP addresses they do not own, and cannot spoof DHCP server responses either. ARP is allowed only when the source MAC matches the VM’s assigned MAC. DHCP and DNS queries to the pod-local DHCP server are always allowed. If you run Wireshark/tcpdump within the VM, you cannot see your neighbors’ traffic even though your NIC is set to promiscuous mode.
  8. Multiple IP addresses per VM: Once the VM is started, you can request an additional IP for the VM (use the addIpToNic API).
  9. Live migration of the VM works as expected: When the operator migrates a VM, the security group rules move with the VM. Existing connections may get dropped during the migration.
  10. High Availability: As with any CloudStack installation, High Availability (aka Fast Restart) works as expected. When the VM moves to a different host, the rules move along with the VM.
  11. Effortless scaling: The largest CloudStack clouds (tens of thousands of nodes) use Basic networking. Just add more management servers.
  12. Available LBaaS: You can use a Citrix NetScaler to provide load balancing as well as Global Server Load Balancing (GSLB).
  13. Available Static NAT: You can use a Citrix NetScaler to provide Static NAT from a “public” IP to the VM IP.

There are limitations however when you use Basic Zone:

  1. The security groups function is only available on Citrix XenServer and KVM.
  2. You can’t mix Advanced Networks and Basic Networks in the same availability zone, unlike AWS EC2
  3. You can’t add/remove security groups to a VM after it has been created. This is the same as EC2-classic
  4. No VPN functions are available.

The best way to deploy your Basic Zone is to engineer your physical network according to the same principles as web-scale operators. Read on

Back to Basics: CloudStack Basic Networking

The first choice to make when creating a zone in Apache CloudStack is the network type: basic or advanced. The blurb for “Advanced” promises “sophisticated network topologies”, while Basic promises “AWS-style networking”. Those who cut their teeth on the AWS cloud in 2008 may fondly remember what AWS now calls the “EC2-Classic platform”.

Platform | Introduced In | Description
EC2-Classic | The original release of Amazon EC2 | Your instances run in a single, flat network that you share with other customers.
EC2-VPC | The original release of Amazon VPC | Your instances run in a virtual private cloud (VPC) that’s logically isolated to your AWS account.

With a few differences, these map to CloudStack’s “Basic” and “Advanced” networking.

The fundamental network isolation technique in a Basic zone is security groups. By default all network access to a VM is denied. When you launch an instance (VM), you deploy it in one or more security groups.

Using cloudmonkey:

> deploy virtualmachine securitygroupnames=management,web displayname=web0001 templateid=7464f3a6-ec56-4893-ac51-d120a71049dd serviceofferingid=48f813b7-2061-4270-93b2-c873a0fac336 zoneid=c78c2018-7181-4c7b-ab08-57204bc2eed3

Of course you have to create the security groups first:

> create securitygroup name=web
> create securitygroup name=management

Security groups are containers for firewall rules.

> authorize securitygroupingress  securitygroupname=web protocol=tcp startport=80 endport=80 cidrlist=0.0.0.0/0
> authorize securitygroupingress   securitygroupname=management protocol=tcp startport=22 endport=22 cidrlist=192.168.1.0/24

In a basic zone, all network access to a VM is denied by default. These two rules allow access to our VM on the HTTP port (80) from anywhere and on the SSH port (22) only from computers in the 192.168.1.0/24 subnet.

Let’s start another web VM with these security groups

> deploy virtualmachine securitygroupnames=management,web displayname=web0002 ...

We can log in to web0002 over ssh when our ssh client is in the 192.168.1.0/24 subnet. But when we try to log in to web0001 from web0002 over ssh, we get denied, since neither of the ingress rules we wrote above allows that. We can fix that:

> authorize securitygroupingress securitygroupname=management protocol=tcp startport=22 endport=22 usersecuritygrouplist=management

As long as the ssh client is on a VM in the management security group, ssh access is allowed to any other VM in the management security group.

Let’s create some more:

Create appserver group and a db group

> create securitygroup name=appserver
> create securitygroup name=db

Let’s add these rules: Allow web VMs access to the app servers on port 8080. Allow app servers access to the DB VMs on the MySQL port (3306):

> authorize securitygroupingress securitygroupname=appserver protocol=tcp startport=8080 endport=8080 usersecuritygrouplist=web
> authorize securitygroupingress securitygroupname=db protocol=tcp startport=3306 endport=3306 usersecuritygrouplist=appserver

Deploy some virtual machines (instances) in these groups….

> deploy virtualmachine securitygroupnames=management,appserver displayname=app0001 ...
> deploy virtualmachine securitygroupnames=management,appserver displayname=app0002 ...
> deploy virtualmachine securitygroupnames=management,db displayname=db0001 ...

The network security architecture now looks like this:

[Figure: resulting security group architecture]

It looks pretty complicated, yet it took just a handful of rules. The beauty of it is that it captures the intent accurately. After all, as a network admin you want to say exactly that: “allow app VMs access to the DB VMs on tcp port 3306”.

In a traditional network, you’d create subnets and insert security devices between them. On the security device you would have entered complicated ACLs. The ACLs might have to be changed any time you created or destroyed VMs. In a Basic Zone, once you define the groups and rules, everything is taken care of automatically. You can even edit the rules after the VMs are running. Let’s allow ICMP pings to all the VMs from our management subnet:

> authorize securitygroupingress securitygroupname=management protocol=icmp icmptype=-1 icmpcode=-1

To do the inverse:

> revoke securitygroupingress securitygroupname=management protocol=icmp ...

A significant difference from EC2-classic is that CloudStack allows you to create egress rules as well:

> authorize securitygroupegress securitygroupname=management protocol=icmp ...

This controls traffic out of the VMs. Egress rules take effect only once the first egress rule is added to a security group: from that point on, egress is ‘deny all’ by default, and only the specific egress rules allow traffic out of the VM.

Security group rules are stateful. This means that you don’t have to define a corresponding egress rule for every ingress rule. For example, when someone from the internet connects on port 80 to a web VM, response traffic (out of the web VM) associated with that connection is automatically allowed. Stateful connection tracking also applies to stateless protocols such as UDP and ICMP.

While a ‘Basic’ zone might seem, well, basic, it offers powerful network isolation techniques that directly map your intent. The simple interface actually masks a sophisticated implementation, which I hope to describe in a future post. I hope I have convinced you that ‘Basic’ is indeed sophisticated!

Read on for a deeper dive.

Do-it-yourself CloudWatch-style alarms using Riemann

AWS CloudWatch is a web service that enables the cloud user to collect, view, and analyze metrics about their AWS resources and applications. CloudWatch alarms send notifications or trigger autoscale actions based on rules defined by the user. For example, you can get an email from a CloudWatch alarm if the average latency of your web application stays over 2 ms for 3 consecutive 5-minute periods. The variables that CloudWatch lets you set are the metric (average), the threshold (2 ms) and the duration (3 periods).

I became interested in replicating this capability after a recent discussion / proposal about auto scaling on the CloudStack mailing list. It was clearly some kind of Complex Event Processing (CEP) problem, and it appeared that there were a number of open source tools out there to do this. Among others (Storm, Esper), Riemann stood out as being built for purpose (for monitoring distributed systems) and offered the quickest way to try something out.

As the blurb says “Riemann aggregates events from your servers and applications with a powerful stream processing language”. You use any number of client libraries to send events to a Riemann server. You write rules in the stream processing language (actually Clojure) that the Riemann server executes on your event stream. The result of processing the rule can be another event or an email notification (or any number of actions such as send to pagerduty, logstash, graphite, etc). 

Installing Riemann is a breeze; the tricky part is writing the rules. If you are a Clojure noob like me, it helps to browse through a Clojure guide first to get a feel for the syntax. You append your rules to the etc/riemann.config file. Let’s say we want to send an email whenever the average web application latency exceeds 6 ms over 3 consecutive windows of 3 seconds.

One solution composes fixed-time-window, combine, where, email, etc, which are Clojure stream functions provided out of the box by Riemann. We can wrap the composition in our own function tc to make it general purpose.
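A minimal sketch of such a tc (untested; it assumes Riemann 0.2.x stream semantics, in particular that smap drops nil results, and that email has been defined with mailer earlier in riemann.config):

;; Alarm when the windowed mean of the metric exceeds threshold for
;; nwindows consecutive windows of secs seconds each.
(defn tc
  [nwindows secs threshold & children]
  (fixed-time-window secs
    (combine folds/mean
      (moving-event-window nwindows
        (smap (fn [means]
                (when (and (= (count means) nwindows)
                           (every? #(> (:metric %) threshold) means))
                  (assoc (last means)
                         :state "threshold crossed"
                         :description (str "service crossed the value of " threshold
                                           " over " nwindows " windows of "
                                           secs " seconds"))))
              (apply sdo children))))))

;; usage, inside (streams ...):
;; (where (tagged "http") (tc 3 3 6.0 (email "itguy@onemorecoolapp.net")))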

We can make the function even more general purpose by letting the user pass in the summarizing function and the comparison function, as in:

;; left as an exercise
(tc 3 3 6.0 folds/std-dev <
(email "itguy@onemorecoolapp.net"))

Read that as: if the standard deviation of the metric falls below 6.0 for 3 consecutive windows of 3 seconds, send an email.

To test this functionality, here’s the client side script I used:

[~/riemann-0.2.4 ]$ irb -r riemann/client
1.9.3-p374 :001 > r = Riemann::Client.new
1.9.3-p374 :002 > t = [5.0, 5.0, 5.0, 4.0, 6.0, 5.0, 8.0, 9.0, 7.0, 7.0, 7.0, 7.0, 8.0, 6.0, 7.0, 5.0, 7.0, 6.0, 9.0, 3.0, 6.0]
1.9.3-p374 :003 > for i in (0..20)
1.9.3-p374 :004?> r << {host: "www1", service: "http req", metric: t[i], state: "ok", description: "request", tags:["http"]}
1.9.3-p374 :005?> sleep 1
1.9.3-p374 :006?> end

This generated this event:

{:service "http req", :state "threshold crossed", :description "service  crossed the value of 6.0 over 3 windows of 3 seconds", :metric 7.0, :tags ["http"], :time 1388992717, :ttl nil}

While this is more awesome than CloudWatch alarms (define your own summary, durations can be in second granularity vs minutes), it lacks the precise semantics of CloudWatch alarms:

  • The event doesn’t contain the actual measured summary in the periods the threshold was crossed.
  • There needs to be a state for the alarm (in CloudWatch it is INSUFFICIENT_DATA, ALARM, OK). 

This is indeed fairly easy to do in Riemann; hopefully I can get to work on this and update this space. 

This does not constitute a CloudWatch alarm service though: there is no web services API to insert metrics, CRUD alarm definitions, or multi-tenancy. Perhaps one could use a Compojure-based API server and embed Riemann, but it is not terribly clear to me how to do that at this point. Multi-tenancy, sharding, load balancing, high availability, etc. are topics for a different post.

To complete the autoscale solution for CloudStack would also require Riemann to send notifications to the CloudStack management server about threshold crossings. Using CloStack, this shouldn’t be too hard. More about auto scale in a separate post.

The SDN behemoth hiding in plain sight

Hint: it is Amazon Web Services (AWS). Let’s see it in action:

Create a VPC with 2 tiers: one public (10.0.0.0/24) and one private (10.0.1.0/24). These are connected via a router. Spin up 2 instances, one in each tier (10.0.0.33 and 10.0.1.168).

[ec2-user@ip-10-0-0-33 ~]$ ping 10.0.1.168 -c 2
    PING 10.0.1.168 (10.0.1.168) 56(84) bytes of data.
    64 bytes from 10.0.1.168: icmp_seq=1 ttl=64 time=1.35 ms
    64 bytes from 10.0.1.168: icmp_seq=2 ttl=64 time=0.412 ms

The sharp-eyed might have noticed an oddity: the hop count (ttl) does not decrement despite the presence of a routing hop between the 2 subnets. So, clearly it isn’t a commercial router or any standard networking stack. AWS calls it an ‘implied router‘. What is likely happening is that the VPC is realized by the creation of overlays. When the ethernet frame (the ping packet) exits 10.0.0.33, the virtual switch on the hypervisor sends the frame directly to the hypervisor that is running 10.0.1.168 inside the overlay. The vswitches do not bother to decrement the ttl, since that would cause an expensive recomputation of the checksum in the ip header. Note that AWS introduced this feature in 2009 — well before Open vSwitch even had its first release.

One could also argue that security groups and Elastic IPs at the scale of AWS’s datacenters bear the hallmarks of Software Defined Networking: clearly it required AWS to innovate beyond standard vendor gear to provide such to-date-unmatched L3 services. And these date back to the early days of AWS (2007 and 2008).

It doesn’t stop there. Elastic Load Balancing (ELB) from AWS orchestrates virtual load balancers across availability zones — providing L7 software defined networking. And that’s from 2009.

Last month’s ONS 2013 conference saw Google and Microsoft (among others) presenting facts and figures about the use of Software Defined Networking in their data centers. Given the far-and-away leadership of AWS in the cloud computing infrastructure space, it is interesting (to me) that AWS is seldom mentioned in the same breath as the pixie dust du jour “SDN”.

In CloudStack’s native overlay controller, implied routers have been on the wish list for some time. CloudStack also has a scalable implementation of security groups (although scaling to hundreds of thousands of hypervisors might take another round of engineering). CloudStack also uses virtual appliances to provide L4-L7 services such as site-to-site IPSec VPN and load balancing. While the virtual networking feature set in CloudStack is close to that of AWS, the AWS implementation is likely an order of magnitude bigger.

Stackmate : execute CloudFormation templates on CloudStack

AWS CloudFormation provides a simple-yet-powerful way to create ‘stacks’ of cloud resources with a single call. The stack is described in a parameterized template file; creation of the stack is a simple matter of providing stack parameters. The template includes descriptions of resources such as instances and security groups, and provides a language to describe the ordering dependencies between the resources.

CloudStack doesn’t have any such tool (although it has been discussed). I was interested in exploring what it takes to provide stack creation services to a CloudStack deployment. As I read through various sample templates, it was clear that the structure of the template imposed an ordering of resources. For example, an ‘Instance’ resource might refer to a ‘SecurityGroup’ resource — this means that the security group has to be created successfully first before the instance can be created. Parsing the LAMP_Single_Instance.template for example, the following dependencies emerge:

WebServer depends on ["WebServerSecurityGroup", "WaitHandle"]
WaitHandle depends on []
WaitCondition depends on ["WaitHandle", "WebServer"]
WebServerSecurityGroup depends on []

This can be expressed as a Directed Acyclic Graph — what remains is to extract an ordering by performing a topological sort of the DAG. Once sorted, we need an execution engine that can take the schedule and execute it. Fortunately for me, Ruby has both: the TSort module performs topological sorts, and the wonderful Ruote workflow engine by @jmettraux executes workflows. Given the topological sort produced by TSort:

["WebServerSecurityGroup", "WaitHandle", "WebServer", "WaitCondition"]

You can write a process definition in Ruote:

Ruote.define 'my_stack' do
  sequence do
    participant 'WebServerSecurityGroup'
    participant 'WaitHandle'
    participant 'WebServer'
    participant 'WaitCondition'
  end
end

What remains is to implement the ‘participants’ inside the process definition. For the most part this means making API calls to CloudStack to create the security group and the instance. Here, the freshly minted CloudStack Ruby client from @chipchilders came in handy.

Stackmate is the result of this investigation — satisfyingly, it is just 350-odd lines of Ruby.

Ruote gives a nice split between defining the flow and the actual work items. We can ask Ruote to roll back (cancel) a process that has launched but not finished. We can create resources concurrently instead of in sequence. There are a lot more workflow patterns here. The best part is that writing the participants is relatively trivial: just pick the right CloudStack API call to make.

While prototyping the design, I had to make a LOT of instance creation calls to my CloudStack installation — since I don’t have a ginormous cloud in my back pocket, the excellent CloudStack simulator filled the role.

Next Steps

  • As it stands today, stackmate is executed on the command line and the workflow executes on the client side (the server being CloudStack). This mode is good for CloudStack developers performing a pre-checkin test or QA developers writing automated tests. For a production CloudStack, however, stackmate needs to be a web service and provide a user interface to launch CloudFormation templates.
  • TSort generates a topologically sorted sequence; this can be further optimized by executing some steps in parallel.
  • There are more participants to be written to implement templates with VPC resources.
  • Implement rollback and timeout

Advanced

Given Ruote’s power, Ruby’s flexibility and the generality of CloudFormation templates:

  • We should be able to write CloudStack-specific templates (e.g., to take care of stuff like network offerings)
  • We should be able to execute AWS templates on clouds like Google Compute Engine
  • QA automation suddenly becomes a matter of writing templates rather than error-prone API call sequences
  • Templates can include custom resources such as 3rd party services: for example, after launching an instance, make an API call to a monitoring service to start monitoring port 80 on the instance, or for QA automation: make a call to a testing service
  • Even more general purpose complex workflows: we could add approval workflows, exception workflows and so on. For example, a manager has to approve before the stack can be launched. Or, if the launch fails due to resource limits, trigger an approval workflow from the manager to temporarily bump up resource limits.