Farming your CloudStack cloud

A couple of years ago, I blogged about my prototype of StackMate, a tool and a service that interprets AWS CloudFormation-style templates and creates CloudStack resources. The idea was to provide an application management solution. I didn’t develop the idea beyond a working prototype. Terraform from HashiCorp is a similar idea, but with the ability to add extensions (providers) to drive resource creation in different clouds and service providers. Fortunately, Terraform is solid and widely used. Even better, Sander van Harmelen (@_svanharmelen_) has written a well-documented CloudStack provider.

Terraform templates have a different (but JSON-style) syntax than AWS CloudFormation, which among other things lets you add comments. Like StackMate, Terraform figures out the order of resource creation by building a dependency graph. You can also add explicit “depends_on” relationships. I played around with Terraform and created a couple of templates here:

One template creates a VPC with two subnets and two VMs. The other creates two isolated networks and a couple of VMs (one with NICs on both networks).

Pull requests accepted.

While there are awesome services and products out there that can do similar things (RightScale, Scalr, Citrix Lifecycle Management), it is great to see something open sourced and community-driven.


How HP Labs nearly invented the cloud

On the heels of HP’s news of not-quite abandoning the cloud, there is coverage of how AWS stole a march on Sun’s plans to provide compute-on-demand. The timeline for AWS starts in late 2003, when an internal team at Amazon hatched a plan that, among other things, could offer virtual servers as a retail offering. Sun’s offering involved bare metal and running jobs, not virtual machines.

In a paper published in 2004 a group of researchers at HP Labs proposed what they called “SoftUDC” – a software-based utility data center. The project involved:

  • API access  to virtual resources
  • Virtualization using the Xen Hypervisor
  • Network virtualization using UDP overlays almost identical to VxLAN
  • Virtual Volumes accessible over the network from any virtual machine (like EBS)
  • “Gatekeeper” software in the hypervisor that provides the software and network virtualization
  • Multi-tier networking using subnetting and edge appliances (“VPC”)
  • Automated OS and application upgrades using the “cattle” technique (just replace instead of upgrade).
  • Control at the edge: firewalls, encryption and QoS guarantees provided at the hypervisor

Many of these ideas now seem “obvious”, but remember this was 2004. Many of these ideas were even implemented. For example, VNET is the name of the network virtualization stack / protocol. This was implemented as a driver in Xen dom0 that would take Ethernet frames exiting the hypervisor and encapsulate them in UDP frames.

Does this mean HP could have been the dominant IaaS player instead of AWS if only it had acted on its Labs innovation? Of course not. But let’s say in 2008, when AWS was a clear danger, it could’ve dug a little deeper into its own technological inventory to produce a viable competitor early on. Instead we got OpenStack.

Many of AWS’s core components are based on similar concepts: the Xen hypervisor, network virtualization, virtual volumes, security groups, and so on. No doubt they came up with these concepts on their own — more importantly, they implemented them and had a strategy for building a business around them.

Who knows what innovations are cooking today in various big companies, only to get discarded as unviable ideas. This can be framed as the Innovator’s Dilemma as well.

How to manage a million firewalls – part 2

Continuing from my last post, where I hinted at the big distributed systems problem involved in managing a CloudStack Basic Zone.

It helps to understand how CloudStack is architected at a high level. CloudStack is typically operated as a cluster of identical Java applications (called the “Management Server” or “MS”). There is a MySQL database that holds the desired state of the cloud. API calls arrive at a management server (through a load balancer). The management server uses the current state as stored in the MySQL database, computes/stores a new state and communicates any changes to the cloud infrastructure.


In response to an API call, the management server(s) usually have to communicate with one or more hypervisors. For example, adding a rule to a security group (a single API call) could involve communicating changes to dozens or hundreds of hypervisors. The job of communicating with the hypervisors is split (“sharded”) among the cluster members. For example, if there are 3,000 hypervisors and 3 management servers, each MS handles communications with 1,000 hypervisors. If the API call arrives at MS ‘A’ but needs to update a hypervisor managed by MS ‘B’, the communication is brokered through ‘B’.
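The sharding idea above can be sketched in a few lines. This is purely illustrative (a modulo scheme over a fixed cluster list), not CloudStack's actual assignment algorithm:

```python
# Illustrative sketch: hypervisors are partitioned among the live
# management servers. Modulo assignment is an assumption for the example.

def assign_owner(hypervisor_id, ms_cluster):
    """Pick the management server responsible for a hypervisor."""
    return ms_cluster[hypervisor_id % len(ms_cluster)]

ms_cluster = ["A", "B", "C"]
shards = {}
for h in range(3000):
    shards.setdefault(assign_owner(h, ms_cluster), []).append(h)

# With 3000 hypervisors and 3 management servers, each MS owns 1000.
assert all(len(hosts) == 1000 for hosts in shards.values())
```

A real cluster would also have to rebalance the shards when a management server joins or leaves, which this sketch ignores.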

Now, updating a thousand firewalls (remember, the firewalls are local to the hypervisors) in response to a single API call requires us to think about the API call’s semantics. Waiting for all 1,000 firewalls to respond could take a very long time. The better approach is to return success to the API caller and work in the background to update the 1,000 firewalls. It is also likely that the update will fail on a small percentage of the firewalls, due to any number of problems: (transient) network problems between the MS and the hypervisor, a problem with the hypervisor hardware, etc.
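The "return early, fan out in the background" pattern described above can be sketched as follows. This is an illustrative model, not CloudStack code; `push_rule`, `desired_state`, and the thread pool are assumptions for the example:

```python
from concurrent.futures import ThreadPoolExecutor

desired_state = []   # stands in for the authoritative MySQL state
applied = set()      # which (hypervisor, rule) pairs have been pushed

def push_rule(hypervisor, rule):
    # In reality this RPC can fail: network partition, dead host, etc.
    applied.add((hypervisor, rule))

def authorize_ingress(rule, hypervisors, pool):
    desired_state.append(rule)           # persist the intent first
    for h in hypervisors:
        pool.submit(push_rule, h, rule)  # fan out in the background
    return {"success": True}             # don't wait for 1000 replies

with ThreadPoolExecutor(max_workers=32) as pool:
    result = authorize_ingress("allow tcp/22", range(1000), pool)

assert result["success"]
# The pool's shutdown waits for the workers here; in production some
# pushes would still be pending (or failed) when the API call returns.
assert len(applied) == 1000
```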

This problem can be described in terms of the CAP theorem as well. A piece of state (the state of the security group) is stored on a number of distributed machines (the hypervisors, in this case). When there is a network partition (P), do we want updates to that state to be Consistent (every copy of the state is the same), or do we want the API to remain Available? Choosing Availability ensures that the API call never fails, regardless of the state of the infrastructure. But it also means that the state is potentially inconsistent across the infrastructure during a partition.

A lot of the problems with an inconsistent state can be hand-waved away [1], since the default behavior of the firewall is to drop traffic. So if the firewall doesn’t get the new rule or the new IP address, the inconsistency is safe: we are not letting in traffic that we didn’t want to.

A common strategy in AP systems is to be eventually consistent. That is, at some undefined point in the future, every node in the distributed system will agree on the state. So, for example, the API call needs to update a hundred hypervisors, but only 95 of them are available. At some point in the future, the remaining 5 do become available and are updated to the correct state.

When a previously disconnected hypervisor reconnects to the MS cluster, it is easy to bring it up to date, since the authoritative state is stored in the MySQL database associated with the CloudStack MS cluster.

A different distributed systems problem is dealing with concurrent writes. Let’s say you send a hundred API calls in quick succession to the MS cluster to start a hundred VMs. Each VM creation leads to changes in many different VM firewalls. Not every API call lands on the same MS: the load balancer in front of the cluster distributes them across the machines in the cluster. Visualizing the timeline:


A design goal is to push the updates to the VM firewalls as soon as possible (this is to minimize the window of inconsistency). So, as the API calls arrive, the MySQL database is updated and the new firewall states are computed and pushed to the hypervisors.

While MySQL concurrency primitives allow us to safely modify the database (effectively serializing the updates to the security groups), the order of updates to the database may not be the order in which updates flow to the hypervisors. For example, in the table above, the firewall state computed as a result of the API call at T=0 might arrive at the firewall for VM A after the firewall state computed at T=2. We cannot accept the “older” update.

The obvious [2] solution is to include the order of computation in the message (update) sent to the firewall. Every time an API call results in a change to the state of a VM firewall, we update a persistent sequence number associated with that VM. That sequence number is transmitted to the firewall along with the new state. If the firewall notices that the latest update received is “older” than the one it has already processed, it just ignores it. In the figure above, the “red” update gets ignored.
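The hypervisor-side check can be sketched as follows. This is illustrative, not CloudStack's actual agent code; the `VmFirewall` class is an assumption for the example:

```python
# Sketch of the per-VM sequence number check: the agent remembers the last
# sequence number it applied and silently ignores anything older.

class VmFirewall:
    def __init__(self):
        self.seq = 0          # last sequence number applied
        self.rules = []

    def apply_update(self, seq, full_state):
        if seq <= self.seq:   # stale update that arrived late: ignore it
            return False
        self.seq = seq
        self.rules = list(full_state)  # each update carries the complete state
        return True

fw = VmFirewall()
assert fw.apply_update(2, ["allow tcp/22 from B"])   # newer state: applied
assert not fw.apply_update(1, ["allow tcp/22"])      # "older" update: ignored
assert fw.rules == ["allow tcp/22 from B"]
```

Note that the check only works because each update carries the complete state; applying a stale delta out of order would corrupt the firewall.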

A crucial point is that every update to the firewall has to contain the complete state: it cannot be just the delta from the previous state [3].

The sequence number has to be stored on the hypervisor so that it can compare each received sequence number with the last one applied. The sequence number also optimizes updates to hypervisors that reconnect after a network partition heals: if the sequence numbers match, no update is necessary.

Well, I’ve tried to keep this part under a thousand words. The architecture discussed here did not converge easily — there were a lot of mistakes and learning along the way. There is no way for other cloud / orchestration systems to re-use this code; however, I hope the reader will learn from my experience!

[1] The only case to worry about is when rules are deleted: an inconsistent state potentially means we are allowing traffic that we didn’t intend to. In practice, rule deletes are a very small portion of the changes to security groups. Besides, if the rule exists because it was intentionally created, it probably is OK to take a little time to delete it.
[2] Other (not-so-good) solutions involve locks per VM, and queues per VM
[3] This is a common pattern in orchestrating distributed infrastructure

How to manage a million firewalls – part 1

In my last post I argued that security groups eliminate the need for network security devices in certain parts of the datacenter. The trick that enables this is the network firewall in the hypervisor. Each hypervisor hosts dozens or hundreds of VMs — and provides a firewall per VM. The figure below shows a typical setup, with Xen as the hypervisor. Ingress network traffic flows through the hardware into the control domain (“dom0”) where it is switched in software (so called virtual switch or vswitch) to the appropriate VM.


The vswitch provides filtering functions that can block or allow certain types of traffic into the VM. Traffic between VMs on the same hypervisor goes through the vswitch as well. The vswitch used in this design is the Linux bridge; the firewall function is provided by netfilter (“iptables”).

Security groups drop all traffic by default and only allow what is permitted by the configured rules. Suppose the red VMs in the figure (“Guest 1” and “Guest 4”) are in a security group “management”. We want to allow ssh (port 22) access to them from an admin subnet. The iptables rules might look like this:

iptables -A FORWARD -p tcp --dport 22 --src <admin subnet CIDR> -j ACCEPT
iptables -A FORWARD -j DROP

Line 1 reads: for packets forwarded across the bridge (vswitch) that are destined for port 22 and come from the admin subnet, allow (ACCEPT) them. Line 2 reads: DROP everything. The rules form a chain: packets traverse the chain until they match. (This is highly simplified: in practice we also want to match on the particular bridge ports that are connected to the VMs in question.)

Now, let’s say we want to allow members of the ‘management’ group to access each other over ssh as well. Let’s say there are 2 VMs in the group, with IPs ‘A’ and ‘B’. We calculate the membership and, for each VM’s firewall, write additional rules:

#for VM A
iptables -I FORWARD -p tcp --dport 22 --source B -j ACCEPT
#for VM B
iptables -I FORWARD -p tcp --dport 22 --source A -j ACCEPT

As we add more VMs to this security group, we have to add more such rules to each VM’s firewall. (A VM’s firewall is the chain of iptables rules specific to that VM.) If there are ‘N’ VMs in the security group, then each VM has N-1 iptables rules for just this one security group rule. Remember that a packet has to traverse the iptables rules until it matches or gets dropped at the end. Naturally, each rule adds latency to a packet (at least to the connection-initiating ones). After a certain number of rules (a few hundred), the latency tends to go up hockey-stick fashion. In a large cloud, each VM could be in several security groups, and each security group could have rules that reference other security groups — easily leading to several hundred rules per VM.
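The blow-up described above is easy to quantify. The helper below is a back-of-the-envelope sketch; it assumes (as a simplification) one group-sourced rule per group:

```python
# One group-sourced rule in a security group of N members expands to N-1
# iptables rules on each member's chain.

def rules_per_vm(group_sizes):
    """iptables rules one VM traverses, given the sizes of the groups it
    belongs to, assuming a single group-sourced rule per group."""
    return sum(n - 1 for n in group_sizes)

# One group of 500 VMs: 499 rules per VM for a single logical rule.
assert rules_per_vm([500]) == 499
# A VM in three groups of 100, 150 and 60 members already traverses
# over 300 rules for just three logical rules.
assert rules_per_vm([100, 150, 60]) == 307
```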

Aha, you might say, why not just summarize the N-1 source IPs and write a single rule like:

iptables -I FORWARD -p tcp --dport 22 --source <summary cidr> -j ACCEPT

Unfortunately, this isn’t possible, since it is never guaranteed that the N-1 IPs will fall within a single CIDR block. Fortunately, this is a solved problem: we can use ipset. We can add the N-1 IPs to a single named set (an “ipset”). Then:

ipset -N mgmt iphash
ipset -A mgmt <IP1>
ipset -A mgmt <IP2>
iptables -I FORWARD -p tcp --dport 22 -m set --match-set mgmt src -j ACCEPT

ipset matching is very fast and fixes the ‘scale up’ problem. In practice, I’ve seen it handle tens of thousands of IPs without significantly affecting latency or CPU load.

The second (perhaps more challenging) problem is that when the membership of a group changes, or a rule is added / deleted, a large number of VM firewalls have to be updated. Since we want to build a very large cloud, this usually means thousands or tens of thousands of hypervisors have to be updated with these changes. Let’s say, in the single group / single rule example above, there are 500 VMs in the security group. Adding a VM to the group means that 501 VM firewalls have to be updated. Adding a rule to the security group means that 500 VM firewalls have to be updated. In the worst case, the VMs are on 500 different hosts — making this a very big distributed systems problem.
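The fan-out arithmetic above can be captured in a tiny sketch (illustrative only; the event names are assumptions for the example):

```python
# A membership change or a rule change touches every member's firewall.

def firewalls_to_update(event, group_size):
    if event == "add_vm":      # existing members learn the new member's IP,
        return group_size + 1  # and the new VM needs the full rule set
    if event == "add_rule":    # every member's chain gets the new rule
        return group_size
    raise ValueError(event)

assert firewalls_to_update("add_vm", 500) == 501
assert firewalls_to_update("add_rule", 500) == 500
```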

If we consider a typical datacenter of 40,000 hypervisor hosts, with each hypervisor hosting an average of 25 VMs, this becomes the million firewall problem.

Part 2 will examine how this is solved in CloudStack’s Basic Zone.

CloudStack Basic Networking : frictionless infrastructure

Continuing on my series exploring CloudStack’s Basic Zone:

Back to Basics

Basic Networking deep dive

The origin of the term ‘Basic’ lies in the elimination of switch and router configuration (primarily VLANs) that trips up many private cloud implementations. When the cloud operator creates a Basic Zone, she is asked to add Pods to the availability zone. Pods are containers for hypervisor hosts.

The figure above shows a section of a largish Basic Zone. The cloud operator has chosen to map each rack to one Pod in CloudStack. Two Pods (Rack 1 and Rack 24) are shown with a sample of hypervisor hosts. VMs in three security groups are shown. As described in the previous post, the Pod subnets are defined by the cloud operator when she configures the Pods in CloudStack. The cloud user cannot choose the Pod (or subnet) when deploying a VM.

The firewalls shown in each host reflect the fact that the security group rules are enforced in the hypervisor firewall and not on any centralized or in-line appliance. CloudStack orchestrates the configuration of these firewalls (essentially iptables rules) every time a VM state changes or a security group is reconfigured using the user API.

Each rack can have multiple uplinks to the L3 core. In fact, this is the way data centers are architected for cloud and big data workloads. In a modern datacenter, the racks form the leaves and the L3 core consists of multiple spine routers. Each host has multiple network paths to every other host — at equal cost. CloudStack’s Basic Zone takes advantage of this any-to-any east-west bandwidth availability by not constraining the placement of VMs by network location (although such a facility [placement groups] is available in CloudStack).


The cloud operator can still use VLANs for the rack-local links. For example, access VLAN 100 can be used in each  rack to connect to the hypervisors (the “guest network”), while the untagged interface (the “management network”) can be used to connect to the management interface of each hypervisor.

CloudStack automatically instantiates a virtual DHCP appliance (“virtual router”) in every Pod that serves DHCP and DNS to the VMs in the pod. The same appliance also serves as the userdata server and password change service. No guest traffic flows through the appliance. All traffic between VMs goes entirely over the physical infrastructure (leaf and spine routers). No network virtualization overhead is incurred. Broadcast storms, STP configurations, VLANs — all the traditional bugbears of a datacenter network are virtually eliminated.

When the physical layer of the datacenter network is architected right, Basic Zone provides tremendous scale and ease-of-use:

  1. Location-independent high bandwidth between any pair of VMs
  2. Elimination of expensive, bandwidth-sucking, latency-inducing security appliances
  3. Easy security configuration by end-users
  4. Elimination of VLAN-configuration friction
  5. Proven scale : tens of thousands of hypervisors
  6. Egress firewalls provide security for the legacy / non-cloud portions of the datacenter.
  7. The ideal architecture for your micro-services based applications, without the network virtualization overhead

CloudStack Basic Networking : deeper dive

In my last post I sang the praises of the simplicity of Basic Networking. There are a few more details that even seasoned users of CloudStack may not be aware of:

  1. Security group rules are stateful. This means active connections enabled by the rules are tracked so that traffic can flow bidirectionally. Although UDP and ICMP are connectionless protocols, their “connection” is defined by the tuple of endpoint addresses and ports (or ICMP type and id). Stateful tracking also has the somewhat surprising property that if you remove a rule, the existing connections enabled by that rule continue to exist until closed by either end of the connection. This is identical to AWS security group behavior.
  2. Security group rules can allow access to VMs from other accounts: Suppose you have a shared monitoring service across accounts. The VMs in the monitoring service can belong to the cloud operator. Other tenants can allow access to them:
    • > authorize securitygroupingress securitygroupname=web account=operator usersecuritygrouplist=nagios,cacti protocol=tcp startport=12489 ...
  3. There is always a default security group: Just like EC2-classic, if you don’t place a VM in a security group, it gets placed in the default security group. Each account has its own default security group.
  4. Security group rules work between availability zones:  Security groups in an account are common across a region (multiple availability zones). Therefore, if the availability zones are routable (without NAT) to each other then the security groups work just as well between zones. This is similar to AWS EC2-classic security groups.
  5. Subnets are shared between accounts, and VMs in the same security group may not share a subnet. Although tenants cannot create or choose subnets in Basic networking, their VMs are placed in subnets (“Pods”) predefined by the cloud operator, so VMs belonging to two accounts can end up spread across two subnets.
  6. BUM traffic is silently dropped. Broadcast and multicast traffic is dropped at the VM egress to avoid attacks on other tenants in the same subnet. VMs cannot spoof their mac address either: unicast traffic with the wrong source mac is dropped as well.
  7. Anti-spoofing protection. VMs cannot send ARP responses for IP addresses they do not own, nor can they spoof DHCP server responses. ARP is allowed only when the source MAC matches the VM’s assigned MAC. DHCP and DNS queries to the pod-local DHCP server are always allowed. If you run Wireshark/tcpdump within a VM, you cannot see your neighbors’ traffic even though your NIC is set to promiscuous mode.
  8. Multiple IP addresses per VM: Once the VM is started you can request an additional IP for the VM (use the addIptoNic API).
  9. Live migration of the VM works as expected: When the operator migrates a VM, the security group rules move with the VM. Existing connections may get dropped during the migration.
  10. High Availability: As with any CloudStack installation, High Availability (aka Fast Restart) works as expected. When the VM moves to a different host, the rules move along with the VM.
  11. Effortless scaling: The largest CloudStack clouds (tens of thousands of nodes) use Basic networking. Just add more management servers.
  12. Available LBaaS: You can use a Citrix Netscaler to provide load balancing as well as Global Server Load Balancing (GSLB)
  13. Available Static NAT: You can use a Citrix Netscaler to provide Static NAT from a “public” IP to the VM IP.

There are limitations however when you use Basic Zone:

  1. Security groups function is only available on Citrix XenServer and KVM
  2. You can’t mix Advanced Networks and Basic Networks in the same availability zone, unlike AWS EC2
  3. You can’t add/remove security groups to a VM after it has been created. This is the same as EC2-classic
  4. No VPN functions are available.

The best way to deploy your Basic Zone is to engineer your physical network according to the same principles as web-scale operators. Read on

Back to Basics: CloudStack Basic Networking

The first choice to make when creating a zone in Apache CloudStack is the network type: basic or advanced. The blurb for “Advanced” promises “sophisticated network topologies”, while Basic promises “AWS-style networking”. Those who cut their teeth on the AWS cloud in 2008 may fondly remember what AWS now calls the “EC2-Classic platform”.

Platform     | Introduced in                      | Description
EC2-Classic  | The original release of Amazon EC2 | Your instances run in a single, flat network that you share with other customers.
EC2-VPC      | The original release of Amazon VPC | Your instances run in a virtual private cloud (VPC) that’s logically isolated to your AWS account.

With a few differences, these map to CloudStack’s “Basic” and “Advanced” networking.

The fundamental network isolation technique in Basic zone is security groups. By default all network access to a VM is denied. When you launch an instance (VM), you deploy it in one or more security groups.

Using cloudmonkey:

> deploy virtualmachine securitygroupnames=management,web displayname=web0001 templateid=7464f3a6-ec56-4893-ac51-d120a71049dd serviceofferingid=48f813b7-2061-4270-93b2-c873a0fac336 zoneid=c78c2018-7181-4c7b-ab08-57204bc2eed3

Of course you have to create the security groups first:

> create securitygroup name=web
> create securitygroup name=management

Security groups are containers for firewall rules.

> authorize securitygroupingress securitygroupname=web protocol=tcp startport=80 endport=80 cidrlist=0.0.0.0/0
> authorize securitygroupingress securitygroupname=management protocol=tcp startport=22 endport=22 cidrlist=<admin subnet CIDR>

In a Basic Zone, all network access to a VM is denied by default. These two rules allow access to our VM on the HTTP port (80) from anywhere and on the SSH port (22) only from computers in the admin subnet (substitute your own CIDR).

Let’s start another web VM with these security groups

> deploy virtualmachine securitygroupnames=management,web displayname=web0002 ...

We can log in to web0002 over ssh when our ssh client is in the admin subnet. But when we try to log in to web0001 from web0002 over ssh, we get denied, since neither of the ingress rules we wrote above allows that. We can fix that:

> authorize securitygroupingress securitygroupname=management protocol=tcp startport=22 endport=22 usersecuritygrouplist=management

As long as the ssh client is on a VM in the management security group, ssh access is allowed to any other VM in the management security group.

Let’s create some more:

Create appserver group and a db group

> create securitygroup name=appserver
> create securitygroup name=db

Let’s add these rules: allow web VMs access to the app servers on port 8080, and allow app servers access to the DB VMs on the MySQL port (3306).

> authorize securitygroupingress securitygroupname=appserver protocol=tcp startport=8080 endport=8080 usersecuritygrouplist=web
> authorize securitygroupingress securitygroupname=db protocol=tcp startport=3306 endport=3306 usersecuritygrouplist=appserver

Deploy some virtual machines (instances) in these groups….

> deploy virtualmachine securitygroupnames=management,appserver displayname=app0001 ...
> deploy virtualmachine securitygroupnames=management,appserver displayname=app0002 ...
> deploy virtualmachine securitygroupnames=management,db displayname=db0001 ...

The network security architecture now looks like this:


Pretty complicated, with just a handful of rules. The beauty of it is that it captures the intent accurately. After all, as a network admin you want to say exactly: “allow app VMs access to the DB VMs on tcp port 3306”.

In a traditional network, you’d create subnets and insert security devices between them. On the security device you would have entered complicated ACLs, and the ACLs might have to change every time you created or destroyed VMs. In a Basic Zone, once you define the groups and rules, everything is taken care of automatically. You can even edit the rules after the VMs are running. Let’s allow ICMP pings to all the VMs from our management subnet:

> authorize securitygroupingress securitygroupname=management protocol=icmp icmptype=-1 icmpcode=-1

To do the inverse:

> revoke securitygroupingress securitygroupname=management protocol=icmp ...

A significant difference from EC2-classic is that CloudStack allows you to create egress rules as well:

> authorize securitygroupegress securitygroupname=management protocol=icmp ...

This controls traffic out of the VMs. Egress filtering only takes effect once the first egress rule is added to a security group. From that point on, egress defaults to ‘deny all’, and only the specific egress rules allow traffic out of the VM.
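This "first rule flips the default" behavior can be modeled in a few lines. A toy sketch, not the CloudStack implementation; the `SecurityGroup` class is an assumption for the example:

```python
# With no egress rules everything is allowed out; adding the first egress
# rule flips the default to deny.

class SecurityGroup:
    def __init__(self):
        self.egress_rules = []

    def egress_allowed(self, proto, port):
        if not self.egress_rules:      # no egress rules yet: allow all
            return True
        return (proto, port) in self.egress_rules

sg = SecurityGroup()
assert sg.egress_allowed("tcp", 25)    # anything can leave the VM
sg.egress_rules.append(("tcp", 443))   # first egress rule added
assert sg.egress_allowed("tcp", 443)
assert not sg.egress_allowed("tcp", 25)  # default is now 'deny all'
```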

Security group rules are stateful. This means that you don’t have to define a corresponding egress rule for every ingress rule. For example, when someone from the internet connects on port 80 to a web VM, response traffic (out of the web VM) associated with that connection is automatically allowed. Stateful connection tracking also applies to stateless protocols such as UDP and ICMP.
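The stateful behavior just described can be sketched as a toy connection tracker (illustrative only, not netfilter/conntrack): an ingress rule admits the first packet and records the connection tuple, so the reply direction is allowed without any egress rule.

```python
conntrack = set()
ingress_rules = {("tcp", 80)}                  # allow tcp/80 in

def ingress_allowed(proto, src, sport, dst, dport):
    if (proto, dport) in ingress_rules:
        conntrack.add((proto, src, sport, dst, dport))  # track the connection
        return True
    return False

def reply_allowed(proto, src, sport, dst, dport):
    # Reply traffic matches a tracked connection with the endpoints swapped.
    return (proto, dst, dport, src, sport) in conntrack

assert ingress_allowed("tcp", "1.2.3.4", 50000, "10.0.0.5", 80)
assert reply_allowed("tcp", "10.0.0.5", 80, "1.2.3.4", 50000)
assert not reply_allowed("tcp", "10.0.0.5", 80, "9.9.9.9", 1234)
```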

While ‘Basic’ Zone might seem well, basic, it offers powerful network isolation techniques that directly map your intent. The simple interface actually masks a sophisticated implementation which I hope to describe in a future post. I hope I have convinced you that ‘Basic’ is indeed sophisticated!

Read on for a deeper dive.

99 problems in my private cloud and networking is most of them

The state of private cloud is dire, according to a number of pundits. Twitter’s de facto cloud prognosticator warns: do not build private clouds. Matt Asay declares private cloud a failure for a number of reasons, including the failure to change the way enterprises do business:

Private cloud lets enterprises pretend to be innovative, embracing pseudo-cloud computing even as they dress up antiquated IT in fancy nomenclature. But when Gartner surveyed enterprises that had deployed private clouds, only 5% claimed success.

But he also lays blame on the most-hyped infrastructure technology of the past few years, OpenStack:

An increasing number of contributing companies are trying to steer OpenStack in highly divergent directions, making it hard for the newbie to figure out how to successfully use OpenStack. No wonder, then, that Joyent’s Bryan Cantrill hinted that the widespread failure of private clouds may be “due to OpenStack complexities.”

A large part of these complexities appears to be networking-related:

No wonder most touted OpenStack successes have bespoke network architectures:

  • @WalmartLabs says they have 100k cores running, but

SDN is going to be our next step. Network is one area we need to put a lot of effort into. When you grow horizontally, you add compute, and the network is kind of the bottleneck for everything. That’s an area where you want more redundancy

  • PayPal runs a large (8,500 servers) cloud, but uses VMware’s NVP for networking
  • CERN runs a large OpenStack cloud but uses a custom network driver

In a different article, Matt Asay even cites industry insiders to state that OpenStack’s “dirty little secret” is that it doesn’t scale, largely due to broken networking.

In fact, as I’ve heard from a range of companies, a dirty secret of OpenStack is that it starts to fall over and can’t scale past 30 nodes if you are running plain vanilla main trunk OpenStack software

Frustrated cloud operators might look at the newest darling on the block to solve their complexities: Docker. At least it has a single voice and the much-vaunted BDFL. Things should be better, right? Well, not yet. Hopes are high, but both networking and storage are pretty much “roll your own”. There are exotic options like Kubernetes (which pretty much only works in public clouds), SDN-like solutions (this, this, this, and more), and patchworks of proxies. Like the network operator needs yet another SDN solution rammed down her throat.

There is a common strand here: tone-deafness. Are folks thinking about how network operators really work? This lack of empathy sticks out like a sore thumb. If the solutions offered a genuine improvement to the state of networking, operators might take a chance on something new. Network operators hoping to emulate web-scale operators such as AWS, Google and Facebook face a daunting task as well: private cloud solutions often add gratuitous complexity and take away none.

My favorite cloud software, Apache CloudStack, is not immune to these problems. The out-of-the-box network configuration is often a suboptimal choice for private clouds. Scalable solutions such as Basic Networking are ignored because, well, who wants something “basic”? In future posts, I hope to outline how private cloud operators can architect their CloudStack networks for a better, more scalable experience.

How dual-speed IT impacts private cloud architecture

An intriguing insight / hypothesis from Gartner is that IT can be more successful when it clearly demarcates ‘agile’ IT from ‘traditional’ IT. According to Lydia Leong:

Traditional IT is focused on “doing IT right”, with a strong emphasis on efficiency and safety, approval-based governance and price-for-performance. Agile IT is focused on “doing IT fast”, supporting prototyping and iterative development, rapid delivery, continuous and process-based governance, and value to the business (being business-centric and close to the customer)

The idea is that “agile” IT is better served by cloud (either IaaS or PaaS), while traditional IT can stick to its knitting and do business as usual. At some point, agile IT figures out how to do ‘cloud’ right and helps the other gang adopt the cloud. Of course, there’s dissent: Simon Wardley argues for trimodal IT, with the middle group mediating the extremes.

Lydia goes on to argue that:

Bimodal IT also implies that hybrid IT is really simply the peaceful coexistence of non-cloud and cloud application components — not the idea that it’s one set of management tools that sit on top of all environments.

Non-cloud application components are (my guess here) the domain of traditional IT, cloud application components are the domain of agile IT. The dichotomy also argues for 2 types of infrastructure: cloud and non-cloud.

A somewhat unrelated insight comes from Geoffrey Moore: there are two kinds of IT systems, Systems of Record (“Enterprise IT 1.0”) and Systems of Engagement (“the next stage of IT”). Systems of Record are:

global information systems that capture every dimension of our commercial landscape, from financial transactions to human resources to order processing to inventory management to customer relationship management to supply chain management to product lifecycle management, and on and on

Systems of engagement by contrast:

the focus instead will be on empowering the middle of the enterprise to communicate and collaborate across business boundaries, global time zones, and language and culture barriers, using next-generation IT applications and infrastructure adapted from the consumer space.

Systems of Record are the cost of doing business. They need to be highly optimized, low risk, and rock solid, and they rely on processes such as Six Sigma to deliver the quality and efficiency demanded by the business. It is unlikely that these will be moved into the cloud in the near future.

The hypothesis (mine) here is that the systems of record are hosted on traditional IT / non-cloud infrastructure and private/public cloud hosts the systems of engagement.

Obviously, the newer systems of engagement, whether deployed on private or public clouds, may need access to the data held by the systems of record.

If you have a private cloud for agile/systems of engagement, then the interaction looks like this:

If you use a public cloud for your systems of engagement, then it looks like:


Yet another way to look at it might be the “pets vs. cattle” dichotomy.


Public clouds make this interconnection “easy” by providing the required infrastructure. For example, AWS provides VPN Gateway and AWS Direct Connect. These facilities allow applications hosted on instances in the AWS cloud to access resources that are “on-prem” (and vice versa).

Theoretically, the interconnect should be dead simple in the private cloud case. After all, both parts of the infrastructure are hosted on the same local network infrastructure, presumably in a single administrative domain. Complications can arise from:

  1. Business needs
  2. Artifacts of the private cloud implementation.

First, the business needs: integrating systems of record and systems of engagement often involves crossing security boundaries. The former is guarded like Fort Knox; the latter has more fluid requirements. So the solution might involve, for example, inserting security devices in the path.


The challenge is that the system on the right is extremely fluid: the network is constantly being reconfigured. Each change on the right might require changes in the security devices. The required level of network automation (to automate the security policy) is an unseen cost of implementing this architecture.
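To make that unseen cost concrete, here is a minimal sketch of the reconciliation loop such automation implies. All names and rule shapes are my own illustration, not any vendor's API: every time the fluid network on the right changes, a desired rule set is recomputed from the cloud's inventory and diffed against what the security device currently allows.

```python
# Hypothetical sketch of security-policy reconciliation; rule tuples are
# (source CIDR, destination IP, destination port) and are illustrative only.

def reconcile(current_rules, desired_rules):
    """Compute which firewall rules to add and which to remove so the
    security device matches the desired state derived from the cloud."""
    to_add = desired_rules - current_rules
    to_remove = current_rules - desired_rules
    return to_add, to_remove

# Desired state derived from the (fluid) systems-of-engagement network:
desired = {
    ("10.1.1.0/24", "10.0.0.5", 1521),  # existing app subnet -> system-of-record DB
    ("10.1.2.0/24", "10.0.0.5", 1521),  # subnet created a moment ago
}
# What the security device currently allows:
current = {
    ("10.1.1.0/24", "10.0.0.5", 1521),
    ("10.9.9.0/24", "10.0.0.5", 1521),  # stale rule for a deleted network
}

add, remove = reconcile(current, desired)
print(sorted(add))     # rule for the new subnet
print(sorted(remove))  # stale rule to revoke
```

The diff itself is trivial; the real cost is running this loop continuously and pushing the changes to physical security devices that were never designed for that churn.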

Private cloud networking brings its own complexities: it is often the most challenging part of implementing a private cloud. While the private cloud software stack might provide a solution that works within the cloud, it won’t provide a solution for the security policy automation problem mentioned above.

Bimodal IT is an interesting idea, but it can lead to ‘gaps’ between the modes, including in the infrastructure domain. In a future post I hope to convince you that Apache CloudStack has some tricks up its sleeve to solve some of these problems.

How did they build that — EC2 Enhanced Networking

Among the flurry of new features introduced by AWS in 2013 is a performance enhancement known as ‘Enhanced Networking’. According to the blurb: “enhanced networking on your instance results in higher performance (packets per second), lower latency, and lower jitter”. The requirements are that you install an Intel 10GbE driver (ixgbevf) in your instance and enable a feature called SR-IOV.

The AWS cloud is built around virtualization technology — specifically your instances are virtual machines running on top of a version of the open source Xen Hypervisor.
The hypervisor is what guarantees the isolation between my instance and your instance when they both run on the same set of CPUs.

The hypervisor intercepts all I/O from the virtual machine so that the virtual machine is abstracted from the hardware — this provides security as well as portability, since the VM doesn’t need to care about the drivers for the I/O hardware. The VM sees a NIC that is software defined, and as a result the hypervisor can inspect all traffic to and from the VM. This allows AWS to control the networking traffic between the VM and the rest of the infrastructure, and it is used to deliver features such as security groups and ACLs.
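The mechanics are easy to picture: a hypervisor that sees every packet can simply match each one against a tenant's allow rules before forwarding it. The sketch below is purely illustrative (not AWS's actual implementation, and the prefix match is deliberately crude):

```python
# Illustrative sketch: software enforcement of security-group-style rules
# by something that sits in the packet path, as a hypervisor does.

def allowed(packet, rules):
    """Return True if any allow rule permits this packet."""
    return any(
        packet["proto"] == r["proto"]
        and packet["dport"] == r["port"]
        and packet["src"].startswith(r["src_prefix"])  # crude stand-in for a CIDR match
        for r in rules
    )

# One allow rule: SSH from the 10.0.0.0/16 range.
rules = [{"proto": "tcp", "port": 22, "src_prefix": "10.0."}]

print(allowed({"proto": "tcp", "dport": 22, "src": "10.0.1.7"}, rules))  # True
print(allowed({"proto": "tcp", "dport": 80, "src": "10.0.1.7"}, rules))  # False
```

Every packet pays for this lookup, which is exactly why the next paragraph's overhead numbers add up.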

The downside of processing all network traffic to/from the VM is that host CPU cycles are consumed processing this traffic. This is a significant overhead compared to a bare-metal instance: the hypervisor needs to apply stateful firewall rules on every packet, switch the packet, and encapsulate it. Some estimates put this overhead as high as 70% of the CPU available to the hypervisor (at 10 Gb/s). Software processing also introduces noisy-neighbor problems — variable jitter and high latency at 10 Gb/s are common.
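A back-of-envelope calculation shows why. The per-packet cycle count and clock speed below are my own assumptions, not measured AWS figures, but they give a feel for the scale:

```python
# Rough estimate of the CPU cost of software packet processing at 10 Gb/s.
# cycles_per_packet and cpu_hz are assumed numbers for illustration.

line_rate_bps = 10e9
frame_bits = 1500 * 8                 # assume full-size 1500-byte frames
pps = line_rate_bps / frame_bits      # ~833,000 packets per second

cycles_per_packet = 2000              # firewall lookup + switching + encapsulation
cpu_hz = 2.4e9                        # one 2.4 GHz core

cores_needed = pps * cycles_per_packet / cpu_hz
print(f"{pps:,.0f} packets/s -> {cores_needed:.2f} cores just for networking")
```

Even under these generous assumptions (full-size frames), networking eats most of a core; with minimum-size frames the packet rate is roughly 20x higher, which is where estimates like "70% of the hypervisor's CPU" come from.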


Fortunately, SR-IOV (Single Root I/O Virtualization) provides a direct path for the VM to access the underlying hardware NIC. Bypassing the hypervisor leads to line-rate performance. Enhanced Networking takes advantage of this; to benefit from it, your AMI needs to have the SR-IOV drivers installed.


Great — but now that the hypervisor is out of the path, how does AWS provide software-defined features such as security groups and ACLs? The current generation of SR-IOV NICs (AWS uses the Intel 82599) do not have stateful firewalls or the ability to process large numbers of ACLs. Furthermore, we know that AWS must be using some kind of encapsulation / tunnelling so that VPCs are possible. The Intel 82599 does not provide encapsulation support.
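To see what that encapsulation work looks like per packet, here is a sketch that builds a VXLAN-format header (RFC 7348). I have no idea what encapsulation AWS actually uses for VPC; VXLAN is chosen purely as a representative overlay format:

```python
import struct

# Per-packet overlay work: prepend an 8-byte VXLAN header carrying a 24-bit
# Virtual Network Identifier (VNI) that keeps tenants' traffic separate.
# (Representative only; not a claim about AWS's actual VPC encapsulation.)

def vxlan_encap(vni, inner_frame):
    """Prepend a VXLAN header; the result would then be wrapped in UDP/IP."""
    flags = 0x08 << 24                        # 'I' flag set: VNI field is valid
    header = struct.pack("!II", flags, vni << 8)  # VNI occupies the top 24 bits
    return header + inner_frame

pkt = vxlan_encap(vni=5001, inner_frame=b"\x00" * 60)  # minimal Ethernet frame
print(len(pkt))  # 68: an 8-byte overlay header added to every single packet
```

Something has to build (and strip, and look up) this header for every packet, and the 82599 won't do it, hence the search for where else the processing can live.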

The solution then would be to do the extra processing elsewhere — either off the host, or in the host using a co-processor. This schematic shows processing happening at the TOR switch. The drawback is that even intra-host traffic has to be tromboned via the TOR. Furthermore, the switch now becomes a pretty big bottleneck, and a failure in the switch could lead to several hosts losing network connectivity.


Using a co-processor would be the best solution. Tilera is one such processor that comes to mind. Since Tilera provides general-purpose processing cores, the encap/decap/filtering/stateful-firewall processing could be done in software instead of ASICs or FPGAs.



The software/hardware solution could allow AWS to introduce further innovations in its networking portfolio, including end-to-end encryption, IDS and IPS.

Disclaimer: I have no knowledge of AWS internals. This is just an exploration of “how did they build it?”.