
CloudStack Basic Networking: deeper dive

In my last post I sang the praises of the simplicity of Basic Networking. There are a few more details that even seasoned users of CloudStack may not be aware of:

  1. Security group rules are stateful. This means active connections enabled by the rules are tracked, so that traffic can flow bidirectionally. Although UDP and ICMP are connectionless protocols, their "connection" is defined by a tuple (source and destination addresses plus ports, or the ICMP type and id). Statefulness also has the somewhat surprising property that if you remove a rule, existing connections enabled by that rule continue to exist until closed by either end of the connection. This is identical to AWS security groups behavior.
  2. Security group rules can allow access to VMs from other accounts: Suppose you have a shared monitoring service across accounts. The VMs in the monitoring service can belong to the cloud operator. Other tenants can allow access to them:
    • > authorize securitygroupingress securitygroupname=web account=operator usersecuritygrouplist=nagios,cacti protocol=tcp startport=12489 ...
  3. There is always a default security group: Just like EC2-classic, if you don’t place a VM in a security group, it gets placed in the default security group. Each account has its own default security group.
  4. Security group rules work between availability zones: Security groups in an account are common across a region (multiple availability zones). Therefore, if the availability zones are routable to each other (without NAT), the security groups work just as well between zones. This is similar to AWS EC2-classic security groups.
  5. Subnets are shared between accounts, so VMs in a security group may not share a subnet. Although tenants cannot create or choose subnets in Basic networking, their VMs are placed in subnets ("pods") predefined by the cloud operator. The table below shows a sample of VMs belonging to two accounts spread across two subnets.
    • [Table: VMs from the two accounts placed across the two subnets]
  6. BUM (broadcast, unknown unicast, multicast) traffic is silently dropped. Broadcast and multicast traffic is dropped at the VM egress to avoid attacks on other tenants in the same subnet. VMs cannot spoof their MAC address either: unicast traffic with the wrong source MAC is dropped as well.
  7. Anti-spoofing protection. VMs cannot spoof their MAC address, cannot send ARP responses for IP addresses they do not own, and cannot spoof DHCP server responses. ARP is allowed only when the source MAC matches the VM's assigned MAC. DHCP and DNS queries to the pod-local DHCP server are always allowed. If you run Wireshark/tcpdump inside the VM, you cannot see your neighbors' traffic even though your NIC is set to promiscuous mode.
  8. Multiple IP addresses per VM: Once the VM is started you can request an additional IP for the VM using the addIpToNic API (see the sketch after this list).
  9. Live migration of the VM works as expected: When the operator migrates a VM, the security group rules move with the VM. Existing connections may get dropped during the migration.
  10. High Availability: As with any CloudStack installation, High Availability (aka Fast Restart) works as expected. When the VM moves to a different host, the rules move along with the VM.
  11. Effortless scaling: The largest CloudStack clouds (tens of thousands of nodes) use Basic networking. Just add more management servers.
  12. Available LBaaS: You can use a Citrix NetScaler to provide load balancing as well as Global Server Load Balancing (GSLB).
  13. Available Static NAT: You can use a Citrix NetScaler to provide Static NAT from a "public" IP to the VM IP.
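As a concrete illustration of item 8, here is a rough sketch of calling the addIpToNic API with plain Ruby. The endpoint, keys and NIC id are placeholders, and the request-signing steps paraphrase the CloudStack API documentation; treat this as a sketch, not production code.

require 'openssl'
require 'base64'
require 'cgi'
require 'net/http'
require 'json'

API_URL    = 'http://mgmt.example.com:8080/client/api'   # placeholder endpoint
API_KEY    = ENV['CS_API_KEY']
SECRET_KEY = ENV['CS_SECRET_KEY']

# Sign and issue a CloudStack API request (HMAC-SHA1 over the sorted, lowercased query).
def cloudstack_call(params)
  params = params.merge('apiKey' => API_KEY, 'response' => 'json')

  base = params.map { |k, v| "#{k}=#{CGI.escape(v.to_s).gsub('+', '%20')}" }
               .sort.join('&').downcase
  signature = Base64.strict_encode64(OpenSSL::HMAC.digest('sha1', SECRET_KEY, base))

  query = params.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
  JSON.parse(Net::HTTP.get(URI("#{API_URL}?#{query}&signature=#{CGI.escape(signature)}")))
end

# Request a second IP for an existing NIC (the nicid is a placeholder UUID).
puts cloudstack_call('command' => 'addIpToNic',
                     'nicid'   => '11111111-2222-3333-4444-555555555555')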

There are limitations however when you use Basic Zone:

  1. The security groups function is only available with the Citrix XenServer and KVM hypervisors.
  2. You can't mix Advanced Networks and Basic Networks in the same availability zone, unlike AWS EC2.
  3. You can't add or remove security groups on a VM after it has been created. This is the same as EC2-classic.
  4. No VPN functions are available.

The best way to deploy your Basic Zone is to engineer your physical network according to the same principles as web-scale operators. Read on


How did they build that — EC2 Enhanced Networking

Among the flurry of new features introduced by AWS in 2013 is a performance enhancement known as 'Enhanced Networking'. According to the blurb: "enhanced networking on your instance results in higher performance (packets per second), lower latency, and lower jitter". The requirements are that you install an Intel 10GbE driver (ixgbevf) in your instance and enable a feature called SR-IOV.
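For reference, the user-facing side of that requirement looks roughly like this with the Ruby aws-sdk gem (a newer SDK than existed when the feature launched; the instance id and region are placeholders, the instance must be stopped, and the AMI must already carry the ixgbevf driver):

require 'aws-sdk-ec2'   # assumes aws-sdk v2 or later

ec2 = Aws::EC2::Client.new(region: 'us-east-1')

# Mark the (stopped) HVM instance as SR-IOV capable; 'simple' is the supported value.
ec2.modify_instance_attribute(
  instance_id: 'i-0123456789abcdef0',            # placeholder
  sriov_net_support: { value: 'simple' }
)

# Verify the attribute; the ixgbevf driver inside the guest does the rest.
resp = ec2.describe_instance_attribute(
  instance_id: 'i-0123456789abcdef0',
  attribute: 'sriovNetSupport'
)
puts resp.sriov_net_support.value                # => "simple"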

The AWS cloud is built around virtualization technology — specifically your instances are virtual machines running on top of a version of the open source Xen Hypervisor.
The hypervisor is what guarantees the isolation between my instance and your instance when they both run on the same set of CPUs.

The hypervisor intercepts all I/O from the virtual machine so that the virtual machine is abstracted from the hardware — this provides security as well as portability, since the VM doesn't need to care about drivers for the I/O hardware. The VM sees a NIC that is software defined, and as a result the hypervisor can inspect all traffic to and from the VM. This allows AWS to control the networking traffic between the VM and the rest of the infrastructure, and is used to deliver features such as security groups and ACLs.

The downside of processing all network traffic to/from the VM is that host CPU cycles are consumed processing this traffic. This is a significant overhead compared to a bare-metal instance: the hypervisor needs to apply stateful firewall rules to every packet, switch the packet and encapsulate it. Some estimates put this overhead as high as 70% of the CPU available to the hypervisor (at 10 Gb/s). Software processing also introduces noisy-neighbor problems — variable jitter and high latency at 10 Gb/s are common.


Fortunately, SR-IOV (Single Root I/O Virtualization) provides a direct path for the VM to access the underlying hardware NIC. Bypassing the hypervisor leads to line-rate performance. Enhanced Networking takes advantage of this: to benefit from it, your AMI needs to have the SR-IOV drivers installed.


Great — but now that the hypervisor is out of the path, how does AWS provide software-defined features such as security groups and ACLs? The current generation of SR-IOV NICs (AWS uses the Intel 82599) does not have stateful firewalls or the ability to process large numbers of ACLs. Furthermore, we know that AWS must be using some kind of encapsulation / tunnelling so that VPCs are possible. The Intel 82599 does not provide encapsulation support.

The solution, then, would be to do the extra processing elsewhere — either off the host, or in the host using a co-processor. One option is to do the processing at the TOR switch. The drawback is that even intra-host traffic has to be tromboned via the TOR. Furthermore, the switch now becomes a pretty big bottleneck, and a failure in the switch could lead to several hosts losing network connectivity.

 

Using a co-processor would be the best solution. Tilera is one such processor that comes to mind. Since Tilera provides general-purpose processing cores, the encap/decap/filtering/stateful-firewall processing could be done in software instead of ASICs or FPGAs.


 

The software/hardware solution could allow AWS to introduce further innovations in its networking portfolio, including end-to-end encryption, IDS and IPS.

Disclaimer: I have no knowledge of AWS internals. This is just an exploration of “how did they build it?”.

Update: a confirmation of sorts on Werner Vogels' blog: http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

The SDN behemoth hiding in plain sight

Hint: it is Amazon Web Services (AWS). Let’s see it in action:

Create a VPC with 2 tiers: one public (10.0.0.0/24)  and one private (10.0.1.0/24). These are connected via a router. Spin up 2 instances, one in each tier (10.0.0.33 and 10.0.1.168).

[ec2-user@ip-10-0-0-33 ~]$ ping 10.0.1.168 -c 2
PING 10.0.1.168 (10.0.1.168) 56(84) bytes of data.
64 bytes from 10.0.1.168: icmp_seq=1 ttl=64 time=1.35 ms
64 bytes from 10.0.1.168: icmp_seq=2 ttl=64 time=0.412 ms

The sharp-eyed might have noticed an oddity: the hop count (TTL) does not decrement despite the presence of a routing hop between the 2 subnets. So clearly it isn't a commercial router or any standard networking stack. AWS calls it an 'implied router'. What is likely happening is that the VPC is realized by the creation of overlays. When the Ethernet frame (the ping packet) exits 10.0.0.33, the virtual switch on the hypervisor sends the frame, inside the overlay, directly to the hypervisor that is running 10.0.1.168. The vswitches do not bother to decrement the TTL, since that would force an expensive recomputation of the checksum in the IP header. Note that AWS introduced this feature in 2009 — well before Open vSwitch even had its first release.

One could also argue that security groups and Elastic IPs at the scale of AWS's datacenters also bear the hallmarks of Software Defined Networking: clearly it required AWS to innovate beyond standard vendor gear to provide such to-date-unmatched L3 services. And these date back to the early days of AWS (2007 and 2008).

It doesn’t stop there. Elastic Load Balancing (ELB) from AWS orchestrates virtual load balancers across availability zones — providing L7 software defined networking. And that’s from 2009.

Last month’s ONS 2013 conference saw Google and Microsoft (among others) presenting facts and figures about the use of Software Defined Networking in their data centers. Given the far-and-away leadership of AWS in the cloud computing infrastructure space, it is interesting (to me) that AWS is seldom mentioned in the same breath as the pixie dust du jour “SDN”.

In CloudStack's native overlay controller, implied routers have been on the wish list for some time. CloudStack also has a scalable implementation of security groups (although scaling to hundreds of thousands of hypervisors might take another round of engineering), and it uses virtual appliances to provide L4-L7 services such as site-to-site IPSec VPN and load balancing. While the virtual networking feature set in CloudStack is close to that of AWS, the AWS implementation is likely an order of magnitude bigger.

Stackmate: execute CloudFormation templates on CloudStack

AWS CloudFormation provides a simple-yet-powerful way to create ‘stacks’ of Cloud resources with a single call. The stack is described in a parameterized template file; creation of the stack is a simple matter of providing stack parameters. The template includes description of resources such as instances and security groups and provides a language to describe the ordering dependencies between the resources.

CloudStack doesn’t have any such tool (although it has been discussed). I was interested in exploring what it takes to provide stack creation services to a CloudStack deployment. As I read through various sample templates, it was clear that the structure of the template imposed an ordering of resources. For example, an ‘Instance’ resource might refer to a ‘SecurityGroup’ resource — this means that the security group has to be created successfully first before the instance can be created. Parsing the LAMP_Single_Instance.template for example, the following dependencies emerge:

WebServer depends on ["WebServerSecurityGroup", "WaitHandle"]
WaitHandle depends on []
WaitCondition depends on ["WaitHandle", "WebServer"]
WebServerSecurityGroup depends on []
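For the curious, here is a rough sketch (not Stackmate's actual code) of how such a dependency map can be derived: walk each resource's properties collecting Ref and Fn::GetAtt targets, and merge in any explicit DependsOn entries.

require 'json'

# Recursively collect the names of resources referenced via Ref or Fn::GetAtt.
def refs_in(node, resource_names)
  case node
  when Hash
    found = node.flat_map { |_k, v| refs_in(v, resource_names) }
    found << node['Ref'] if resource_names.include?(node['Ref'])
    found << node['Fn::GetAtt'][0] if node['Fn::GetAtt'].is_a?(Array)
    found
  when Array
    node.flat_map { |v| refs_in(v, resource_names) }
  else
    []
  end
end

template  = JSON.parse(File.read('LAMP_Single_Instance.template'))
resources = template['Resources']
names     = resources.keys

# resource name => resources it must wait for (explicit DependsOn plus implicit Refs)
deps = resources.each_with_object({}) do |(name, body), acc|
  acc[name] = (Array(body['DependsOn']) + refs_in(body['Properties'], names)).uniq
end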

This can be expressed as a Directed Acyclic Graph — what remains is to extract an ordering by performing a topological sort of the DAG. Once sorted, we need an execution engine that can take the schedule and execute it. Fortunately for me, Ruby has both: the TSort module performs topological sorts, and the wonderful Ruote workflow engine by @jmettraux can execute the schedule. Given the topological sort produced by TSort:

["WebServerSecurityGroup", "WaitHandle", "WebServer", "WaitCondition"]

You can write a process definition in Ruote:

Ruote.define 'my_stack' do
  sequence do
    participant 'WebServerSecurityGroup'
    participant 'WaitHandle'
    participant 'WebServer'
    participant 'WaitCondition'
  end
end

What remains is to implement the 'participants' inside the process definition. For the most part it means making API calls to CloudStack to create the security group and instance. Here, the freshly minted CloudStack Ruby client from @chipchilders came in handy.
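To give a flavor of what a participant looks like, here is a sketch with hypothetical names (the CloudStack call is stubbed out rather than tied to a particular client library):

require 'ruote'

# A participant that would create a security group when the workflow reaches it.
class SecurityGroupParticipant
  include Ruote::LocalParticipant

  def consume(workitem)
    # In Stackmate this is where the real CloudStack API call would go,
    # e.g. createSecurityGroup with the name carried in the workitem fields.
    puts "would call createSecurityGroup name=#{workitem.fields['name']}"
    reply_to_engine(workitem)   # hand the workitem back so the flow continues
  end

  def cancel(fei, flavour)
    # invoked if the process is cancelled or rolled back
  end
end

# Registration against a Ruote engine would look something like:
#   dashboard = Ruote::Dashboard.new(Ruote::Worker.new(Ruote::HashStorage.new))
#   dashboard.register_participant 'WebServerSecurityGroup', SecurityGroupParticipant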

Stackmate is the result of this investigation — satisfyingly, it is just 350-odd lines of Ruby.

Ruote gives a nice split between defining the flow and the actual work items. We can ask Ruote to roll back (cancel) a process that has launched but not finished. We can create resources concurrently instead of in sequence (see the sketch below). There are a lot more workflow patterns here. The best part is that writing the participants is relatively trivial — just pick the right CloudStack API call to make.
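For instance, since WebServerSecurityGroup and WaitHandle have no dependencies on each other, the definition could create them in parallel; a sketch using Ruote's concurrence expression:

Ruote.define 'my_stack_concurrent' do
  concurrence do
    # both have empty dependency lists, so they can be created in parallel
    participant 'WebServerSecurityGroup'
    participant 'WaitHandle'
  end
  participant 'WebServer'       # runs after the concurrence completes
  participant 'WaitCondition'
end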

While prototyping the design, I had to make a LOT of instance creation calls to my CloudStack installation — since I don't have a ginormous cloud in my back pocket, the excellent CloudStack simulator filled the role.

Next Steps

  • As it stands today, stackmate is executed on the command line and the workflow executes on the client side (the server being CloudStack). This mode is good for CloudStack developers performing a pre-checkin test, or for QA developers building automated tests. For a production CloudStack, however, stackmate needs to be a web service and provide a user interface to launch CloudFormation templates.
  • TSort generates a topologically sorted sequence; this can be further optimized by executing some steps in parallel.
  • There are more participants to be written to implement templates with VPC resources.
  • Implement rollback and timeout

Advanced

Given Ruote's power, Ruby's flexibility, and the generality of CloudFormation templates:

  • We should be able to write CloudStack-specific templates (e.g., to take care of stuff like network offerings)
  • We should be able to execute AWS templates on clouds like Google Compute Engine
  • QA automation suddenly becomes a matter of writing templates rather than error-prone API call sequences
  • Templates can include custom resources such as 3rd party services: for example, after launching an instance, make an API call to a monitoring service to start monitoring port 80 on the instance, or for QA automation: make a call to a testing service
  • Even more general-purpose complex workflows: can we add approval workflows, exception workflows, and so on? For example, a manager has to approve before the stack can be launched. Or, if the launch fails due to resource limits, trigger an approval workflow asking the manager to temporarily bump up the resource limits.