Category Archives: sdn

Automated configuration of NetScaler Loadbalancer for Kubernetes, Mesos and Docker Swarm

There are an incredible number of cluster managers for containerized workloads. The top clustered container managers include Google’s Kubernetes, the Marathon framework on Apache Mesos and Docker Swarm. While these managers offer powerful scheduling and autonomic capabilities, integration with external load balancers is often left as an exercise for the user. Since load balancers are essential components of a horizontally scaled microservice architecture, this omission can impede the roll-out of your chosen container manager. Further, since a microservice architecture demands rapid deployment, any solution has to keep up with changes in the topology and structure of the microservice.

Citrix NetScaler is an application delivery controller widely used for load balancing at several web-scale companies. This blog post describes Nitrox, a containerized application that can work with Docker Swarm, Kubernetes or Marathon (on Apache Mesos) to automatically reconfigure a Citrix NetScaler instance in response to application events such as deployment, rolling upgrades and automatic or manual scaling.

The figure below shows a scaling event that causes the number of backend containers for application α to grow from 4 to 6. The endpoints of the additional containers have to be provisioned into the NetScaler as a result.

[Figure: scale-out of application α from 4 to 6 backend containers]

All cluster managers offer an event stream of application lifecycle events. In this case, Docker Swarm sends a container start event; Marathon sends a status_update_event with a taskStatus field; and Kubernetes sends an endpoint update event. The job of Nitrox is to listen to these event streams and react to the events. In most cases this triggers a query back to the cluster manager to obtain the new list of endpoints. This list of endpoints is then configured into the NetScaler.
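To make the pattern concrete, here is a minimal Python sketch (emphatically not Nitrox’s actual code) of the listen-and-requery loop against Marathon; the Marathon address, the app id and the configure_netscaler() helper are hypothetical placeholders:

    # Sketch of the "listen, re-query, reconfigure" pattern using Marathon's APIs.
    # MARATHON_URL, APP_ID and configure_netscaler() are illustrative placeholders.
    import requests

    MARATHON_URL = "http://marathon.example.com:8080"
    APP_ID = "/web"

    def get_endpoints(app_id):
        """Ask Marathon for the current (host, port) endpoints of the app."""
        tasks = requests.get(MARATHON_URL + "/v2/apps" + app_id + "/tasks").json()["tasks"]
        return {(t["host"], t["ports"][0]) for t in tasks if t.get("ports")}

    def configure_netscaler(endpoints):
        """Placeholder: push the full endpoint set to the NetScaler (e.g. via NITRO)."""
        print("would configure endpoints:", endpoints)

    def watch():
        # Marathon publishes application lifecycle events as a Server-Sent-Events stream.
        stream = requests.get(MARATHON_URL + "/v2/events",
                              headers={"Accept": "text/event-stream"}, stream=True)
        for line in stream.iter_lines():
            if line.startswith(b"event:") and b"status_update_event" in line:
                # A task changed state: re-fetch the endpoint list and reconfigure.
                configure_netscaler(get_endpoints(APP_ID))

    if __name__ == "__main__":
        watch()

The Docker Swarm and Kubernetes variants follow the same shape; only the event names and the endpoint query differ.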

[Figure: Nitrox listening to cluster manager events and reconfiguring the NetScaler]

The reconfiguration of the NetScaler is idempotent and complete: if an endpoint already exists in the NetScaler configuration, it isn’t re-added, which prevents unnecessary reconfiguration. The set of endpoints sent to the NetScaler is not incremental: the entire set is sent each time. This overcomes any problem with missed or dropped events and makes the NetScaler configuration eventually consistent.
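A sketch of what “idempotent and complete” means in code, with the NetScaler-facing calls left as hypothetical stand-ins rather than real NITRO API calls:

    # Whole-set, idempotent synchronization: the full desired set is computed every
    # time, and only the differences result in configuration calls.
    def sync_endpoints(desired, currently_bound, bind, unbind):
        """desired / currently_bound: sets of (ip, port) tuples.
        bind / unbind: stand-in callables that reconfigure the NetScaler."""
        for endpoint in desired - currently_bound:
            bind(endpoint)        # add only the endpoints that are missing
        for endpoint in currently_bound - desired:
            unbind(endpoint)      # drop endpoints that no longer exist
        # Endpoints present in both sets are untouched; re-running the sync is a no-op.

Because the full set is recomputed on every event, a dropped event is corrected by the next one, which is what makes the configuration eventually consistent.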

Another design choice was to let the application / NetScaler admin provision the frontend details of the application. The frontend has myriad options such as lbmethod, persistence and stickiness. It is likely that each application has different needs for this configuration; it is also assumed to be chosen once and to be independent of the size and scope of the backend set.

You can find Nitrox code and instructions here. The container is available on Docker Hub as chiradeeptest/nitrox. The containerized Nitrox can be scheduled like a regular workload on each of the container managers: with docker run on Docker Swarm, as an app on Marathon and as a replication controller on Kubernetes.

Implementation Notes

Kubernetes (and, optionally, Docker Swarm) requires virtual networking (Kubernetes is usually used with flannel), so the container endpoints are endpoints on a virtual network. Since the NetScaler doesn’t participate in the virtual network (consider a non-virtualized NetScaler), this becomes a problem. For Docker Swarm, Nitrox assumes that bridged networking is used.

For Kubernetes, it is assumed that the service (app) being load balanced is configured to use the NodePort style of exposing the service to external access. Kubernetes chooses a random port and exposes it on every node in the cluster. Each node has a proxy that can provide access on this port, and the proxy load balances the ingress traffic to each backend pod (container). One strategy would be to simply configure the NetScaler to load balance to every node in the cluster. However, if the application has, say, 2 containers but the cluster has 50 nodes, the NetScaler would needlessly spread traffic across many nodes. To make this more efficient, Nitrox figures out the list of nodes that the containers are actually running on and provisions only those endpoints on the NetScaler.
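Here is a sketch of how that node discovery might look against the Kubernetes API; the API server address, namespace and label selector are assumptions for illustration, and an unauthenticated API endpoint is assumed for brevity:

    # Find the (node, nodePort) pairs that actually host an app's pods, instead of
    # provisioning every node in the cluster on the NetScaler.
    import requests

    API_SERVER = "http://kube-master.example.com:8080"   # hypothetical API server
    NAMESPACE = "default"
    APP_LABEL = "app=web"

    def nodeport_endpoints(service_name):
        svc = requests.get(API_SERVER + "/api/v1/namespaces/" + NAMESPACE +
                           "/services/" + service_name).json()
        node_port = svc["spec"]["ports"][0]["nodePort"]   # port exposed on every node

        pods = requests.get(API_SERVER + "/api/v1/namespaces/" + NAMESPACE + "/pods",
                            params={"labelSelector": APP_LABEL}).json()["items"]

        # Keep only the nodes that actually run a backend pod.
        hosts = {p["status"]["hostIP"] for p in pods
                 if p.get("status", {}).get("phase") == "Running"}
        return {(host, node_port) for host in hosts}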


CloudStack Basic Networking: frictionless infrastructure

Continuing my series exploring CloudStack’s Basic Zone:

Back to Basics

Basic Networking deep dive

The origin of the term ‘Basic’ lies in the elimination of switch and router configuration (primarily VLANs) that trips up many private cloud implementations. When the cloud operator creates a Basic Zone, she is asked to add Pods to the availability zone. Pods are containers for hypervisor hosts.

[Figure: a section of a Basic Zone showing two Pods (racks) and VMs in three security groups]

The figure above shows a section of a largish Basic Zone. The cloud operator has chosen to map each rack to one Pod in CloudStack. Two Pods (Rack 1 and Rack 24) are shown with a sample of hypervisor hosts, along with VMs in three security groups. As described in the previous post, the Pod subnets are defined by the cloud operator when she configures the Pods in CloudStack. The cloud user cannot choose the Pod (or subnet) when deploying a VM.

The firewalls shown in each host reflect the fact that the security group rules are enforced in the hypervisor firewall and not on any centralized or in-line appliance. CloudStack orchestrates the configuration of these firewalls (essentially iptables rules) every time a VM state changes or a security group is reconfigured using the user API.
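As a purely illustrative sketch (this is not CloudStack’s actual agent script, and the chain name and rule layout are made up), an ingress security group rule might be expanded into per-VM iptables rules along these lines:

    # Illustration only: expanding one ingress security group rule into iptables
    # commands for the chain of a single VM on the hypervisor.
    def ingress_rules(vm_chain, rule):
        """rule example: {"protocol": "tcp", "start_port": 80, "end_port": 80,
                          "cidrs": ["0.0.0.0/0"]}"""
        cmds = []
        for cidr in rule["cidrs"]:
            cmds.append(
                "iptables -A {chain} -p {proto} -s {cidr} --dport {start}:{end} -j ACCEPT"
                .format(chain=vm_chain, proto=rule["protocol"], cidr=cidr,
                        start=rule["start_port"], end=rule["end_port"]))
        cmds.append("iptables -A {chain} -j DROP".format(chain=vm_chain))  # default deny
        return cmds

    # Allowing HTTP from anywhere into the VM's chain:
    for cmd in ingress_rules("i-2-10-VM", {"protocol": "tcp", "start_port": 80,
                                           "end_port": 80, "cidrs": ["0.0.0.0/0"]}):
        print(cmd)

Every VM state change or security group change simply regenerates and reapplies this kind of rule set on the hosts that run members of the affected group.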

Each rack can have multiple uplinks to the L3 core; in fact, this is how data centers are architected for cloud and big data workloads. In a modern datacenter, the racks form the leaves and the L3 core consists of multiple spine routers. Each host has multiple, equal-cost network paths to every other host. CloudStack’s Basic Zone takes advantage of this any-to-any east-west bandwidth availability by not constraining the placement of VMs by network location (although such a facility [placement groups] is available in CloudStack).

[Figure: leaf-and-spine datacenter network]

The cloud operator can still use VLANs for the rack-local links. For example, access VLAN 100 can be used in each rack to connect to the hypervisors (the “guest network”), while the untagged interface (the “management network”) can be used to connect to the management interface of each hypervisor.

CloudStack automatically instantiates a virtual DHCP appliance (“virtual router”) in every Pod that serves DHCP and DNS to the VMs in the pod. The same appliance also serves as the userdata server and password change service. No guest traffic flows through the appliance. All traffic between VMs goes entirely over the physical infrastructure (leaf and spine routers). No network virtualization overhead is incurred. Broadcast storms, STP configurations, VLANs — all the traditional bugbears of a datacenter network are virtually eliminated.

When the physical layer of the datacenter network is architected right, Basic Zone provides tremendous scale and ease-of-use:

  1. Location-independent high bandwidth between any pair of VMs
  2. Elimination of expensive, bandwidth-sucking, latency-inducing security appliances
  3. Easy security configuration by end-users
  4. Elimination of VLAN-configuration friction
  5. Proven scale: tens of thousands of hypervisors
  6. Egress firewalls provide security for the legacy / non-cloud portions of the datacenter.
  7. The ideal architecture for your microservices-based applications, without the network virtualization overhead

99 problems in my private cloud and networking is most of them

The state of private cloud is dire according to a number of pundits. Twitter’s de-facto cloud prognosticator warns: Do not build private clouds. Matt Asay declares private cloud to be a failure for a number of reasons, including the failure to change the way enterprises do business:

Private cloud lets enterprises pretend to be innovative, embracing pseudo-cloud computing even as they dress up antiquated IT in fancy nomenclature. But when Gartner surveyed enterprises that had deployed private clouds, only 5% claimed success

But he also lays blame on the most-hyped infrastructure technology of the past few years, OpenStack:

An increasing number of contributing companies are trying to steer OpenStack in highly divergent directions, making it hard for the newbie to figure out how to successfully use OpenStack. No wonder, then, that Joyent’s Bryan Cantrill hinted that the widespread failure of private clouds may be “due to OpenStack complexities.”

A large part of these complexities appear to be networking related:

No wonder most touted OpenStack successes have bespoke network architectures:

  • @WalmartLabs says they have 100k cores running, but

SDN is going to be our next step. Network is one area we need to put a lot of effort into. When you grow horizontally, you add compute, and the network is kind of the bottleneck for everything. That’s an area where you want more redundancy

  • PayPal runs a large (8500 servers) cloud, but uses VMware’s NVP for networking
  • CERN runs a large OpenStack cloud but uses a custom network driver

In a different article, Matt Asay even cites industry insiders to state that OpenStack’s “dirty little secret” is that it doesn’t scale, largely due to broken networking.

In fact, as I’ve heard from a range of companies, a dirty secret of OpenStack is that it starts to fall over and can’t scale past 30 nodes if you are running plain vanilla main trunk OpenStack software

Frustrated cloud operators might look at the newest darling on the block to solve their complexities: Docker. At least it has a single voice and the much-vaunted BDFL. Things should be better, right? Well, not yet. Hopes are high, but both networking and storage are pretty much “roll your own”. There are exotic options like Kubernetes, which pretty much only works in public clouds, SDN-like solutions (this, this, this, and more) and patchworks of proxies. Like the network operator needs yet another SDN solution rammed down her throat.

There is a common strand here: tone-deafness. Are folks thinking about how network operators really work? This lack of empathy sticks out like a sore thumb. If the solutions offered a genuine improvement to the state of networking, then operators might take a chance on something new. Network operators hoping to emulate web-scale operators such as AWS, Google and Facebook face a daunting task as well: private cloud solutions often add gratuitous complexity and take away none.

My favorite cloud software, Apache CloudStack, is not immune to these problems. The out-of-the-box network configuration is often a suboptimal choice for private clouds. Scalable solutions such as Basic Networking are ignored because, well, who wants something “basic”? In future posts, I hope to outline how private cloud operators can architect their CloudStack networks for a better, more scalable experience.

How dual-speed IT impacts private cloud architecture

An intriguing insight / hypothesis from Gartner is that IT can be more successful when it clearly demarcates ‘agile’ IT from ‘traditional’ IT. According to Lydia Leong:

Traditional IT is focused on “doing IT right”, with a strong emphasis on efficiency and safety, approval-based governance and price-for-performance. Agile IT is focused on “doing IT fast”, supporting prototyping and iterative development, rapid delivery, continuous and process-based governance, and value to the business (being business-centric and close to the customer)

The idea is that “agile” IT is better served by cloud, either IaaS or PaaS, while traditional IT sticks to its knitting and does business as usual. At some point, agile IT figures out how to do ‘cloud’ right and helps the other gang adopt the cloud. Of course, there’s dissent: Simon Wardley argues for trimodal IT, with the middle group mediating the extremes.

Lydia goes on to argue that:

Bimodal IT also implies that hybrid IT is really simply the peaceful coexistence of non-cloud and cloud application components — not the idea that it’s one set of management tools that sit on top of all environments.

Non-cloud application components are (my guess here) the domain of traditional IT; cloud application components are the domain of agile IT. The dichotomy also argues for 2 types of infrastructure: cloud and non-cloud.

A somewhat unrelated insight comes from Geoffrey Moore: there are 2 kinds of IT systems, Systems of Record (“Enterprise IT 1.0”) and Systems of Engagement (“the next stage of IT”). Systems of Record are:

global information systems that capture every dimension of our commercial landscape, from financial transactions to human resources to order processing to inventory management to customer relationship management to supply chain management to product lifecycle management, and on and on

Systems of Engagement, by contrast:

the focus instead will be on empowering the middle of the enterprise to communicate and collaborate across business boundaries, global time zones, and language and culture barriers, using next-generation IT applications and infrastructure adapted from the consumer space.

Systems of Record are the cost of doing business. They need to be highly optimized, low risk and rock solid, and they rely on processes such as Six Sigma to deliver the quality and efficiency demanded by the business. It is unlikely that these will be moved into the cloud in the near future.

The hypothesis (mine) here is that the systems of record are hosted on traditional IT / non-cloud infrastructure, while private or public clouds host the systems of engagement.

Obviously, the newer systems of engagement, whether deployed on private or public clouds, may need access to the data held by the systems of record.

If you have a private cloud for agile/systems of engagement, then the interaction looks like this:

[Figure: systems of engagement on a private cloud connecting to systems of record]

If you use a public cloud for your systems of engagement, then it looks like:

[Figure: systems of engagement on a public cloud connecting to on-premises systems of record]

Yet another way to look at it might be the “pets vs. cattle” schema.

[Figure: systems of record as “pets”, systems of engagement as “cattle”]

Public clouds make this interconnection “easy” by providing the required infrastructure. For example, AWS provides VPN Gateway and AWS Direct Connect. These facilities allow applications hosted on instances in the AWS cloud to access resources that are “on-prem” (and vice versa).

Theoretically, the interconnect should be dead simple in the private cloud case. After all, both parts of the infrastructure are hosted on the same local network, presumably in a single administrative domain. Complications can arise from:

  1. Business needs
  2. Artifacts of the private cloud implementation

First, the business needs: integrating systems of record and systems of engagement often involves crossing security boundaries. The former is guarded like Fort Knox; the latter has more fluid requirements. So the solution might involve, for example, inserting security devices in the path.

[Figure: security devices inserted in the path between the systems of record and the private cloud]

The challenge is that the system on the right is extremely fluid: the network is constantly being reconfigured. Each change on the right might require changes to the security devices. The required level of network automation (to automate the security policy) is an unseen cost of implementing this architecture.

Private cloud networking brings its own complexities: it is often the most challenging part of implementing a private cloud. While the private cloud software stack might provide a solution that works within the cloud, it won’t provide a solution for the security policy automation problem mentioned above.

Bimodal IT is an interesting idea but can lead to ‘gaps’ between the modes, including in the infrastructure domain. In a future post I hope to convince you that Apache CloudStack has some tricks up its sleeve to solve some of these problems.

How did they build that — EC2 Enhanced Networking

Among the flurry of new features introduced by AWS in 2013 is a performance enhancement known as ‘Enhanced Networking‘. According to the blurb: “enhanced networking on your instance results in higher performance (packets per second), lower latency, and lower jitter”. The requirements are that you install an Intel 10GbE driver (ixgbevf) in your instance and enable a feature called SR-IOV.
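For what it’s worth, a present-day sketch of checking and enabling this attribute with boto3 (which postdates this post) might look like the following; the instance ID is a placeholder, and the instance must be stopped and of a supported HVM type with the ixgbevf driver installed:

    # Sketch: check and enable the sriovNetSupport attribute on an instance.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance id

    attr = ec2.describe_instance_attribute(InstanceId=INSTANCE_ID,
                                           Attribute="sriovNetSupport")
    print("current value:", attr.get("SriovNetSupport", {}).get("Value"))

    # 'simple' turns on enhanced networking for instance types that support it.
    ec2.modify_instance_attribute(InstanceId=INSTANCE_ID,
                                  SriovNetSupport={"Value": "simple"})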

The AWS cloud is built around virtualization technology — specifically, your instances are virtual machines running on top of a version of the open source Xen hypervisor. The hypervisor is what guarantees the isolation between my instance and your instance when they both run on the same set of CPUs.

The hypervisor intercepts all I/O from the virtual machine so that the virtual machine is abstracted from the hardware — this provides security as well as portability, since the VM doesn’t need to care about the drivers for the I/O hardware. The VM sees a NIC that is software defined, and as a result the hypervisor can inspect all traffic to and from the VM. This allows AWS to control the network traffic between the VM and the rest of the infrastructure, and is used to deliver features such as security groups and ACLs.

The downside of processing all network traffic to/from the VM is that host CPU cycles are consumed processing this traffic. This is a significant overhead compared to a bare-metal instance: the hypervisor needs to apply stateful firewall rules to every packet, switch the packet and encapsulate it. Some estimates put this overhead as high as 70% of the CPU available to the hypervisor (at 10 Gb/s). Software processing also introduces noisy-neighbor problems — variable jitter and high latency at 10 Gb/s are common.

[Figure: all VM network traffic processed in software by the hypervisor]

Fortunately, SR-IOV (Single Root I/O Virtualization) provides a direct path for the VM to access the underlying hardware NIC. Bypassing the hypervisor leads to line-rate performance. Enhanced Networking takes advantage of this: in order to benefit, your AMI needs to have the SR-IOV driver installed.

[Figure: SR-IOV providing a direct path from the VM to the NIC, bypassing the hypervisor]

Great — but now that the hypervisor is out of the path, how does AWS provide software-defined features such as security groups and ACLs? The current generation of SR-IOV NICs (AWS uses the Intel 82599) does not have stateful firewalls or the ability to process large numbers of ACLs. Furthermore, we know that AWS must be using some kind of encapsulation / tunnelling so that VPCs are possible, and the Intel 82599 does not provide encapsulation support.

The solution, then, would be to do the extra processing elsewhere — either off the host, or in the host using a co-processor. The schematic below shows processing happening at the TOR switch. The drawback is that even intra-host traffic has to be tromboned via the TOR. Furthermore, the switch becomes a pretty big bottleneck, and a failure in the switch could lead to several hosts losing network connectivity.

[Figure: SR-IOV traffic hairpinned through the TOR switch for firewalling and encapsulation]


Using a co-processor would be the best solution. Tilera is one such processor that comes to mind. Since the Tilera provides general-purpose processing cores, the encap/decap/filtering/stateful-firewall processing could be done in software instead of in ASICs or FPGAs.

[Figure: packet processing offloaded to a co-processor on the host]


The software/hardware solution could allow AWS to introduce further innovations in its networking portfolio, including end-to-end encryption, IDS and IPS.

Disclaimer: I have no knowledge of AWS internals. This is just an exploration of “how did they build it?”.

Update: a confirmation of sorts on Werner Vogel’s blog: http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html

The SDN behemoth hiding in plain sight

Hint: it is Amazon Web Services (AWS). Let’s see it in action:

Create a VPC with 2 tiers: one public (10.0.0.0/24) and one private (10.0.1.0/24). These are connected via a router. Spin up 2 instances, one in each tier (10.0.0.33 and 10.0.1.168).
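For context, a rough boto3 sketch of building such a VPC (the region and IDs are illustrative; route tables, gateways and the security group rule allowing ICMP are omitted for brevity):

    # Minimal two-tier VPC skeleton; IDs and region are illustrative.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]
    public = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.0.0/24")["Subnet"]
    private = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24")["Subnet"]

    # Note: no router is created explicitly; routing between the two subnets exists
    # as soon as the VPC does, via the VPC's main route table.
    print(vpc["VpcId"], public["SubnetId"], private["SubnetId"])

Pinging the private instance from the instance in the public subnet: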

    [ec2-user@ip-10-0-0-33 ~]$ ping 10.0.1.168 -c 2
    PING 10.0.1.168 (10.0.1.168) 56(84) bytes of data.
    64 bytes from 10.0.1.168: icmp_seq=1 ttl=64 time=1.35 ms
    64 bytes from 10.0.1.168: icmp_seq=2 ttl=64 time=0.412 ms

The sharp-eyed might have noticed an oddity: the hop count (TTL) does not decrement despite the presence of a routing hop between the 2 subnets. So, clearly, it isn’t a commercial router or any standard networking stack. AWS calls it an ‘implied router‘. What is likely happening is that the VPC is realized by the creation of overlays. When the Ethernet frame (ping packet) exits 10.0.0.33, the virtual switch on the hypervisor sends the frame directly to the hypervisor that is running 10.0.1.168, inside the overlay. The vswitches do not bother to decrement the TTL, since that would force an expensive recomputation of the checksum in the IP header. Note that AWS introduced this feature in 2009 — well before Open vSwitch even had its first release.

One could also argue that security groups and elastic IPs at the scale of AWS’s datacenters bear the hallmarks of Software Defined Networking: clearly it required AWS to innovate beyond standard vendor gear to provide such to-date-unmatched L3 services. And these date back to the early days of AWS (2007 and 2008).

It doesn’t stop there. Elastic Load Balancing (ELB) from AWS orchestrates virtual load balancers across availability zones — providing L7 software-defined networking. And that’s from 2009.

Last month’s ONS 2013 conference saw Google and Microsoft (among others) presenting facts and figures about the use of Software Defined Networking in their data centers. Given the far-and-away leadership of AWS in the cloud computing infrastructure space, it is interesting (to me) that AWS is seldom mentioned in the same breath as the pixie dust du jour “SDN”.

In CloudStack’s native overlay controller, implied routers have been on the wish list for some time. CloudStack also has a scalable implementation of security groups (although scaling to hundreds of thousands of hypervisors might take another round of engineering). CloudStack also uses virtual appliances to provide L4-L7 services such as site-to-site IPsec VPN and load balancing. While the virtual networking feature set in CloudStack is close to that of AWS, the AWS implementation is likely an order of magnitude bigger.