Author Archives: Chiradeep Vittal

About Chiradeep Vittal

Exploring cloud architecture since 2008. I am an Apache CloudStack committer / PMC member.

Design patterns in Orchestrators (part 2 of N) – southbound APIs

This is part of a series on design patterns in building orchestration systems. The focus is on orchestrators found in clouds, data centers, networking systems, and so on, but the principles should be broadly applicable.

In a previous post we touched on an important issue that is a side effect of performing orchestration over a network: idempotent operations. The communication between the controller and the subsystems is sometimes called “southbound”, while the API offered by the controller is called “northbound”. Of course the northbound API could be the southbound API for an uber-controller, and the southbound API could be the northbound API of a smaller system.

Specifying the contract between the controller and the southbound subsystem produces a tension between good enough and perfect. A REST API is sometimes used to write the specification, since it implies specific requirements on the various verbs: for example, PUT and GET operations have to be idempotent, while a POST need not be. However, the system architect may have to posit idempotent properties for POST operations as well, as described previously. The REST API endpoint for the subsystem can also provide monitoring and other operational data useful to the system operator.
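
As an illustration of why the verb choice matters, here is a minimal sketch of a southbound endpoint for a hypothetical storage subsystem (the resource names and framework choice are mine, not from any real controller). PUT is keyed by a caller-supplied identifier, so a retry simply re-applies the same desired state, whereas a bare POST would allocate a new volume on every retry.

# Minimal sketch of an idempotent southbound endpoint (hypothetical resource names).
from flask import Flask, request, jsonify

app = Flask(__name__)
volumes = {}  # volume_id -> desired state; stands in for the subsystem's durable store

@app.route("/volumes/<volume_id>", methods=["PUT"])
def put_volume(volume_id):
    # PUT is idempotent: the caller names the resource, so a retried PUT
    # simply re-applies the same desired state instead of allocating again.
    desired = request.get_json()
    created = volume_id not in volumes
    volumes[volume_id] = desired
    return jsonify(volumes[volume_id]), (201 if created else 200)

@app.route("/volumes/<volume_id>", methods=["GET"])
def get_volume(volume_id):
    # GET is naturally idempotent and doubles as monitoring/operational data.
    if volume_id not in volumes:
        return jsonify({"error": "not found"}), 404
    return jsonify(volumes[volume_id]), 200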

A drawback of specifying a REST API for a subsystem is that it tends to make the orchestrator hierarchical. The subsystem REST API implementation itself becomes a mini-controller that needs its own locus of operations. A REST API can also be rigid, making it hard to evolve – this can be problematic, especially in the early phases of system design. Finally, what could be a single hop between the controller and the subsystem becomes two hops – this increases the operational burden.

[Figure: orchestration2]

Famously, the OpenStack project specifies REST APIs for its various subsystems (Cinder – block storage, Neutron – networking). Subsystem vendors supply drivers/plugins that implement the southbound API. The driver implementation could in turn call vendor REST APIs or other southbound APIs (e.g., OpenFlow, NetConf, SNMP) to various devices. Driver implementations are often mini-controllers that maintain the desired state of the subsystem in a persistent/durable store.


An alternate model is to specify the southbound API in the programming language of the controller, for example as a Java interface (e.g., Apache CloudStack) or an OSGi plugin (OpenDaylight, ONOS).

[Figure: orchestration3]

The driver / plugin is responsible for translating the API call into the specific subsystem API call. For example, the plugins for each hypervisor (XenServer, KVM, VMware) in Apache CloudStack use the respective hypervisor APIs (XAPI, libvirt, vSphere API) to implement the hypervisor plugin API. Plugins in this case can use the persistent store of the main controller. A drawback of this approach is that the plugin has to be written in the language of the controller. Installing or upgrading a plugin in a running production system may also cause some downtime. Last but not least, the system architect must be vigilant that the driver / plugin code does not call back into the controller or directly use/modify the controller’s state store.
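
CloudStack’s actual plugin contracts are Java interfaces, so the sketch below is only an illustration of the shape of the pattern, written in Python with invented names: a driver interface the controller codes against, and one concrete driver that translates calls into a vendor API.

# Rough sketch of the in-process driver/plugin pattern (names are invented).
from abc import ABC, abstractmethod

class HypervisorDriver(ABC):
    """Southbound contract the controller codes against."""

    @abstractmethod
    def start_vm(self, vm_spec: dict) -> str:
        """Start a VM and return its hypervisor-assigned id."""

    @abstractmethod
    def stop_vm(self, vm_id: str) -> None:
        """Stop a running VM."""

class LibvirtDriver(HypervisorDriver):
    """Translates the plugin API into libvirt calls (connection handling elided)."""

    def __init__(self, conn):
        self.conn = conn  # e.g. a libvirt connection object

    def start_vm(self, vm_spec):
        # Translate the controller's neutral vm_spec into libvirt domain XML.
        domain_xml = render_domain_xml(vm_spec)    # hypothetical helper
        domain = self.conn.createXML(domain_xml, 0)
        return domain.UUIDString()

    def stop_vm(self, vm_id):
        self.conn.lookupByUUIDString(vm_id).shutdown()

# The controller only ever sees HypervisorDriver; swapping vendors means
# registering a different implementation, not changing the controller.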

Adding support for a new vendor is easier in the REST southbound API model – the vendor just has to provide an implementation translating the southbound REST API to the vendor’s API, in a language of their choice. However, the addition of layers complicates operations, troubleshooting and upgrades.

The in-process plugin model of Apache CloudStack, OpenDaylight, etc., makes it easier to install and get a system operational. A single locus of operations also makes it easier to test, operate, troubleshoot and upgrade. Developing a plugin, however, is more complicated since it requires knowledge of the developer tooling used to build the controller.

Design patterns in orchestrators (part 1 of n) – idempotent operations

Orchestration is a somewhat overloaded term in the context of automation. Generally, it implies a central controller that tries to bring a complicated system to a desired state. There are usually a large number of subsystems that the controller manages. Changing the state of the system involves communicating with the subsystems in order to get them to change their state. The communication usually happens over a network.

[Figure: orchestration1]

As a simple example, consider a home automation controller that is trying to get the home ready to receive its occupants by:

  1. Setting the indoor temperature by setting the thermostats
  2. Opening the garage door
  3. Turning on lights
  4. Turning on the tea kettle

The network, however, is unreliable. There are several failure modes to consider:

  1. The message from the controller may never reach the subsystem. Usually the subsystem acknowledges control messages from the controller; the controller may implement a timeout so that if the subsystem never gets the message, the controller times out waiting for the acknowledgement and executes some kind of recovery procedure.
  2. The message may reach the subsystem but the subsystem is not ready or not in a state to process it. The controller will get a negative acknowledgement in this case and needs to execute another kind of fault recovery procedure.
  3. The message reaches the subsystem and the subsystem executes the requested control, but fails to complete the requested task. For example, it may request a downstream subsystem to execute a task, but that downstream subsystem fails (again, perhaps due to the network). The controller may or may not get a different negative acknowledgement in this case. The subsystem may even fail midway through the task.
  4. The subsystem gets the message, executes the task perfectly, but the acknowledgement never reaches the controller. The controller usually times out and executes some kind of fault recovery procedure.

Distinguishing between these kinds of failures at the controller is hard: when a timeout occurs, the controller cannot tell whether the subsystem performed the requested task or not. A common recovery procedure is to re-try the command to the subsystem. Within this recovery mode, the controller has to decide:

  • how many times to retry
  • how long to keep retrying
  • when to alert a human

Depending on the semantics of the task, there are different answers. Consider an orchestration flow where the controller has to set up a virtual machine. The tasks involved could be to allocate storage, program network elements such as switches, routers and DHCP servers, choose hypervisor hosts and so on. Any of these tasks could fail. Retrying indefinitely to allocate storage when there is not enough storage available doesn’t make sense. Retrying because there was a timeout might make sense. Alerting a human when there are hundreds or thousands of subsystems being modified doesn’t scale – it is better to design recoverability into the system.
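
A sketch of what such a retry policy can look like (the error classification and limits here are illustrative and not taken from any particular controller):

# Illustrative retry policy: bounded retries with backoff for timeouts,
# immediate escalation for errors that retrying cannot fix.
import time

class SubsystemTimeout(Exception): pass      # no acknowledgement arrived
class SubsystemRejected(Exception): pass     # e.g. "not enough storage available"

def run_with_retries(command, max_attempts=5, base_delay=2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return command()                 # returns the subsystem's acknowledgement
        except SubsystemTimeout:
            if attempt == max_attempts:
                break                        # retries exhausted; escalate below
            time.sleep(base_delay * 2 ** (attempt - 1))    # exponential backoff
        except SubsystemRejected:
            break                            # retrying won't help (e.g. out of capacity)
    # Alerting a human is the last resort; ideally recoverability is designed in.
    raise RuntimeError("command failed; escalating for recovery/alerting")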

When the controller re-tries the command to the subsystem, it is possible to have an unexpected effect. Let’s say the storage subsystem in the virtual machine example did allocate the storage as requested the first time, but the controller didn’t receive the acknowledgement. The controller retries the command, resulting in double allocation at the storage system.

The solution in this case is to ensure that the commands from the controller to the subsystem are idempotent. That is, executing the same command multiple times produces the same result. The trick is to uniquely identify the change that is being requested. The subsystem stores/remembers the identifier so that if the change is re-requested, it doesn’t re-do the change. The identifier can be opaque (i.e., the structure or contents of the id have no semantics, like a uuid) or be derived from the state description sent to the subsystem (e.g., a file name). Opaque identifiers help avoid leaky abstractions between the controller and the subsystem. In many cases the subsystem cannot be modified to be idempotent (e.g., proprietary systems, different admin space), so a non-opaque identifier has to be used. Examples include fully-qualified domain names, filesystem paths and IP addresses.
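
A minimal sketch of the subsystem side of this contract (the request shape and function names are invented): the subsystem remembers the identifiers it has already applied and returns the recorded result instead of re-doing the work.

# Minimal sketch: the subsystem deduplicates commands by their unique identifier.
applied = {}   # request_id -> result; in practice this lives in a durable store

def allocate_storage(request_id, size_gb):
    if request_id in applied:
        # The controller is retrying a command that was already executed:
        # return the recorded result instead of allocating a second volume.
        return applied[request_id]
    volume = do_allocate(size_gb)       # hypothetical call into the storage backend
    applied[request_id] = volume
    return volume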

The idempotency trick helps in another corner case: when the subsystem reboots, re-initializes, or gets recreated after a failure, it may not know the last command / desired state sent by the controller. For example, consider the case of the home automation system where a defective thermostat is replaced. The new thermostat contacts the home automation controller, and the controller re-sends the last control command. Since the new thermostat doesn’t have a record of the unique identifier in the command, it applies the change requested by the command.

A complex system with many subsystems and resources is constantly changing state independent of the controller. For example, hosts reboot, network switches go down, disks fail, and so on. The controller has to detect when the system has drifted from the desired state and then execute compensating commands to the subsystems to bring them back to the desired state. Having idempotent commands with unique identifiers is crucial to this recovery.

Architects of orchestration controllers often discover the need for idempotent operations well after implementation is in production. Since the controller usually in turn offers an API, the system architect has to ensure that this “northbound” API also supports idempotent commands / operations. Even Amazon Web Services (AWS) introduced idempotent run-instances quite late in the game (2010).

A (round)trip with Java and Go

Can you generate Go code from a Java binary jar? This need arose from writing a prototype Ingress Controller for Kubernetes that uses a NetScaler to provide the Ingress function. Another need was a Terraform provider for NetScaler. (While the NetScaler Ingress controller didn’t need a Go client, it is customary to write Kubernetes integrations in Go.)

NITRO is the REST API used to program the Citrix NetScaler load balancer. The API is easy to use, with well-defined usage patterns. Most commonly, JSON is used to create/update/read/delete configuration on the NetScaler. There are Java and Python clients (as well as PowerShell and Perl ones), but I needed a Go client for NITRO.

There were a few roadblocks to producing the Go client. First, the NITRO API is vast (over 1000 config objects with corresponding JSON definitions). Second, the JSON schema is documented only in HTML pages, or one has to resort to tools like Postman to reverse-engineer it.

After studying the NITRO Java SDK source, it occurred to me that each config object in the REST API has a corresponding Java class. For example, com.citrix.netscaler.nitro.resource.config.lb.lbvserver has fields that represent the possible fields in the JSON config object used to configure an lbvserver.

The task was then to generate a JSON schema (draft v4) from the NITRO Java SDK. This was relatively easy after finding JJSchema, a project that generates a JSON schema from Java classes. A few changes were required to make JJSchema work with the NITRO jar: JJSchema assumed that all fields had an accessor of the form getFoo(), whereas in NITRO it is get_foo(). JJSchema also relies on field annotations (@Attributes) to figure out metadata such as enums and read-only fields. For enums, the Java classes in the NITRO package have inner classes with static member constants, and figuring out the read-only attribute was a matter of finding fields that didn’t have set_foo() methods. The resulting changes are in a fork: https://github.com/chiradeep/JJSchema. The code to invoke JJSchema is in https://github.com/chiradeep/json-nitro; it determines all the subclasses of com.citrix.netscaler.nitro.resource.config and invokes the forked JJSchema on those classes.

Getting from JSON schema to Go involves another open source project (Generate). As the blurb says, Generate generates Go structs from JSON schema. The generated code is used in a (somewhat incomplete) Go client for the NITRO API (https://github.com/chiradeep/go-nitro).

Producing the JSON schema has the nice side effect that it should be easier to write new clients (Ruby/JavaScript, anyone?) for the NITRO API. Who knows which language will catch the fancy of systems geeks a year from now (Rust?).


Consul-template and Citrix Netscaler

Consul-template is a tool that can drive reconfiguration of applications and infrastructure in response to changes in the keys/values stored in Consul. Usually it is used to populate changes into a local file in the filesystem; following the change, the application or infrastructure software is typically restarted.

Previously I’d written up integrations between container managers such as Kubernetes and NetScaler. It was a relatively simple matter to include support for consul-template with a slight tweak. NetScaler only supports a REST-based configuration API (“Nitro“), so populating a config file on the NetScaler was not going to do the job. The solution was simple: write a JSON file using consul-template and then ask consul-template to execute a Python script that converts the JSON into Nitro API calls.

So:

consul-template -consul $CONSUL_IP:8500 -template consul_single_svc.ctmpl:cfg.json:"python main.py --cfg-file cfg.json"

Here cfg.json is the intermediate JSON file produced by consul-template.
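
The conversion script itself can stay small. Below is a rough sketch of the idea (the real main.py is in the linked repo; the cfg.json layout, credentials and object types shown here are assumptions):

# Rough sketch: read the JSON emitted by consul-template and push it to the
# NetScaler via the NITRO REST API. The cfg.json layout is an assumption.
import argparse
import json
import os
import requests

NSIP = os.environ.get("NS_IP", "10.0.0.10")       # NetScaler management IP (example)
BASE = "http://%s/nitro/v1/config" % NSIP
HEADERS = {"X-NITRO-USER": os.environ.get("NS_USER", "nsroot"),
           "X-NITRO-PASS": os.environ.get("NS_PASSWORD", "nsroot"),
           "Content-Type": "application/json"}

def add_service(name, ip, port):
    payload = {"service": {"name": name, "ip": ip, "port": port,
                           "servicetype": "HTTP"}}
    # Treat 409 (object already exists) as success so re-runs stay idempotent.
    r = requests.post(BASE + "/service", headers=HEADERS, data=json.dumps(payload))
    if r.status_code not in (200, 201, 409):
        r.raise_for_status()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--cfg-file", required=True)
    args = parser.parse_args()
    with open(args.cfg_file) as f:
        cfg = json.load(f)
    for svc in cfg.get("services", []):           # assumed cfg.json structure
        add_service(svc["name"], svc["ip"], svc["port"])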

You can get the code here.

Apple’s iCloud is a multi-cloud beast

Apple device users have probably taken and stored 100 billion photos:

  • In early 2013, the number was 9 billion
  • There are 100 million iPhones in active use in 2015. If each iPhone takes 1000 pictures per year, that’s 100 billion photos in 2015 alone.
  • Photos have been automatically backed up to iCloud since iOS 5

I’d assumed that iCloud is a massive compute and storage cloud, operated like the datacenters of Google and Amazon.

Turns out that, at least for photo storage, iCloud is actually composed of Amazon’s S3 storage service and Google’s Cloud Storage service. I serendipitously discovered this while copying some photos from my camera’s SD card to my MacBook using the native Photos app. I’d recently installed ‘Little Snitch‘ to see why the camera light on my MacBook turns on for no reason. Little Snitch immediately alerted me that Photos was trying to connect to Amazon’s S3 and Google’s Cloud Storage.

So it looks like Apple is outsourcing iCloud storage to two different clouds. At first glance this is strange: AWS S3 promises durability of 99.999999999%, so backing up to Google gains very little reliability for a doubling of cost.

It turns out that AWS S3 and Google Cloud Storage are used differently:

For the approximately 200 hi-res photos that I was copying from my camera’s SD card, AWS S3 stores a LOT (1.58 GB), while Google stores a measly 50 MB. So Apple is probably using Google for something else. Speculation:

AWS S3 has an SLA of 99.99%. For the cases where it is unavailable (but photos are still safe), Google can be used to store / fetch low-res versions of the Photo stream.

The Google location could also be used to store an erasure code, although from the size, it seems unlikely.

Apple charges me $2.99 per month (reduced from $3.99 per month last fall) for 200GB of iCloud storage. Apple should be paying (according to the published pricing) between $2.50 and $5.50 per month to Amazon AWS for this. Add in a few pennies for Google’s storage, and they are probably break-even or slightly behind. If they were to operate their own S3-like storage, they would probably make a small-to-medium profit instead. I’ve calculated some numbers based on 2 MB per iPhone image.

Profit/Loss per TB per month | 2 PB (1 billion photos) | 20 PB (10 billion photos) | 200 PB (100 billion photos) | 2000 PB (1 trillion photos)
-$5                          | -$10,000                | -$100,000                 | -$1,000,000                 | -$10,000,000
-$10                         | -$20,000                | -$200,000                 | -$2,000,000                 | -$20,000,000
$10                          | $20,000                 | $200,000                  | $2,000,000                  | $20,000,000
$20                          | $40,000                 | $400,000                  | $4,000,000                  | $40,000,000
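
The numbers in the table are simple arithmetic (2 MB per photo, 1,000 TB per PB). A quick sanity check of the first column:

# 1 billion photos at 2 MB each is about 2 PB, i.e. 2,000 TB.
photos = 10**9
terabytes = photos * 2 / 10**6         # 2 MB per photo, 1,000,000 MB per TB
margin_per_tb = -5                     # dollars of profit/loss per TB per month
print(terabytes, terabytes * margin_per_tb)   # 2000.0 TB -> -10000.0 dollars/month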

Given Apple’s huge profits of nearly $70 billion per year, paying Amazon about a quarter of a billion for worry-free, infinitely scalable storage seems worth it.

I haven’t included the cost of accessing the data from S3, which can be quite prohibitive, but I suspect that Apple uses a content delivery network (CDN) for delivering the photos to your photo stream.


Multi-cloud is clearly not a mythical beast. It is here and big companies like Apple are already taking advantage of it.

Save money on your AWS bill

A couple of years ago I was confronted with a bill of several hundred dollars because I’d forgotten to turn off some machines on AWS (I think it was an ELB – elastic load balancer). Since then, I make it a point to log in and check often to see if I’ve left stuff running. I’ve automated this simple check here: https://github.com/chiradeep/idle-instance-reaper

You can run the check using AWS Lambda as well. Just make sure you configure a ScheduledEvent trigger for it.
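
A minimal sketch of what such a scheduled check can look like (the real idle-instance-reaper is in the linked repo and has its own rules; the "keep" tag convention here is an assumption):

# Minimal sketch of a scheduled Lambda check for forgotten EC2 instances.
import boto3

def handler(event, context):
    ec2 = boto3.client("ec2")
    running = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}])
    suspects = []
    for reservation in running["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            # Anything not explicitly marked as permanent is suspect.
            if tags.get("keep", "false").lower() != "true":
                suspects.append(instance["InstanceId"])
    if suspects:
        # Report the suspects; stopping them is left commented out to stay safe.
        print("Possibly forgotten instances:", suspects)
        # ec2.stop_instances(InstanceIds=suspects)
    return {"suspects": suspects}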

[Figure: AWS_Lambda]

Hope you save some money with this tip.


Automated configuration of NetScaler Loadbalancer for Kubernetes, Mesos and Docker Swarm

There are an incredible number of cluster managers for containerized workloads. The top clustered container managers include Google’s Kubernetes, the Marathon framework on Apache Mesos and Docker Swarm. While these managers offer powerful scheduling and autonomic capabilities, integration with external load balancers is often left as an exercise for the user. Since load balancers are essential components in a horizontally scaled microservice architecture, this omission can impede the rollout of your chosen container manager. Further, since a microservice architecture demands rapid deployment, any solution has to be able to keep up with changes in the topology and structure of the microservice.

Citrix NetScaler is an application delivery controller widely used in load balancing applications at several Web-scale companies. This blog post describes Nitrox, a containerized application that can work with Docker Swarm or Kubernetes or Marathon (on Apache Mesos) to automatically reconfigure a Citrix NetScaler instance in response to application events such as deployment, rolling upgrades and auto/manual scale.

The figure below shows a scaling event that causes the number of backend containers for application α to grow from 4 to 6. The endpoints of the additional containers have to be provisioned into the NetScaler as a result.

[Figure: scaleout]

All cluster managers offer an event stream of application lifecycle events. In this case, Docker Swarm sends a container start event, Marathon sends a status_update_event with a taskStatus field, and Kubernetes sends an endpoint update event. The job of Nitrox is to listen to these event streams and react to the events. In most cases this triggers a query back to the cluster manager to obtain the new list of endpoints. This list of endpoints is then configured into the NetScaler.

[Figure: nitrox]

The reconfiguration of the NetScaler is idempotent and complete: if an endpoint already exists in the NetScaler configuration, it isn’t re-added, which prevents unnecessary reconfiguration. The set of endpoints sent to the NetScaler is not incremental: the entire set is sent each time. This overcomes any problem with missed / dropped events and makes the NetScaler configuration eventually consistent.
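
In outline, the reconciliation looks something like the sketch below (function names are placeholders; the actual code is in the Nitrox repo linked above):

# Outline of the Nitrox reconciliation loop (placeholder function names).
def sync_app(app_name):
    desired = set(get_endpoints_from_cluster_manager(app_name))   # e.g. {("10.1.1.4", 8080), ...}
    actual = set(get_bound_endpoints_from_netscaler(app_name))
    for ip, port in desired - actual:
        bind_endpoint_to_netscaler(app_name, ip, port)            # add only what is missing
    for ip, port in actual - desired:
        unbind_endpoint_from_netscaler(app_name, ip, port)        # remove what has gone away
    # Sending the full desired set each time keeps the NetScaler eventually
    # consistent even if individual events were missed or delivered twice.

def event_loop():
    for event in watch_cluster_manager_events():   # Swarm / Marathon / Kubernetes stream
        app_name = app_name_from_event(event)      # placeholder: extract the affected app
        if app_name:
            sync_app(app_name)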

Another choice was to let the application / NetScaler admin provision the frontend details of the application. The frontend has myriad options such as lbmethod, persistence and stickiness. Each application is likely to have different needs for this configuration; it is also assumed to be chosen once and to be independent of the size and scope of the backends.

You can find Nitrox  code and instructions here. The container is available on Docker Hub as chiradeeptest/nitrox. The containerized Nitrox can be scheduled like a regular workload on each of the container managers: docker run on Docker Swarm, an app on Marathon and a replication controller on Kubernetes.

Implementation Notes

Kubernetes (and optionally Docker Swarm) requires virtual networking (Kubernetes is usually used with flannel). Therefore the container endpoints are addresses on a virtual network. Since the NetScaler doesn’t participate in the virtual network (consider a non-virtualized NetScaler appliance), this becomes a problem. For Docker Swarm, Nitrox assumes that bridged networking is used.

For Kubernetes, it is assumed that the service (app) being load balanced is configured to use the NodePort style of exposing the service to external access. Kubernetes chooses a random port and exposes this port on every node in the cluster. Each node has a proxy that can provide access on this port, and the proxy load balances the ingress traffic to each backend pod (container). One strategy, then, would be to simply configure the NetScaler to load balance to every node in the cluster. However, if the application has, say, 2 containers while the cluster has 50 nodes, the NetScaler would needlessly send traffic to many nodes. To make this more efficient, Nitrox figures out the list of nodes that the containers are actually running on and provisions only those endpoints on the NetScaler.
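
A sketch of that lookup using the Kubernetes Python client (namespace, service name and label selector are assumptions):

# Sketch: find the NodePort for a service and the nodes its pods actually run on,
# so that only those node addresses are configured on the NetScaler.
from kubernetes import client, config

def nodeport_endpoints(namespace, service_name, selector):
    config.load_kube_config()                  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    svc = v1.read_namespaced_service(service_name, namespace)
    node_port = svc.spec.ports[0].node_port    # the random port Kubernetes picked
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    node_names = {p.spec.node_name for p in pods.items if p.spec.node_name}
    endpoints = []
    for name in node_names:
        node = v1.read_node(name)
        addresses = [a.address for a in node.status.addresses if a.type == "InternalIP"]
        if addresses:
            endpoints.append((addresses[0], node_port))
    return endpoints

# Example: nodeport_endpoints("default", "frontend", "app=frontend")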


Quick Tip: Docker Machine on Apache CloudStack and XenServer

There is now Docker Machine support for Apache CloudStack. See @atsaki‘s work at https://github.com/atsaki/docker-machine-driver-cloudstack

docker-machine create -d cloudstack \
--cloudstack-api-url CLOUDSTACK_API_URL \
--cloudstack-api-key CLOUDSTACK_API_KEY \
--cloudstack-secret-key CLOUDSTACK_SECRET_KEY \
--cloudstack-template "Ubuntu Server 14.04" \
--cloudstack-zone "zone01" \
--cloudstack-service-offering "Small" \
--cloudstack-expunge \
docker-machine

Another way to do this is to launch your VM in CloudStack and then use the generic driver (assuming you have the private key from your sshkeypair):

docker-machine create -d generic \
--generic-ip-address=VM_IP \
--generic-ssh-key=SSH_PRIVATE_KEY \
--generic-ssh-user=SSH_USER

This will ALSO work for plain old VMs created on XenServer  (which currently does not have a driver).

Bonus: in either case you can use docker-machine to set up a Docker Swarm by adding the parameters:

--swarm \
--swarm-discovery token://<swarm-cluster-token>

Apache Mesos and Kubernetes on Apache CloudStack

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes. Mesos is used by a number of web-scale companies such as Twitter, Airbnb and even Apple.

[Figure: Containers Layers]

Cluster managers such as Mesos and Kubernetes are easier to set up than a full-blown IaaS stack: they do not orchestrate network, storage and other services. Plus, they solve the problem of dividing (and bundling) the capacity of a single virtual machine into more useful chunks. Mesos can schedule containers on the cluster in addition to other workloads. Cluster managers are easy to set up on traditional virtualization infrastructure as well (check out Citrix Lifecycle Manager for an example). But without persistent volumes, load balancers and other network services, cluster managers may not be able to tackle the full range of workloads handled by an IaaS.

If you already have Apache CloudStack up and running and want to run a cluster manager on it, it just got easier. I used Packer and Terraform to completely automate the provisioning of a full Mesos cluster. This recipe (here) first uses Packer to build a reusable Ubuntu 14.04 image with the required packages installed (ZooKeeper, Mesos, Marathon, etc.). The Terraform configuration then drives the creation of the cluster using this template.

Just for completeness, I have Kubernetes-on-CloudStack automated as well (using Terraform). For better or worse, both Mesos and Kubernetes are rapidly evolving, so the automation may be broken by the time you try it out. Feel free to open a pull request to correct any errors.

Farming your CloudStack cloud

A couple of years ago, I blogged about my prototype of StackMate, a tool and a service that interprets AWS CloudFormation-style templates and creates CloudStack resources. The idea was to provide an application management solution. I didn’t develop the idea beyond a working prototype. Terraform from HashiCorp is a similar idea, but with the ability to add extensions (providers) to drive resource creation in different clouds and service providers. Fortunately, Terraform is solid and widely used. Even better, Sander van Harmelen (@_svanharmelen_) has written a well-documented CloudStack provider.

Terraform templates have a different (though still JSON-like) syntax from AWS CloudFormation, which among other things lets you add comments. Like StackMate, Terraform figures out the order of resource creation by building a dependency graph. You can also add explicit “depends_on” relationships. I played around with Terraform and created a couple of templates here:

https://github.com/chiradeep/terraform-cloudstack-examples

One template creates a VPC, 2 subnets and 2 VMs. The other template creates 2 isolated networks and a couple of VMs (one with NICs on both networks).

Pull requests accepted.

While there are awesome services and products out there that can do similar things (RightScale, Scalr, Citrix Lifecycle Management), it is great to see something open sourced and community-driven.