Design patterns in orchestrators: transfer of desired state (part 3/N)

Most datacenter automation tools operate on the basis of desired state. Desired state describes what should be the end state but not how to get there. To simplify a great deal, if the thing being automated is the speed of a car, the desired state may be “60mph”. How to get there (braking, accelerator, gear changes, turbo) isn’t specified. Something (an “agent”) promises to maintain that desired speed.


The desired state and changes to the desired state are sent from the orchestrator to various agents in a datacenter. For example, the desired state may be “two apache containers running on host X”. An agent on host X will ensure that the two containers are running. If one or more containers die, then the agent on host X will start enough containers to bring the count up to two. When the orchestrator changes the desired state to “3 apache containers running on host X”, then the agent on host X will create another container to match the desired state.

Transfer of desired state is another way to achieve idempotence (a problem described here)

We can see that there are two sources of changes that the agent has to react to:

  1. changes to desired state sent from the orchestrator and
  2. drift in the actual state due to independent / random events.

Let’s examine #1 in greater detail. There’s a few ways to communicate the change in desired state:

  1. Send the new desired state to the agent (a “command” pattern). This approach works most of the time, except when the size of the state is very large. For instance, consider an agent responsible for storing a million objects. Deleting a single object would involve sending the whole desired state (999999 items). Another problem is that the command may not reach the agent (“the network is not reliable”). Finally, the agent may not be able to keep up with rate of change of desired state and start to drop some commands.  To fix this issue, the system designer might be tempted to run more instances of the agent; however, this usually leads to race conditions and out-of-order execution problems.
  2. Send just the delta from the previous desired state. This is fraught with problems. This assumes that the controller knows for sure that the previous desired state was successfully communicated to the agent, and that the agent has successfully implemented the previous desired state. For example, if the first desired state was “2 running apache containers” and the delta that was sent was “+1 apache container”, then the final actual state may or may not be “3 running apache containers”. Again, network reliability is a problem here. The rate of change is an even bigger potential problem here: if the agent is unable to keep up with the rate of change, it may drop intermediate delta requests. The final actual state of the system may be quite different from the desired state, but the agent may not realize it! Idempotence in the delta commands helps in this case.
  3. Send just an indication of change (“interrupt”). The agent has to perform the additional step of fetching the desired state from the controller. The agent can compute the delta and change the actual state to match the delta. This has the advantage that the agent is able to combine the effects of multiple changes (“interrupt debounce”). By coalescing the interrupts, the agent is able to limit the rate of change. Of course the network could cause some of these interrupts to get “lost” as well. Lost interrupts can cause the actual state to diverge from the desired state for long periods of time. Finally, if the desired state is very large, the agent and the orchestrator have to coordinate to efficiently determine the change to the desired state.
  4. The agent could poll the controller for the desired state. There is no problem of lost interrupts; the next polling cycle will always fetch the latest desired state. The polling rate is critical here: if it is too fast, it risks overwhelming the orchestrator even when there are no changes to the desired state; if too slow, it will not converge the the actual state to the desired state quickly enough.

To summarize the potential issues:

  1. The network is not reliable. Commands or interrupts can be lost or agents can restart / disconnect: there has to be some way for the agent to recover the desired state
  2. The desired state can be prohibitively large. There needs to be some way to efficiently but accurately communicate the delta to the agent.
  3. The rate of change of the desired state can strain the orchestrator, the network and the agent. To preserve the stability of the system, the agent and orchestrator need to coordinate to limit the rate of change, the polling rate and to execute the changes in the proper linear order.
  4. Only the latest desired state matters. There has to be some way for the agent to discard all the intermediate (“stale”) commands and interrupts that it has not been able to process.
  5. Delta computation (the difference between two consecutive sets of desired state) can sometimes be more efficiently performed at the orchestrator, in which case the agent is sent the delta. Loss of the delta message or reordering of execution can lead to irrecoverable problems.

A persistent message queue can solve some of these problems. The orchestrator sends its commands or interrupts to the queue and the agent reads from the queue. The message queue buffers commands or interrupts while the agent is busy processing a desired state request.  The agent and the orchestrator are nicely decoupled: they don’t need to discover each other’s location (IP/FQDN). Message framing and transport are taken care of (no more choosing between Thrift or text or HTTP or gRPC etc).


There are tradeoffs however:

  1. With the command pattern, if the desired state is large, then the message queue could reach its storage limits quickly. If the agent ends up discarding most commands, this can be quite inefficient.
  2. With the interrupt pattern, a message queue is not adding much value since the agent will talk directly to the orchestrator anyway.
  3. It is not trivial to operate / manage / monitor a persistent queue. Messages may need to be aggressively expired / purged, and the promise of persistence may not actually be realized. Depending on the scale of the automation, this overhead may not be worth the effort.
  4. With an “at most once” message queue, it could still lose messages. With  “at least once” semantics, the message queue could deliver multiple copies of the same message: the agent has to be able to determine if it is a duplicate. The orchestrator and agent still have to solve some of the end-to-end reliability problems.
  5. Delta computation is not solved by the message queue.

OpenStack (using RabbitMQ) and CloudFoundry (using NATS) have adopted message queues to communicate desired state from the orchestrator to the agent.  Apache CloudStack doesn’t have any explicit message queues, although if one digs deeply, there are command-based message queues simulated in the database and in memory.

Others solve the problem with a combination of interrupts and polling – interrupt to execute the change quickly, poll to recover from lost interrupts.

Kubernetes is one such framework. There are no message queues, and it uses an explicit interrupt-driven mechanism to communicate desired state from the orchestrator (the “API Server”) to its agents (called “controllers”).

Courtesy of Heptio

(Image courtesy:

Developers can use (but are not forced to use) a controller framework to write new controllers. An instance of a controller embeds an “Informer” whose responsibility is to watch for changes in the desired state and execute a controller function when there is a change. The Informer takes care of caching the desired state locally and computing the delta state when there are changes. The Informer leverages the “watch” mechanism in the Kubernetes API Server (an interrupt-like system that delivers a network notification when there is a change to a stored key or value). The deltas to the desired state are queued internally in the Informer’s memory. The Informer ensures the changes are executed in the correct order.

  • Desired states are versioned, so it is easier to decide to compute a delta, or to discard an interrupt.
  • The Informer can be configured to do a periodic full resync from the orchestrator (“API Server”) – this should take care of the problem of lost interrupts.
  • Apparently, there is no problem of the desired state being too large, so Kubernetes does not explicitly handle this issue.
  • It is not clear if the Informer attempts to rate-limit itself when there are excessive watches being triggered.
  • It is also not clear if at some point the Informer “fast-forwards” through its queue of changes.
  • The watches in the API Server use Etcd watches in turn. The watch server in the API server only maintains a limited set of watches received from Etcd and discards the oldest ones.
  • Etcd itself is a distributed data store that is more complex to operate than say, an SQL database. It appears that the API server hides the Etcd server from the rest of the system, and therefore Etcd could be replaced with some other store.

I wrote a Network Policy Controller for Kubernetes using this framework and it was the easiest integration I’ve written.

It is clear that the Kubernetes creators put some thought into the architecture, based on their experiences at Google. The Kubernetes design should inspire other orchestrator-writers, or perhaps, should be re-used for other datacenter automation purposes. A few issues to consider:

  • The agents (“controllers”) need direct network reachability to the API Server. This may not be possible in all scenarios, needing another level of indirection
  • The API server is not strictly an orchestrator, it is better described as a choreographer. I hope to describe this difference in a later blog post, but note that the API server never explicitly carries out a step-by-step flow of operations.

Design patterns in Orchestrators (part 2 of N) – southbound APIs

This is part of a series on design patterns in building orchestration systems. The focus is on orchestrators found in clouds, data-centers, networking systems, etc, but the principles should be broadly applicable.

In a previous post we touched on an important issue that is a side effect of performing orchestration over a network: idempotent operations. The communication between the controller and the subsystems is sometimes called “southbound”, while the API offered by the controller is called “northbound”. Of course the northbound API could be the southbound API for a uber-controller and the southbound API could be the northbound API of a smaller system.

Specifying the contract between the controller and the south subsystem produces a tension between good enough and perfect. A REST API is sometimes used to write the specification since it implies specific requirements on the various verbs. For example the PUT and GET operations have to be idempotent while a POST need not be. However the system architect may have to posit idempotent properties for the POST operations as well as described previously. The REST API endpoint for the subsystem can also provide monitoring and other operational data useful to the system operator.

A drawback of specifying a REST API for a subsystem is that it tends to make the orchestrator hierarchical. The subsystem REST API implementation itself becomes a mini controller that needs its own locus of operations. A REST API can also be rigid, making it hard to evolve – this can be problematic especially in the early phases of system design. Finally, what could be a single hop between the controller and the subsystem, becomes two hops – this increases the operational burden.


Famously, the OpenStack project specifies REST APIs for various subsystems (Cinder – block device, Neutron – network subsystem). Subsystem vendors implement drivers/plugins that implement the southbound API.  The driver implementation could in turn call vendor REST APIs or other southbound APIs (e.g., OpenFlow, NetConf, SNMP) to various devices. Driver implementations are often mini-controllers that maintain the desired state of the subsystem in a persistent/durable store.


An alternate model is to specify the southbound API in the programming language of the controller, for example, as a Java interface (e.g., Apache CloudStack), or an OSGI plugin (OpenDaylight, ONOS).


The driver / plugin is responsible for translating the API call into the specific subsystem API call. For example hypervisor plugins for each hypervisor (XenServer, KVM, VMWare) in Apache CloudStack use the respective hypervisor APIs (XAPI, libvirt, vSphere API) to implement the hypervisor plugin API. Plugins in this case can use the persistent store of the main controller. A drawback of this approach is that the plugin has to be written in the language of the controller. Installing/upgrading a new plugin in a running production system may also produce some downtime. Last but not least, the system architect must be vigilant that the driver / plugin code not call back into the controller or directly use/modify the controller’s state store.

Adding support for a new vendor is easier in the the REST southbound API model – the vendor just has to provide an implementation translating the southbound REST API to the vendor’s API in a language of their choice. However, the addition of layers complicates operations, troubleshooting and upgrades.

The in-process plugin model of Apache CloudStack, OpenDaylight, etc., makes it easier to install and get a system operational. A single locus of operations also makes it easier to test, operate, troubleshoot and and upgrade. Developing a plugin, however, is more complicated since it requires a knowledge of the developer tooling used to develop the controller.

Design patterns in orchestrators (part 1 of n) – idempotent operations

Orchestration is a somewhat overloaded term in the context of automation. Generally, it implies a central controller that tries to bring a complicated system to a desired state. There are usually a large number of subsystems that the controller manages. Changing the state of the system involves communicating with the subsystems in order to get them to change their state. The communication usually happens over a network.orchestration1

As a simple example, consider a home automation controller that is trying to get the home ready to receive its occupants by:

  1. Setting the indoor temperature by setting the thermostats
  2. Opening the garage door
  3. Turning on lights
  4. Turning on the tea kettle

The network however is unreliable. There are several failure modes to consider:

  1. The message from the controller may never reach the subsystem. Usually the subsystem acknowledges the control messages from the controller. The controller may implement a timeout so that if the subsystem never gets the message, the controller times out waiting for the acknowledgement and executes some kind of recovery
  2. The message may reach the subsystem but the subsystem is not ready or not in a state to process it. The controller will get a negative acknowledgement in this case and needs to execute another kind of fault recovery procedure.
  3. The message reaches the subsystem and the subsystem executes the requested control, but fails to complete the requested task. For example, it may request a downstream subsystem to execute a task, but that downstream subsystem fails (again, perhaps due to the network). The controller may or may not get a different negative acknowledgement in this case. The subsystem may even fail midway through the task.
  4. The subsystem gets the message, executes the task perfectly, but the acknowledgement never reaches the controller. The controller usually times out and executes some kind of fault recovery procedure.

Distinguishing between these kinds of failures at the controller is a little hard. If there is a timeout, it can’t determine if the subsystem performed the requested task or not. A common recovery procedure is to re-try the command to the subsystem. Within this recovery mode, the controller has to decide:

  • how many retries
  • how long to retry
  • when to alert a human

Depending on the semantics of the task, there are different answers. Consider an orchestration flow where the controller has to set up a virtual machine. The tasks involved could be to allocate storage, program network elements such as switches, routers and DHCP servers, choose hypervisor hosts and so on. Any of these tasks could fail. Retrying indefinitely to allocate storage when there is not enough storage available doesn’t make sense. Retrying because there was a timeout might make sense. Alerting a human when there are hundreds or thousands of subsystems being modified doesn’t scale – it is better to design recoverability into the system.

When the controller re-tries the command to the subsystem, it is possible to have an unexpected effect. Let’s say the storage subsystem in the virtual machine example did allocate the storage as requested the first time, but the controller didn’t receive the acknowledgement. The controller retries the command, resulting in double allocation at the storage system.

The solution in this case is to ensure that the commands from the controller to the subsystem are idempotent. That is, executing the same command multiple times produces the same result. The trick is uniquely identify the change that is being requested. The subsystem stores/remembers the identifier so that if the change is re-requested, it doesn’t re-do the change. The identifier can be opaque (i.e., the structure or contents of the id have no semantics, like a uuid ) or be derived from the state description sent to the subsystem (e.g., a file name). Opaque identifiers help avoid  leaky abstractions between the controller and the subsystem. In many cases the subsystem cannot be modified to be idempotent (e.g., proprietary systems, different admin space), so a non-opaque identifier has to be used. Examples include fully-qualified domain names, filesystem paths and IP addresses.

The idempotency trick helps in another corner case: where the subsystem reboots / re-initializes or gets recreated due to a failure: it may not know the last command / desired state sent by the controller. For example, consider the case of the home automation system where a defective thermostat is replaced. The new thermostat contacts the home automation controller. The controller re-sends the last control command. Since the new thermostat doesn’t have a record of the unique identifier in the command, it applies the change requested by the command.

A complex system with many subsystems and resources is constantly changing state independent of the controller. For example, hosts reboot, network switches go down, disks fail, and so on. The controller has to detect when the system has drifted from the desired state and then execute compensating commands to the subsystems to bring them back to the desired state. Having idempotent commands with unique identifiers is crucial to this recovery.

Architects of orchestration controllers often discover the need for idempotent operations well after implementation is in production. Since the controller usually in turn offers an API, the system architect has to ensure that this “northbound” API also supports idempotent commands / operations. Even Amazon Web Services (AWS) introduced idempotent run-instances quite late in the game (2010).

A (round)trip with Java and Go

Can you generate Go code from a Java binary jar? This need arose from writing a prototype Ingress Controller  for Kubernetes that uses a NetScaler to provide the Ingress function. Another need was a Terraform driver for NetScaler. (While the NetScaler Ingress controller didn’t need a Golang client, it is customary to write Kubernetes integrations using Go).

NITRO is the REST API to program the Citrix NetScaler load balancer. The API is easy to use with well-defined usage patterns. Most commonly, JSON is used to create/update/read/delete configuration on the NetScaler.  There are Java and Python clients, (as well as PowerShell and Perl), but I needed a Golang client to NITRO.

There were a few roadblocks to producing the Go client. One problem is that the NITRO API is vast (over 1000 config objects with corresponding JSON definitions). Second, the JSON  documentation is in HTML docs, or, one has to resort to tools like Postman to reverse engineer the JSON schema.

After studying the NITRO Java SDK source, it occurred to me that  each config object in the REST API had a corresponding Java class. For example had fields that represented the possible fields in the JSON config object to configure a lbvserver 

The task was to generate a JSON v4 schema from the NITRO Java SDK. This was relatively easy after finding JJSchema . This project generates a JSON schema from Java classes. There were a few changes required to make JJSchema work with the NITRO jar: JJSchema assumed that all fields had an accessor of the form getFoo(): In the case of NITRO, it was get_foo(). JJSchema also relies on field annotations (@Attributes) to figure out metadata such as enums and read-only. For enums, the Java classes in the NITRO package had inner classes with static member constants, and figuring out the readonly attribute was a matter of finding fields that didn’t have set_foo() methods. The resulting changes are in a fork: The code to invoke JJSchema is in This involves determining all the subclasses of com.citrix.netscaler.nitro.resource.config and invoking the forked JJSchema on those classes.

Getting from JSON schema to Go involves another open source project (Generate). As the blurb says, Generate generates Go Structs from JSON schema. The generated schema is used in a (somewhat incomplete) Go client to the NITRO API (

Producing the JSON schema has the nice side-effect that it should be easier to write new clients (Ruby/Javascript anyone?) to the NITRO API. Who knows which language will catch the fancy of systems geeks a year from now (Rust?).



Consul-template and Citrix Netscaler

Consul-template (consul-template) is a tool that can drive reconfiguration of applications and infrastructure in response to changes in the keys/values stored in Consul. Usually it is used to populate changes into a local file in the filesystem. Following the change, the application or infrastructure software is usually restarted.

Previously I’d written up integrations between container managers such as Kubernetes and Netscaler. It was a relatively simple matter to include support for consul-template with a slight tweak. Netscaler only supports a REST-based configuration API (“Nitro“), so populating a config file on Netscaler was not going to do the job. The solution was simple: write a JSON file using consul-template and then ask consul-template to execute a python script to convert the JSON to Nitro API calls.


consul-template -consul $CONSUL_IP:8500 -template consul_single_svc.ctmpl:cfg.json:"python --cfg-file cfg.json

Here cfg.json is the intermediate JSON file produced by consul-template.

You can get the code here

Apple’s iCloud is a multi-cloud beast

Apple device users have probably taken and stored 100 billion photos:

  • In early 2013, the number was 9 billion
  • There are 100 million iPhones in active use in 2015. If each iPhone takes 1000 pictures per year, that’s 100 billion photos in 2015 alone.
  • Photos are automatically backed up to iCloud since iOS 5

Update: This article estimates the number of active iPhones at 700 Million (March 2017). During WWDC 2017, Apple revealed that iPhone users take 1 trillion photos a year.

I’d assumed that iCloud is a massive compute and storage cloud, operated like the datacenters of Google and Amazon.

Turns out that, at least for photo storage, iCloud is actually composed of Amazon’s S3 storage service and Google’s Cloud Storage service. I serendipitously discovered this while copying some photos from my camera’s SD card to my Macbook using the native Photos app. I’d recently installed  ‘Little Snitch‘ to see why the camera light on my Macbook turns on for no reason. Little Snitch immediately alerted me that Photos was trying to connect to Amazon’s S3 and Google’s Cloud Storage:

So it looks like Apple is outsourcing iCloud storage to two different clouds. At first glance this is strange: AWS S3 promises durability of 99.999999999%, so backing up to Google gains very little reliability for a doubling of cost.

It turns out that that AWS S3 and Google Storage are used differently:

For the approximately 200 hi-res photos that I was copying from my camera’s SD card, AWS S3 stores a LOT (1.58 GB), while Google stores a measly 50 MB. So Apple is probably using Google for something else. Speculation:

AWS S3 has an SLA of 99.99%. For the cases where it is unavailable (but photos are still safe), Google can be used to store / fetch low-res versions of the Photo stream.

The Google location could also be used to store an erasure code, although from the size, it seems unlikely.

Apple charges me $2.99 per month (reduced from $3.99 per month last fall) for 200GB of iCloud storage. Apple should be paying (according to the published pricing) between $2.50 and $5.50 per month to Amazon AWS for this. Add in a few pennies for Google’s storage, they are are probably break-even or slightly behind. If they were to operate their own S3-like storage, they would probably make a small -to- medium profit instead. I’ve calculated some numbers based on 2 MB per iPhone image.

per TB per month
2 PB – 1 billion
20 PB – 10
billion photos
200 PB – 100
billion photos
2000 PB – 1
trillion photos
-$5 -$10,000 -$100,000 -$1,000,000 -$10,000,000
-$10 -$20,000 -$200,000 -$2,000,000 -$20,000,000
$10 $20,000 $200,000 $2,000,000 $20,000,000
$20 $40,000 $400,000 $4,000,000 $40,000,000

Given Apple’s huge profits of nearly $70 billion per year, paying Amazon about a quarter a billion for worry-free, infinitely scalable storage seems worth it.

I haven’t included the cost of accessing the data from S3, which can be quite prohibitive, but I suspect that Apple uses a content delivery network (CDN) for delivering the photos to your photo stream.


Multi-cloud is clearly not a mythical beast. It is here and big companies like Apple are already taking advantage of it.

Save money on your AWS bill

A couple of years ago I was confronted with a bill of several hundred dollars because I’d forgotten to turn off some machines on AWS ( I think it was an ELB – elastic load balancer). Since then, I make it a point to login and check often to see if I’ve left stuff running. I’ve automated this simple check here:

You can run the check using AWS Lambda as well. Just make sure you configure a ScheduledEvent trigger for it.


Hope you save some money with this tip.



Automated configuration of NetScaler Loadbalancer for Kubernetes, Mesos and Docker Swarm

There are an incredible number of Cluster Managers for containerized workloads. The top clustered container managers include Google’s Kubernetes, the Marathon framework on Apache Mesos and Docker Swarm. While these managers offer powerful scheduling and autonomic capabilities, integration with external load balancers is often left as an exercise for the user. Since load balancers are essential components in a horizontally-scaled microservice this omission can impede the roll-out of your chosen container manager. Further, since a microservice architecture demands rapid deployment, any solution has to be able to keep up with the changes in the topology and structure of the microservice.

Citrix NetScaler is an application delivery controller widely used in load balancing applications at several Web-scale companies. This blog post describes Nitrox, a containerized application that can work with Docker Swarm or Kubernetes or Marathon (on Apache Mesos) to automatically reconfigure a Citrix NetScaler instance in response to application events such as deployment, rolling upgrades and auto/manual scale.

The figure below shows a scaling event that causes the number of backend containers for application α to grow from 4 to 6. The endpoints of the additional containers have to be provisioned into the NetScaler as a result.


All cluster managers offer an event stream of application lifecycle events. In this case, Docker Swarm sends a container start event; Marathon sends a status_update_event with a taskStatus field and Kubernetes sends an endpoint update event. The job of Nitrox is to listen to these event streams and react to the events. In most cases this fires a query back to the cluster manager to obtain the new list of endpoints. This list of endpoints is then configured into NetScaler.


The reconfiguration of Netscaler is idempotent and complete: if the endpoint already exists in the Netscaler configuration, it isn’t re-done. This prevents unnecessary reconfiguration. The set of endpoints sent to Netscaler is not-incremental: the entire set is sent. This overcomes any problem with missed / dropped events and causes the Netscaler configuration to be eventually consistent.

Another choice made was to let the application / NetScaler admin provision the frontend details of the application. The front-end has myriad options such as lbmethod, persistence and stickiness. It is likely that each application has different needs for this configuration; also it is assumed to be chosen once and not dependent on size and scope of the backends.

You can find Nitrox  code and instructions here. The container is available on Docker Hub as chiradeeptest/nitrox. The containerized Nitrox can be scheduled like a regular workload on each of the container managers: docker run on Docker Swarm, an app on Marathon and a replication controller on Kubernetes.

Implementation Notes

Kubernetes ( and optionally: Docker Swarm) requires virtual networking (Kubernetes is usually  used with flannel). Therefore the container endpoints are endpoints on a virtual network. Since the NetScaler doesn’t participate in the virtual network (consider a non-virtualized NetScaler), this becomes a problem. For Docker Swarm, Nitrox assumes that bridged networking is used.

For Kubernetes, it is assumed that the service (app) being load balanced is configured to use NodePort style of exposing the service to external access. Kubernetes chooses a random port and exposes this port on every node in the cluster. Each node has a proxy that can provide access on this port. The proxy load balances the ingress traffic to each backend pod (container). One strategy then would be to simply configure the NetScaler to load balance to every node in the cluster. However, even if there are say 2 containers in the application but there are 50 nodes, then the Netscaler would needlessly send the traffic to many nodes. To make this more efficient, Nitrox figures out the list of nodes that the containers are actually running on and provisions these endpoints on the NetScaler.


Quick Tip: Docker Machine on Apache CloudStack and XenServer

There is now Docker Machine support for Apache CloudStack. See @atsaki‘s work at

docker-machine create -d cloudstack \
--cloudstack-api-url CLOUDSTACK_API_URL \
--cloudstack-api-key CLOUDSTACK_API_KEY \
--cloudstack-secret-key CLOUDSTACK_SECRET_KEY \
--cloudstack-template "Ubuntu Server 14.04" \
--cloudstack-zone "zone01" \
--cloudstack-service-offering "Small" \
--cloudstack-expunge \

Another way to do this is to launch your VM in CloudStack and then use the generic driver (assuming you have the private key from your sshkeypair):

docker-machine create -d generic \
--generic-ssh-key=SSH_PRIVATE_KEY  \


This will ALSO work for plain old VMs created on XenServer  (which currently does not have a driver).

Bonus: in either case you can use docker-machine to set up a Docker Swarm by adding the parameters:

--swarm \
--swarm-discovery token://\

Apache Mesos and Kubernetes on Apache CloudStack

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes. Mesos is used by a number of web-scale companies such as Twitter, Airbnb and even Apple.

Containers Layers

Cluster managers such as Mesos and Kubernetes are easier to set up than a full-blown IAAS stack: they do not orchestrate network, storage and other services. Plus they solve the problem of dividing (and bundling) the capacity of a single virtual machine into more useful chunks. Mesos can schedule containers on the cluster in addition to other workloads. Cluster managers are easy to setup on traditional virtualization infrastructure as well (check out Citrix Lifecycle Manager  for an example). But without persistent volumes, load balancers and other network services, cluster managers may not be able to tackle the full range of workloads handled by IAAS.

If you already have Apache CloudStack up and running and want to run a cluster manager on it, it just got easier. I used Packer and Terraform to completely automate  the provisioning of a full Mesos cluster. This recipe (here) first uses Packer to build a re-usable Ubuntu 14.04 image with the required packages installed (Zookeeper, Mesos, Marathon, etc). The Terraform configuration drives the creation of the cluster using this template.

Just for completeness, I have  Kubernetes-on-CloudStack automated as well (using Terraform). For better or worse, both Mesos and Kubernetes are rapidly evolving, so the automation may be broken by the time you are trying it out. Feel free to open a pull request to correct any errors.