A (round)trip with Java and Go

Can you generate Go code from a Java binary jar? The need arose while writing a prototype Ingress controller for Kubernetes that uses a NetScaler to provide the Ingress function. Another need was a Terraform driver for NetScaler. (While the NetScaler Ingress controller didn’t strictly need a Go client, it is customary to write Kubernetes integrations in Go.)

NITRO is the REST API used to program the Citrix NetScaler load balancer. The API is easy to use, with well-defined usage patterns: most commonly, JSON is used to create/update/read/delete configuration on the NetScaler. There are Java and Python clients (as well as PowerShell and Perl), but I needed a Go client for NITRO.

There were a few roadblocks to producing a Go client. First, the NITRO API is vast (over 1,000 config objects with corresponding JSON definitions). Second, the JSON documentation is in HTML docs, or one has to resort to tools like Postman to reverse engineer the JSON schema.

After studying the NITRO Java SDK source, it occurred to me that each config object in the REST API has a corresponding Java class. For example, com.citrix.netscaler.nitro.resource.config.lb.lbvserver has fields that represent the possible fields in the JSON config object used to configure an lbvserver.

The task was to generate a JSON Schema (draft v4) from the NITRO Java SDK. This was relatively easy after finding JJSchema, a project that generates a JSON schema from Java classes. A few changes were required to make JJSchema work with the NITRO jar. JJSchema assumes that every field has an accessor of the form getFoo(); in the case of NITRO, it is get_foo(). JJSchema also relies on field annotations (@Attributes) to figure out metadata such as enums and read-only fields. For enums, the Java classes in the NITRO package have inner classes with static member constants, and figuring out the read-only attribute was a matter of finding fields that don’t have set_foo() methods. The resulting changes are in a fork: https://github.com/chiradeep/JJSchema. The code that invokes JJSchema is at https://github.com/chiradeep/json-nitro: it determines all the subclasses of com.citrix.netscaler.nitro.resource.config and invokes the forked JJSchema on those classes.

Getting from JSON schema to Go involves another open-source project, Generate. As its blurb says, Generate generates Go structs from JSON schema. The generated structs are used in a (somewhat incomplete) Go client for the NITRO API (https://github.com/chiradeep/go-nitro).
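
To give a flavor of the result, here is a minimal sketch of what a generated struct and a raw NITRO call could look like. The struct shows only a handful of lbvserver fields, and the host, credentials and wrapper code are illustrative rather than the actual go-nitro API:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Lbvserver mirrors a few of the fields that Generate emits for the
// lbvserver JSON schema (the real struct has many more fields).
type Lbvserver struct {
	Name        string `json:"name,omitempty"`
	Servicetype string `json:"servicetype,omitempty"`
	Ipv46       string `json:"ipv46,omitempty"`
	Port        int    `json:"port,omitempty"`
}

func main() {
	lb := Lbvserver{Name: "web-lb", Servicetype: "HTTP", Ipv46: "10.0.0.100", Port: 80}

	// NITRO wraps each config object in an envelope keyed by the resource name.
	body, _ := json.Marshal(map[string]Lbvserver{"lbvserver": lb})

	req, _ := http.NewRequest("POST", "https://netscaler.example.com/nitro/v1/config/lbvserver",
		bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-NITRO-USER", "nsroot")
	req.Header.Set("X-NITRO-PASS", "password")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("NITRO responded with", resp.Status)
}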

Producing the JSON schema has a nice side effect: it should make it easier to write new clients (Ruby? JavaScript?) for the NITRO API. Who knows which language will catch the fancy of systems geeks a year from now (Rust?).


Consul-template and Citrix Netscaler

Consul-template is a tool that drives reconfiguration of applications and infrastructure in response to changes in the keys/values stored in Consul. Usually it is used to write changes into a local file in the filesystem; following the change, the application or infrastructure software is restarted.

Previously I’d written up integrations between container managers such as Kubernetes and NetScaler. With a slight tweak, it was a relatively simple matter to add support for consul-template. NetScaler only supports a REST-based configuration API (“Nitro”), so populating a config file on the NetScaler was not going to do the job. The solution was simple: write a JSON file using consul-template and then ask consul-template to execute a Python script that converts the JSON into Nitro API calls.

So:

consul-template -consul $CONSUL_IP:8500 -template consul_single_svc.ctmpl:cfg.json:"python main.py --cfg-file cfg.json"

Here cfg.json is the intermediate JSON file produced by consul-template.

You can get the code here.

Apple’s iCloud is a multi-cloud beast

Apple device users have probably taken and stored 100 billion photos:

  • In early 2013, the number was 9 billion
  • There are 100 million iPhones in active use in 2015. If each iPhone takes 1000 pictures per year, that’s 100 billion photos in 2015 alone.
  • Photos have been automatically backed up to iCloud since iOS 5

I’d assumed that iCloud is a massive compute and storage cloud, operated like the datacenters of Google and Amazon.

Turns out that, at least for photo storage, iCloud is actually composed of Amazon’s S3 storage service and Google’s Cloud Storage service. I serendipitously discovered this while copying some photos from my camera’s SD card to my MacBook using the native Photos app. I’d recently installed ‘Little Snitch’ to see why the camera light on my MacBook turns on for no reason. Little Snitch immediately alerted me that Photos was trying to connect to both Amazon’s S3 and Google’s Cloud Storage.

So it looks like Apple is outsourcing iCloud storage to two different clouds. At first glance this is strange: AWS S3 promises durability of 99.999999999%, so backing up to Google gains very little additional durability for a doubling of cost.

It turns out that AWS S3 and Google Cloud Storage are used differently:

For the approximately 200 hi-res photos that I was copying from my camera’s SD card, AWS S3 stores a LOT (1.58 GB), while Google stores a measly 50 MB. So Apple is probably using Google for something else. Speculation:

AWS S3 has an SLA of 99.99%. For the cases where it is unavailable (but photos are still safe), Google can be used to store / fetch low-res versions of the Photo stream.

The Google location could also be used to store an erasure code, although from the size, it seems unlikely.

Apple charges me $2.99 per month (reduced from $3.99 per month last fall) for 200 GB of iCloud storage. According to the published pricing, Apple should be paying between $2.50 and $5.50 per month to Amazon AWS for this. Add in a few pennies for Google’s storage, and they are probably breaking even or slightly behind. If they were to operate their own S3-like storage, they would probably make a small-to-medium profit instead. I’ve calculated some numbers in the table below, based on 2 MB per iPhone image; each row assumes a different profit or loss per TB per month, and each cell shows the resulting monthly total.

Profit/Loss per TB per month   2 PB (1 billion photos)   20 PB (10 billion photos)   200 PB (100 billion photos)   2000 PB (1 trillion photos)
-$5                            -$10,000                  -$100,000                   -$1,000,000                   -$10,000,000
-$10                           -$20,000                  -$200,000                   -$2,000,000                   -$20,000,000
$10                            $20,000                   $200,000                    $2,000,000                    $20,000,000
$20                            $40,000                   $400,000                    $4,000,000                    $40,000,000
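
The arithmetic behind the table is simple; here is a minimal sketch of it in Go, assuming 2 MB per photo and treating each row’s per-TB figure as a hypothetical monthly margin:

package main

import "fmt"

func main() {
	const mbPerPhoto = 2.0                          // assumed average iPhone photo size
	margins := []float64{-5, -10, 10, 20}           // hypothetical profit/loss per TB per month ($)
	photoCounts := []float64{1e9, 1e10, 1e11, 1e12} // 1 billion to 1 trillion photos

	for _, margin := range margins {
		for _, photos := range photoCounts {
			tb := photos * mbPerPhoto / 1e6 // MB -> TB (10^6 MB per TB)
			fmt.Printf("%g photos (%.0f PB): $%.0f per month\n", photos, tb/1000, margin*tb)
		}
	}
}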

Given Apple’s huge profits of nearly $70 billion per year, paying Amazon about a quarter of a billion dollars for worry-free, infinitely scalable storage seems worth it.

I haven’t included the cost of accessing the data from S3, which can be quite prohibitive, but I suspect that Apple uses a content delivery network (CDN) for delivering the photos to your photo stream.


Multi-cloud is clearly not a mythical beast. It is here and big companies like Apple are already taking advantage of it.

Save money on your AWS bill

A couple of years ago I was confronted with a bill of several hundred dollars because I’d forgotten to turn off some machines on AWS (I think it was an ELB, an Elastic Load Balancer). Since then, I make it a point to log in and check often to see if I’ve left stuff running. I’ve automated this simple check here: https://github.com/chiradeep/idle-instance-reaper

You can run the check using AWS Lambda as well. Just make sure you configure a ScheduledEvent trigger for it.
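
The repository’s implementation may differ (and to run it on Lambda you would wrap the check in a handler), but a minimal Go sketch of the kind of check involved, using the AWS SDK for Go, looks roughly like this:

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Credentials and region come from the environment / instance profile.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)

	// Anything that is still running is still costing money.
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("instance-state-name"),
			Values: []*string{aws.String("running")},
		}},
	})
	if err != nil {
		fmt.Println("DescribeInstances failed:", err)
		return
	}
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			fmt.Printf("running: %s (launched %s)\n", *inst.InstanceId, inst.LaunchTime)
		}
	}
}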


Hope you save some money with this tip.


Automated configuration of NetScaler Loadbalancer for Kubernetes, Mesos and Docker Swarm

There is an incredible number of cluster managers for containerized workloads. The top clustered container managers include Google’s Kubernetes, the Marathon framework on Apache Mesos and Docker Swarm. While these managers offer powerful scheduling and autonomic capabilities, integration with external load balancers is often left as an exercise for the user. Since load balancers are essential components in a horizontally scaled microservice architecture, this omission can impede the roll-out of your chosen container manager. Further, since a microservice architecture demands rapid deployment, any solution has to be able to keep up with changes in the topology and structure of the microservice.

Citrix NetScaler is an application delivery controller widely used for load balancing applications at several web-scale companies. This blog post describes Nitrox, a containerized application that can work with Docker Swarm, Kubernetes or Marathon (on Apache Mesos) to automatically reconfigure a Citrix NetScaler instance in response to application events such as deployment, rolling upgrades and auto/manual scaling.

The figure below shows a scaling event that causes the number of backend containers for application α to grow from 4 to 6. The endpoints of the additional containers have to be provisioned into the NetScaler as a result.

[Figure: a scaling event grows application α from 4 to 6 backend containers]

All cluster managers offer an event stream of application lifecycle events. In this case, Docker Swarm sends a container start event, Marathon sends a status_update_event with a taskStatus field, and Kubernetes sends an endpoint update event. The job of Nitrox is to listen to these event streams and react to the events. In most cases this fires a query back to the cluster manager to obtain the new list of endpoints. That list of endpoints is then configured into the NetScaler.


The reconfiguration of the NetScaler is idempotent and complete: if an endpoint already exists in the NetScaler configuration, it isn’t re-added, which prevents unnecessary reconfiguration. The set of endpoints sent to the NetScaler is not incremental: the entire set is sent every time. This overcomes any problem with missed or dropped events and causes the NetScaler configuration to be eventually consistent.
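
A minimal sketch of that reconcile step (illustrative names, not Nitrox’s actual code): compute the difference between the desired full set of endpoints and what the load balancer currently has, then bind/unbind only the difference.

package main

import "fmt"

// Endpoint identifies a backend container endpoint.
type Endpoint struct {
	IP   string
	Port int
}

// reconcile computes which endpoints to add to and remove from the load
// balancer so that its configuration matches the full desired set.
// Applying the same desired set twice is a no-op, which makes it idempotent.
func reconcile(desired, current []Endpoint) (add, remove []Endpoint) {
	want := map[Endpoint]bool{}
	have := map[Endpoint]bool{}
	for _, e := range desired {
		want[e] = true
	}
	for _, e := range current {
		have[e] = true
	}
	for e := range want {
		if !have[e] {
			add = append(add, e)
		}
	}
	for e := range have {
		if !want[e] {
			remove = append(remove, e)
		}
	}
	return add, remove
}

func main() {
	desired := []Endpoint{{"10.1.0.4", 8080}, {"10.1.0.5", 8080}, {"10.1.0.6", 8080}}
	current := []Endpoint{{"10.1.0.4", 8080}, {"10.1.0.5", 8080}}
	add, remove := reconcile(desired, current)
	fmt.Println("bind:", add, "unbind:", remove)
}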

Another design choice was to let the application / NetScaler admin provision the frontend details of the application. The frontend has myriad options such as lbmethod, persistence and stickiness. It is likely that each application has different needs for this configuration; it is also assumed to be chosen once and not dependent on the size and scope of the backends.

You can find the Nitrox code and instructions here. The container is available on Docker Hub as chiradeeptest/nitrox. The containerized Nitrox can be scheduled like a regular workload on each of the container managers: docker run on Docker Swarm, an app on Marathon and a replication controller on Kubernetes.

Implementation Notes

Kubernetes (and, optionally, Docker Swarm) requires virtual networking (Kubernetes is usually used with flannel). Therefore the container endpoints are endpoints on a virtual network. Since the NetScaler doesn’t participate in the virtual network (consider a non-virtualized NetScaler), this becomes a problem. For Docker Swarm, Nitrox assumes that bridged networking is used.

For Kubernetes, it is assumed that the service (app) being load balanced is exposed for external access in the NodePort style. Kubernetes chooses a random port and exposes it on every node in the cluster. Each node has a proxy that can provide access on this port, and the proxy load balances the ingress traffic to each backend pod (container). One strategy would be to simply configure the NetScaler to load balance to every node in the cluster. However, if the application has, say, 2 containers but the cluster has 50 nodes, the NetScaler would needlessly send traffic to many nodes. To make this more efficient, Nitrox figures out the list of nodes that the containers are actually running on and provisions only those endpoints on the NetScaler.
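
As a rough sketch of that node discovery (using today’s client-go API, which Nitrox itself predates; the namespace, label selector and service name are made up):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this code runs inside the cluster.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The NodePort that Kubernetes assigned to the service.
	svc, err := clientset.CoreV1().Services("default").Get(context.TODO(), "myapp", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	nodePort := svc.Spec.Ports[0].NodePort

	// Only the nodes that actually host a pod of the app need to be
	// provisioned as endpoints on the NetScaler.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=myapp"})
	if err != nil {
		panic(err)
	}
	nodes := map[string]bool{}
	for _, p := range pods.Items {
		if p.Status.HostIP != "" {
			nodes[p.Status.HostIP] = true
		}
	}

	for ip := range nodes {
		fmt.Printf("provision NetScaler service %s:%d\n", ip, nodePort)
	}
}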


Quick Tip: Docker Machine on Apache CloudStack and XenServer

There is now Docker Machine support for Apache CloudStack. See @atsaki’s work at https://github.com/atsaki/docker-machine-driver-cloudstack

docker-machine create -d cloudstack \
--cloudstack-api-url CLOUDSTACK_API_URL \
--cloudstack-api-key CLOUDSTACK_API_KEY \
--cloudstack-secret-key CLOUDSTACK_SECRET_KEY \
--cloudstack-template "Ubuntu Server 14.04" \
--cloudstack-zone "zone01" \
--cloudstack-service-offering "Small" \
--cloudstack-expunge \
docker-machine

Another way to do this is to launch your VM in CloudStack and then use the generic driver (assuming you have the private key from your SSH keypair):

docker-machine create -d generic \
--generic-ip-address=VM_IP \
--generic-ssh-key=SSH_PRIVATE_KEY \
--generic-ssh-user=SSH_USER \
VM_NAME

This will ALSO work for plain old VMs created on XenServer (which currently does not have a driver).

Bonus: in either case you can use docker-machine to set up a Docker Swarm by adding the parameters:

--swarm \
--swarm-discovery token://SWARM_TOKEN

Apache Mesos and Kubernetes on Apache CloudStack

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes. Mesos is used by a number of web-scale companies such as Twitter, Airbnb and even Apple.


Cluster managers such as Mesos and Kubernetes are easier to set up than a full-blown IaaS stack: they do not orchestrate network, storage and other services. Plus, they solve the problem of dividing (and bundling) the capacity of a single virtual machine into more useful chunks. Mesos can schedule containers on the cluster in addition to other workloads. Cluster managers are easy to set up on traditional virtualization infrastructure as well (check out Citrix Lifecycle Manager for an example). But without persistent volumes, load balancers and other network services, cluster managers may not be able to tackle the full range of workloads handled by IaaS.

If you already have Apache CloudStack up and running and want to run a cluster manager on it, it just got easier. I used Packer and Terraform to completely automate the provisioning of a full Mesos cluster. This recipe (here) first uses Packer to build a reusable Ubuntu 14.04 image with the required packages installed (ZooKeeper, Mesos, Marathon, etc.). The Terraform configuration then drives the creation of the cluster using this template.

Just for completeness, I have Kubernetes-on-CloudStack automated as well (using Terraform). For better or worse, both Mesos and Kubernetes are rapidly evolving, so the automation may be broken by the time you try it out. Feel free to open a pull request to correct any errors.

Farming your CloudStack cloud

A couple of years ago, I blogged about my prototype of StackMate, a tool and a service that interprets AWS CloudFormation-style templates and creates CloudStack resources. The idea was to provide an application management solution. I didn’t develop the idea beyond a working prototype. Terraform from HashiCorp is a similar idea, but with the ability to add extensions (providers) to drive resource creation in different clouds and service providers. Fortunately, Terraform is solid and widely used. Even better, Sander van Harmelen (@_svanharmelen_) has written a well-documented CloudStack provider.

Terraform templates have a different (but JSON-like) syntax than AWS CloudFormation, which among other things lets you add comments. Like StackMate, Terraform figures out the order of resource creation by building a dependency graph. You can also add explicit “depends_on” relationships. I played around with Terraform and created a couple of templates here:

https://github.com/chiradeep/terraform-cloudstack-examples

One template creates a VPC with 2 subnets and 2 VMs. The other template creates 2 isolated networks and a couple of VMs (one with NICs on both networks).

Pull requests accepted.

While there are awesome services and products out there that can do similar things (RightScale, Scalr, Citrix Lifecycle Management), it is great to see something open sourced and community-driven.


How HP Labs nearly invented the cloud

On the heels of HP’s news of not-quite abandoning the cloud, there is coverage of how AWS stole a march on Sun’s plans to provide compute-on-demand. The timeline for AWS starts in late 2003, when an internal team at Amazon hatched a plan that, among other things, could offer virtual servers as a retail offering. Sun’s offering involved bare metal and running jobs, not virtual machines.

In a paper published in 2004, a group of researchers at HP Labs proposed what they called “SoftUDC”: a software-based utility data center. The project involved:

  • API access to virtual resources
  • Virtualization using the Xen Hypervisor
  • Network virtualization using UDP overlays almost identical to VxLAN
  • Virtual Volumes accessible over the network from any virtual machine (like EBS)
  • “Gatekeeper” software in the hypervisor that provides the software and network virtualization
  • Multi-tier networking using subnetting and edge appliances (“VPC”)
  • Automated OS and application upgrades using the “cattle” technique (just replace instead of upgrade).
  • Control at the edge: firewalls, encryption and QoS guarantees provided at the hypervisor

Many of these ideas now seem “obvious”, but remember, this was 2004. Several of them were even implemented. For example, VNET, the network virtualization stack/protocol, was implemented as a driver in Xen dom0 that took Ethernet frames exiting the hypervisor and encapsulated them in UDP frames.

Does this mean HP could have been the dominant IaaS player instead of AWS if only it had acted on its Labs innovation? Of course not. But let’s say in 2008, when AWS was a clear danger, it could’ve dug a little deeper into its own technological inventory to produce a viable competitor early on. Instead we got OpenStack.

Many of AWS’s core components are based on similar concepts: the Xen hypervisor, network virtualization, virtual volumes, security groups, and so on. No doubt they came up with these concepts on their own; more importantly, they implemented them and had a strategy for building a business around them.

Who knows what innovations are cooking today in various big companies, only to get discarded as unviable ideas. This can be framed as the Innovator’s Dilemma as well.

How to manage a million firewalls – part 2

Continuing from my last post, where I hinted at the big distributed-systems problem involved in managing a CloudStack Basic Zone.

It helps to understand how CloudStack is architected at a high level. CloudStack is typically operated as a cluster of identical Java applications (called the “Management Server” or “MS”). There is a MySQL database that holds the desired state of the cloud. API calls arrive at a management server (through a load balancer). The management server uses the current state as stored in the MySQL database, computes/stores a new state and communicates any changes to the cloud infrastructure.


In response to an API call, the management server(s) usually have to communicate with one or more hypervisors. For example, adding a rule to a security group (a single API call) could involve communicating changes to dozens or hundreds of hypervisors. The job of communicating with the hypervisors is split (“sharded”) among the cluster members. For example, if there are 3,000 hypervisors and 3 management servers, then each MS handles communication with 1,000 hypervisors. If the API call arrives at MS ‘A’ but needs to update a hypervisor managed by MS ‘B’, the communication is brokered through B.
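
The sharding itself can be thought of as a stable mapping from hypervisor to management server. CloudStack’s actual assignment is more involved (it has to cope with management servers joining and leaving), but a minimal sketch of the idea:

package main

import (
	"fmt"
	"hash/fnv"
)

// ownerOf maps a hypervisor to one of n management servers using a stable
// hash, so each MS handles roughly 1/n of the hypervisors.
func ownerOf(hypervisorID string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(hypervisorID))
	return int(h.Sum32()) % n
}

func main() {
	managementServers := 3
	for _, hv := range []string{"hv-0001", "hv-0002", "hv-2999"} {
		fmt.Printf("%s -> MS %d\n", hv, ownerOf(hv, managementServers))
	}
}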

Now, updating a thousand firewalls (remember, the firewalls are local to the hypervisors) in response to a single API call requires us to think about the API call semantics. Waiting for all 1,000 firewalls to respond could take a very long time. The better approach is to return success to the API caller and work in the background to update the 1,000 firewalls. It is also likely that the update will fail on a small percentage of the firewalls, due to any number of problems: (transient) network problems between the MS and the hypervisor, a problem with the hypervisor hardware, and so on.

This problem can be described in terms of the CAP theorem as well. A piece of state (the state of the security group) is being stored on a number of distributed machines (the hypervisors in this case). When there is a network partition (P), do we want every update to that state to be Consistent (every copy of the state is the same), or do we want the API to remain Available? Choosing Availability ensures that the API call never fails, regardless of the state of the infrastructure. But it also means that the state is potentially inconsistent across the infrastructure when there is a partition.

A lot of the problems with an inconsistent state can be hand-waved away [1], since the default behavior of the firewall is to drop traffic. If the firewall doesn’t get the new rule or the new IP address, the inconsistency is safe: we are not letting in traffic that we didn’t intend to allow.

A common strategy in AP systems is to be eventually consistent. That is, at some undefined point in the future, every node in the distributed system will agree on the state. So, for example, an API call may need to update a hundred hypervisors when only 95 of them are available; at some point in the future, the remaining 5 become available and are updated to the correct state.

When a previously disconnected hypervisor reconnects to the MS cluster, it is easy to bring it up to date, since the authoritative state is stored in the MySQL database associated with the CloudStack MS cluster.

A different distributed-systems problem is dealing with concurrent writes. Let’s say you send a hundred API calls in quick succession to the MS cluster to start a hundred VMs. Each VM creation leads to changes in many different VM firewalls. Not every API call lands on the same MS: the load balancer in front of the cluster distributes them across the machines in the cluster. Visualizing the timeline:

[Figure: timeline of concurrent API calls and the resulting firewall updates]

A design goal is to push the updates to the VM firewalls as soon as possible (this is to minimize the window of inconsistency). So, as the API calls arrive, the MySQL database is updated and the new firewall states are computed and pushed to the hypervisors.

While MySQL concurrency primitives allow us to safely modify the database (effectively serializing the updates to the security groups), the order of updates to the database may not be the order of updates that flow to the hypervisors. For example, in the timeline above, the firewall state computed as a result of the API call at T=0 might arrive at the firewall for VM A after the firewall state computed at T=2. We cannot accept the “older” update.

The obvious [2] solution is to include the order of computation in the message (update) sent to the firewall. Every time an API call results in a change to the state of a VM firewall, we update a persistent sequence number associated with that VM. That sequence number is transmitted to the firewall along with the new state. If the firewall notices that the latest update received is “older” than the one it has already processed, it just ignores it. In the figure above, the “red” update gets ignored.

A crucial point is that every update to the firewall has to contain the complete state: it cannot just be the delta from the previous state [3].

The sequence number has to be stored on the hypervisor so that it can be compared against the sequence number of each incoming update. The sequence number also optimizes the updates to hypervisors that reconnect after a network partition has healed: if the sequence number matches, no updates are necessary.
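
Putting the last few points together, the hypervisor-side logic is essentially a guarded, last-writer-wins apply. This is a minimal sketch with made-up types; the real agent also has to persist the sequence numbers across restarts and actually program the firewall:

package main

import "fmt"

// FirewallState is the complete set of rules for one VM (never a delta).
type FirewallState struct {
	Seq   int64    // sequence number assigned by the management server
	Rules []string // full rule set, e.g. "allow tcp 0.0.0.0/0 80"
}

// Agent holds the last applied state per VM on one hypervisor.
type Agent struct {
	applied map[string]FirewallState // keyed by VM id
}

// Apply installs the update only if it is newer than what was last applied.
func (a *Agent) Apply(vm string, update FirewallState) bool {
	if last, ok := a.applied[vm]; ok && update.Seq <= last.Seq {
		return false // stale or duplicate update: ignore it
	}
	// ... program the actual per-VM firewall with update.Rules here ...
	a.applied[vm] = update
	return true
}

func main() {
	a := &Agent{applied: map[string]FirewallState{}}
	fmt.Println(a.Apply("vm-A", FirewallState{Seq: 2, Rules: []string{"allow tcp 80"}})) // true
	fmt.Println(a.Apply("vm-A", FirewallState{Seq: 1, Rules: []string{"allow tcp 22"}})) // false: older update
}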

Well, I’ve tried to keep this part under a thousand words. The architecture discussed here did not converge easily; there were a lot of mistakes and learning along the way. There is no way for other cloud / orchestration systems to reuse this code; however, I hope the reader will learn from my experience!


1. The only case to worry about is when rules are deleted: an inconsistent state potentially means we are allowing traffic that we didn’t intend to. In practice, rule deletes are a very small portion of the changes to security groups. Besides, if the rule exists because it was intentionally created, it is probably OK to take a little time to delete it.
2. Other (not-so-good) solutions involve locks per VM and queues per VM.
3. This is a common pattern in orchestrating distributed infrastructure.