A Tale of Two Deployments - Machine Images, Immutable Servers and Green/Blue Deployment

This is part 1 of a series of articles about deployment strategies. Part 2 can be found here, and talks more about the background of our two protagonists and how they came to be here.

Ali walked into his office. It was the day of the big deployment, and the office was nervous. Deployments were hard! So many things to go wrong! Just as he started to hope that things would go his way, his boss appeared in front of him, looking flustered.

"The hype has driven up visitors way more than we anticipated! We need more servers!"

Oh no. Nobody had provisioned any servers since the existing ones were brought up - they had just been pushing code to them during deployments, and manually administering them when they needed to install new libraries and such. Ali found the old provisioning scripts, but since they had not been kept up to date, he had to battle with them to get two more servers up and running. There had to be an easier way to scale!

As Ali was walking into his office, Sally was just sitting down at her desk. They were going to have a deployment today, but it was no big deal. Their deployment pipeline had been designed to filter out almost everything that could go wrong during a deployment, so it did not bother her.

Coincidentally, they too were having a big release. Hype for their app had also driven traffic levels up past what the current servers could handle. Her boss had sent her an email asking her to increase the number of application servers, so she sat down and spun them up.

You see, her team used 'machine images' for deployment. When they deployed, they did not simply push the new code revision of their app to their servers. They first 'baked' a virtual machine image containing an entire OS - including libraries and tools - and the application, ready to run. Then they spun up new servers using that machine image as a template.

This can be likened to printing many copies of a letter or flyer. It would be inefficient and error-prone to write each one out by hand and expect them all to be identical - rather, a template would be created for the printing press and used to print every copy. To publish a new version, the template would be updated and the whole batch printed again.
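To make this concrete: if Sally's team were on AWS, for example, the 'baking' step might look something like the following minimal sketch using the boto3 SDK (the instance ID and image name are hypothetical, and a real pipeline would more likely drive a purpose-built tool such as Packer):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Assume "i-0abc123" is a builder instance that has already been
# provisioned with the OS packages, libraries and application code.
response = ec2.create_image(
    InstanceId="i-0abc123",
    Name="myapp-release-42",  # hypothetical name for this release's image
    Description="MyApp release 42, baked by the deployment pipeline",
)

print("Baked machine image:", response["ImageId"])
```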

Since that machine image already existed, Sally was able to use it to spin up a new set of servers and add them to the pool of existing ones. The same machine image had been used to create the existing servers, so the new ones were guaranteed to behave identically. Scaling with machine images was easy and worry-free.
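Spinning up that extra capacity can be a single API call against the same SDK. A sketch, again assuming AWS, with the image ID standing in for whatever the existing servers were built from:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Launch two more servers from the already-proven machine image.
# "ami-0def456" is a placeholder for the image the current pool uses.
ec2.run_instances(
    ImageId="ami-0def456",
    InstanceType="t3.medium",  # assumed instance size
    MinCount=2,
    MaxCount=2,
)
```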

This pipeline made sure that they had 'immutable servers' - servers that did not change once deployed. The only way to make changes was to bake a new image and deploy new servers to replace the old ones. Since the automation scripts were the only way to build the machine images for these servers, nobody could log onto a server and make manual changes that stuck - every server was guaranteed to match its image, and the scripts themselves, being exercised on every deployment, always stayed up to date.

If one of Ali's servers went down, he would have to try to fix it manually, or go through the same hassle to rebuild it that he had faced when bringing up the new ones.

If Sally had issues with a server, she just used the machine image to spin up a replacement. Her team slept better at night, especially once they automated that process and made their server clusters 'auto-healing'.
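One common way to get that auto-healing behaviour, sketched here under the same AWS assumption, is an auto-scaling group that replaces any server failing its health checks (all names and subnets are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

# Keep four servers alive at all times. If one fails its load-balancer
# health checks, the group terminates it and spins up a replacement
# from the same machine image - no human intervention required.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="myapp-production",
    LaunchTemplate={
        "LaunchTemplateName": "myapp-release-42",  # points at the baked image
        "Version": "$Latest",
    },
    MinSize=4,
    MaxSize=4,
    DesiredCapacity=4,
    HealthCheckType="ELB",
    HealthCheckGracePeriod=120,
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",
)
```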

Now, Ali was having trouble with his new servers. When he built them, he had installed the latest version of ImageMagick available in the OS's repositories. Unfortunately, his company's app relied on an obscure feature of the ImageMagick version installed on the old servers, and the new version had changed this behaviour, corrupting images served by the new servers. So a webpage would sometimes display correctly and sometimes not, depending on which server answered. If only there were a way of making new servers exactly the same as existing ones.

Sally's team had no such problem. Because they baked a machine image for every deployment and used it to spin up every server, the entire OS - libraries included - was snapshotted at bake time, so library versions simply could not differ between servers.

Also, Sally's build pipeline deployed the same machine image to staging as to production. So any issues with the OS, such as library incompatibilities, were filtered out in staging. Once staging worked, production was guaranteed to work too. This let them confidently update the OS on their servers regularly.
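The key property is that both environments are built from the same artefact. A pipeline step for this could be as simple as the sketch below, where the image ID is a placeholder and only the environment tag differs between the two deploys:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def deploy(image_id: str, environment: str) -> None:
    """Spin up a small cluster for one environment from a machine image."""
    ec2.run_instances(
        ImageId=image_id,
        InstanceType="t3.medium",
        MinCount=2,
        MaxCount=2,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "environment", "Value": environment}],
        }],
    )

image_id = "ami-0def456"  # placeholder: the image baked for this release
deploy(image_id, environment="staging")
# ... run checks against staging, and only once it is proven:
deploy(image_id, environment="production")  # the exact same artefact
```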

Sally's team later hit the ImageMagick issue as well, during a deployment to staging, so they created a fix, which made its way into a new machine image. That image was used to spin up new staging servers, which the team verified did not have the issue. They then deployed the new machine image to production, confident that there would be no surprises.

Finally, Ali was about to press the proverbial button - deployment time. They had gotten through all the issues - ImageMagick versions, scaling, faulty servers. He pressed enter and waited with bated breath for the deployment to finish. If it stopped half-way, the site would be down. If it took too long, some servers would keep serving requests with old code until their turn to be updated came. If only there were a way to do this without all this stress, he thought...

Sally's team did not have this problem. They used a strategy called 'green/blue deployment' (also known as 'red/black deployment') that allowed them to do zero-downtime deploys safely. When they deployed to production, they did not replace the existing production servers straight away; they kept the old cluster, and its machine image, around. Instead, they used the new machine image to spin up a completely new set of production servers, hidden from view. They would then access the app on these servers using a secret URL only they could access, and check that it behaved as expected. Once they were happy, they would start routing their users to the new cluster. If something went wrong during deployment, or an issue was found that had been missed on staging, they could fix the problem without affecting the users, who were still using the old servers.
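In load-balancer terms, 'routing users to the new cluster' can be a single update that points the public listener at the new servers - and rolling back is the same call aimed at the old ones. A sketch, once more assuming AWS, with every ARN a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:eu-west-1:111111111111:listener/app/myapp/abc/def"

def route_traffic_to(target_group_arn: str) -> None:
    """Point the public listener at one cluster's target group."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "TargetGroupArn": target_group_arn,
        }],
    )

# Cut over to the freshly deployed 'green' cluster...
route_traffic_to("arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/myapp-green/123")
# ...and rollback, if needed, is the same call with the 'blue' cluster:
# route_traffic_to("arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/myapp-blue/456")
```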

This also made rollback easy - instead of having to mess around on production servers, moving a symlink or redeploying code like Ali would, her team simply started routing users to the old server cluster again.

What this teaches us is that a robust pipeline designed to filter out deployment issues is just as important as the testing that filters out code issues. Deploying to staging tests your deployment in the same way that running your test suite tests your code.

Deployments need not be stressful, fearful events, but routine tasks that have had all possible issues automated away. Infrastructure and deployment should be treated as a software problem to be worked on, solved, and improved, just like the problems we solve in our application code. This leads to less stress and worry, more happiness, and thus more productivity in the long run. Neglecting them and just pushing code to manually maintained servers leads to mounting worry and constant procrastination, as a big task grows bigger and bigger, until a disastrous downtime event forces us into action and costs us and our team dear.

There are other benefits to using machine images as well. Setting up a brand new environment is easy: just launch another server cluster from the machine image. That makes it easy to perform a deployment per branch, so developers can show off their work in isolation. The machine image can also be used to run a virtual machine locally on a developer's machine, easing their dependency-management pains. And continuous integration can run on a server spawned from the machine image, further guaranteeing that the image will be problem-free when it eventually makes its way to production through the pipeline.

Asfand Qazi runs The DevOps Doctors, a DevOps consultancy. Contact him for a free consultation if you need help with your infrastructure and deployment process.