Designing a Scalable Deployment Pipeline

Anyone who’s led a product engineering team knows that a growing team requires investments in process, communication approaches, and documentation. These investments help new people get up to speed, become productive quickly, stay informed about what the rest of the team is doing, and codify tribal knowledge so it doesn’t leave with people.

One thing that receives less investment when a team scales is its deployment pipeline: the tools and infrastructure for deploying, testing, and running in production. Why are these investments lacking even when the team can identify the pain points? My theory is that the investment nearly always feels too expensive in terms of both money and lost progress on building features.

Following that theory, I now consider designing an effective and scalable deployment pipeline to be the first priority of a product engineering team—even higher than choosing a language or tech stack. The same staging/production design that was my standard just a few years ago now seems unacceptable.

What is a Deployment Pipeline?

Before we dive into what our deployment pipelines used to look like, let’s start by defining a few terms.

A deployment pipeline includes the automation, deploy environments, and process that support getting code from a developer’s laptop into the hands of an end user.

A deploy environment is a named version of the application. It can be uniquely addressed or installed by a non-developer team member. A developer can distinctly deploy an arbitrary version of the underlying codebase to it. Often, distinct deploy environments will also have unique sets of backing data.

A deployment process is the set of rules the team agrees upon regarding hand-off, build promotion between environments, source control management, and new functionality verification.

Automation is the approach to making mundane parts of the deployment process executable by computers as a result of a detectable event (e.g., a source control commit) or a manual push-button trigger.
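
To make these terms concrete, here is a minimal Python sketch of a deploy environment and an event-driven automation hook. The class names, the event names, and the `staging.example.com` address are hypothetical, chosen only for illustration; they are not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DeployEnvironment:
    """A named, uniquely addressable version of the application."""
    name: str                    # e.g. "staging" or "production"
    url: str                     # address a non-developer can open
    deployed_version: str = ""   # commit or tag currently deployed

    def deploy(self, version: str) -> None:
        # A developer can deploy an arbitrary version of the codebase here.
        self.deployed_version = version


class Automation:
    """Maps detectable events to the mundane steps they should trigger."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], None]]] = {}

    def on(self, event: str, handler: Callable[[str], None]) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def trigger(self, event: str, version: str) -> int:
        # Run every handler registered for the event; return how many ran.
        handlers = self._handlers.get(event, [])
        for handler in handlers:
            handler(version)
        return len(handlers)


# Hypothetical wiring: a source control commit deploys to staging.
staging = DeployEnvironment("staging", "https://staging.example.com")
automation = Automation()
automation.on("commit", staging.deploy)
automation.trigger("commit", "abc123")
```

The deployment process itself is the part that stays with people: it is the set of agreements about when each of these triggers should fire and who approves the result.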

Our Old Approach: A Hot Mess

In the recent past, our go-to template for a web app deployment pipeline utilized two deployment environments: staging and production.

Process

The process for utilizing these environments looked something like this:

  1. Developer works on a feature locally until it’s ready to be integrated and accepted.
  2. Developer integrates it with the version of the app on staging and deploys it to the staging environment.
  3. Delivery lead verifies that the feature is acceptable by functionally testing it in the staging environment.
  4. Delivery lead gives developer feedback for improvement or approves it as done.
  5. At some point, the developer deploys the features on staging to production.

Automation

We’d also, minimally, automate deployment of an arbitrary version of the app from a developer’s laptop to either environment.

Result

This deployment pipeline is straightforward and easy to implement, but it’s not easy to scale if, for example, you need to grow your dev team or you support a heavily used production deployment while simultaneously developing new product functionality.

The most common sign that a prod/staging pipeline is breaking down due to scaling demands is integration pain felt by the delivery lead in Step 3 of the process above. Multiple developers pile their feature updates and bug fixes onto the staging environment. Staging starts to feel like a traffic accident on top of a log jam. It’s a mix of verified and unverified bug fixes alongside accepted and brand-new feature enhancements. This results in regressions for which the root cause cannot be easily found. Since it’s all on staging, a delivery lead doesn’t know which change is a likely culprit, and they’re probably not sure which developer should investigate it.

It’s a hot mess.

In this scenario, the staging environment rarely provides a sense of confidence for the upcoming production deployment. Rather, it foretells the disaster your team is likely to encounter once you go live.

We Can Do Better

If we look at this problem through the lens of the theory of constraints, it’s obvious that the staging deploy environment is the pipeline’s constraint/bottleneck.

We don’t want to drop staging because it provides a valuable opportunity to validate app changes just outside of the live environment. Instead, we want staging to provide the most value possible, which we define as:

Provide a deploy environment identical to production except for one or two changes which can be verified one last time right before deploying them to production.

This definition of value implies that the staging environment spends a lot of time looking just like production, which is good. A clean staging environment is an open highway for the next feature or bug fix to be quickly deployed to production with confidence.

Deployed Dev Environments

To minimize the time a new feature spends on staging, we introduced new deploy environments which we call dev environments. These aren’t the same as local dev environments. A deploy environment needs to be uniquely addressable by the delivery lead; it can’t just be running on your laptop. The number of dev environments is fluid, scaling with the number of developers and the number of in-progress features and updates.

Process

If you think of staging as a clone of production, then think of a dev environment as a clone of staging. The new process looks like this:

  1. Developer works on a feature locally until it’s ready to be integrated and accepted.
  2. Developer spins up a dev environment (cloned from staging) and deploys a change to it.
  3. Delivery lead verifies the feature is acceptable by functionally testing it in the dev environment.
  4. Delivery lead gives developer feedback for improvement or approves it as done.
  5. Developer deploys change to staging and shuts down dev environment.
  6. Delivery lead spot checks change in staging and deploys it to production.

The main difference in our process is that iteration on feature acceptance feedback moves upstream, from the staging environment to the dev environments. This allows staging to be a clean clone of production most of the time and lets us validate multiple updates in parallel, isolated environments. The fact that features can be validated in isolated environments means we can more easily identify the root cause of a defect or regression resulting from a recent change.
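
The six steps above can be read as a small state machine for a single change. The following Python sketch is one way to model it, not a description of any real tool; the `Stage` and `Change` names are hypothetical. The process above does not say what happens when a staging spot check fails, so the sketch leaves that transition out.

```python
from enum import Enum


class Stage(Enum):
    LOCAL = "local"
    DEV_ENV = "dev environment"
    STAGING = "staging"
    PRODUCTION = "production"


# Allowed moves. Acceptance feedback loops between LOCAL and DEV_ENV,
# so staging only ever receives changes that were already approved.
ALLOWED = {
    Stage.LOCAL: {Stage.DEV_ENV},
    Stage.DEV_ENV: {Stage.LOCAL, Stage.STAGING},  # rework, or approved
    Stage.STAGING: {Stage.PRODUCTION},            # spot check, then release
    Stage.PRODUCTION: set(),
}


class Change:
    """One feature or bug fix moving through the pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stage = Stage.LOCAL
        self.history = [Stage.LOCAL]

    def move_to(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise ValueError(
                f"{self.name}: cannot go from {self.stage.value} "
                f"to {target.value}"
            )
        self.stage = target
        self.history.append(target)


# A change that needs one round of feedback before it is approved.
feature = Change("new-signup-form")
feature.move_to(Stage.DEV_ENV)     # step 2: deploy to a fresh dev environment
feature.move_to(Stage.LOCAL)       # step 4: feedback, back to the developer
feature.move_to(Stage.DEV_ENV)     # redeploy the improved version
feature.move_to(Stage.STAGING)     # step 5: approved, dev environment shut down
feature.move_to(Stage.PRODUCTION)  # step 6: spot check passed
```

Because there is no direct path from `LOCAL` to `STAGING`, unverified work cannot pile up on staging in this model, which is the property the new process is after.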

The idea of on-demand deploy environments may be uncommon, but it’s not new. Atlassian called them rush boxes. GitHub called them staff servers and let developers spin them up with Hubot commands.

Automation

In addition to automating deployment, we’ll need to automate the creation of a new dev environment to support this pipeline. Ideally, it should be a clone of staging and uniquely addressable (e.g. dev1.app.com, dev2.app.com, etc.).
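
As a small illustration of the "uniquely addressable" requirement, the sketch below picks the lowest free `devN` hostname given the ones already in use. The function name is hypothetical, and actually provisioning the environment and its DNS record is assumed to happen elsewhere.

```python
import re
from typing import Iterable


def next_dev_hostname(existing: Iterable[str], base_domain: str = "app.com") -> str:
    """Return the lowest-numbered devN.<base_domain> not already in use."""
    pattern = re.compile(r"^dev(\d+)\." + re.escape(base_domain) + r"$")
    used = set()
    for host in existing:
        match = pattern.match(host)
        if match:
            used.add(int(match.group(1)))
    n = 1
    while n in used:
        n += 1
    return f"dev{n}.{base_domain}"


# dev2 was shut down earlier, so its name is free to reuse.
print(next_dev_hostname(["dev1.app.com", "dev3.app.com", "staging.app.com"]))
```

Reusing the lowest free number keeps the set of hostnames compact as environments come and go; a team could just as reasonably name environments after branches or developers instead.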

Say you’re managing your deploy environments in a cloud service like AWS. Automating this process is doable with, at most, a few weeks of investment. As a stopgap, your team could also spin up a set of dev servers (one per developer) and suspend their computing resources (e.g., EC2 instances) when they’re not in use.
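
For the stopgap approach, the decision about which dev servers to suspend can be as simple as an idle timeout. Here is a minimal sketch, assuming the team records when each environment was last used; the two-hour threshold and the function name are hypothetical, and the actual suspend call to the cloud provider is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def environments_to_suspend(
    last_used: Dict[str, datetime],
    now: datetime,
    idle_after: timedelta = timedelta(hours=2),
) -> List[str]:
    """Return the names of dev environments idle for at least idle_after."""
    return sorted(
        name for name, used_at in last_used.items()
        if now - used_at >= idle_after
    )


now = datetime(2024, 1, 1, 12, 0)
last_used = {
    "dev1": datetime(2024, 1, 1, 11, 30),  # used 30 minutes ago: keep running
    "dev2": datetime(2024, 1, 1, 8, 0),    # idle for 4 hours: suspend
    "dev3": datetime(2024, 1, 1, 10, 0),   # idle for exactly 2 hours: suspend
}
idle = environments_to_suspend(last_used, now)
```

A scheduled job could run this check periodically and pass the resulting names to whatever API the hosting platform provides for stopping instances.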

In 2014, we started implementing this pipeline design on top of Heroku. This made cloning environments really easy via the built-in ability to fork a copy of an app.

The Golden Triforce of Deployment Tools

Today, if you use GitHub and Heroku, you can get everything I described above right out of the box with Heroku Pipelines and Heroku Review Apps. Because of this, GitHub + Heroku is a killer stack for teams focused on building their product over their infrastructure.

I’d also throw in CircleCI for continuous integration. It’s a nearly zero-conf CI service that can automatically split your slow test suite and run the pieces in parallel. All of these tools do a great job guiding a team to build a portable app. This makes it easy to move to another platform later, like AWS.
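
To show what splitting a test suite across parallel workers involves, here is a rough Python sketch that balances test files by their recorded run time using a greedy longest-first assignment. This is an illustration of the general idea only, not CircleCI's actual algorithm or API; the function name and timings are made up.

```python
from typing import Dict, List, Tuple


def split_tests(
    durations: Dict[str, float], workers: int
) -> Tuple[List[List[str]], List[float]]:
    """Assign test files to workers so total run times are roughly even."""
    buckets: List[List[str]] = [[] for _ in range(workers)]
    totals = [0.0] * workers
    # Place the slowest files first, each on the currently lightest worker.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        lightest = totals.index(min(totals))
        buckets[lightest].append(name)
        totals[lightest] += seconds
    return buckets, totals


# Hypothetical timings in seconds from a previous run.
timings = {"test_a.py": 30, "test_b.py": 20, "test_c.py": 20, "test_d.py": 10}
buckets, totals = split_tests(timings, workers=2)
```

With these timings, an 80-second serial run becomes two workers of about 40 seconds each, which is why parallel splitting matters most for slow suites.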

Deploying with Confidence

In summary: Use GitHub + Heroku + CircleCI unless you have a really good reason not to. Keep staging clean with on-demand dev environments. Deploy with confidence.

 
Conversation
  • Tomer says:

    Nice! You solved the bottleneck in staging by creating a new environment for each developer, on which their feature will be verified.
    In this solution, what is the added value of the staging environment? Does someone need to verify the feature twice, once on the dev environment and once on staging? Unless you want your feature to be live in a staging environment that gets some real traffic.
