
7 Ways to Automate Kubernetes at Scale in Production

Jan 24th, 2018 10:22am by Craig Martin
Feature image via Pixabay.

The Kubernetes open source container orchestration engine is not a management platform, nor should it be mistaken for one. The whole point of orchestration is to reliably automate the deployment and management of applications at scale, without the need for human intervention at each and every step. If the tools you use with and for Kubernetes don’t enable automation, then you’re not truly taking advantage of the benefits of orchestration.

To that end, here are seven ways you can and should be automating your Kubernetes cluster in production.

1) Logging

Any Kubernetes production environment will rely heavily on logs. At Kenzan, we typically try to separate out platform logging from application logging. This may be done via very different tooling and applications, or even by filtering and tagging within the logs themselves. As with any distributed system, logging provides the vital evidence for accurately tracing specific calls, even if they are on different microservices, so that a root cause may be identified.
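
As a simple illustration of that separation (the application label below is hypothetical), even plain kubectl can pull the two streams independently:

    # Platform logs: components running in the kube-system namespace
    kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

    # Application logs: pods carrying your own application label
    kubectl logs -l app=my-service --tail=100

In practice, a production cluster feeds both streams into a log aggregator with distinct tags, but the same namespace and label boundaries drive that routing.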

2) Self-Healing

We believe it’s next to impossible for your system to achieve high uptime without self-healing capabilities, especially in a distributed environment. Kubernetes regularly monitors the health of pods and containers and takes immediate action to resolve the issues it encounters. Two of the object types that Kubernetes natively recognizes are PodStatus and ContainerStatus.


Container probes (livenessProbe and readinessProbe) let you define how Kubernetes should determine whether a container is alive and ready. The readiness probe is particularly helpful because, when it fails, it leaves the pod running but stops routing traffic to it until the probe passes again.
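
A minimal sketch of both probes in a container spec (the pod name, image and endpoint paths are hypothetical) might look like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-service
    spec:
      containers:
      - name: my-service
        image: example/my-service:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        livenessProbe:                  # failure triggers a container restart
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:                 # failure stops traffic; the pod stays up
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10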

Be aware, however, that while self-healing is great to have, it can also mask a problem with your application: a container that quietly restarts every half-hour looks healthy from the outside. You need monitoring and logging functions robust enough to bubble up any issues that occur.

3) Resilience Testing

Depending on the needs of your application (e.g., 99.999 percent uptime), resilience testing can and should be part of your platform. Failure at any level of your application should be recoverable, so that no user experiences downtime. In our experience, bulletproof applications are only feasible if development teams know in advance that their work will be put through extensive resilience testing.

Although you can conduct a form of resilience testing through the simplest of manual methods, such as shutting down databases or killing pods at random, our experience has shown these methods are far more effective when automated. Netflix’s Chaos Monkey is a powerful, tremendously useful resilience testing tool, but it runs in Amazon Web Services and was not built for Kubernetes. Thankfully, resilience testing frameworks are emerging in the Kubernetes sphere, two of which are fabric8 Chaos Monkey (part of the fabric8 integrated development environment) and kube-monkey.
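
To give a flavor of how kube-monkey is wired up (the label keys follow its documentation; the deployment itself is a hypothetical example), workloads opt in to chaos testing via labels:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-service
      labels:
        kube-monkey/enabled: enabled        # opt this deployment in
        kube-monkey/identifier: my-service  # unique name kube-monkey tracks
        kube-monkey/mtbf: "2"               # mean time between failures, in days
        kube-monkey/kill-mode: "fixed"      # kill a fixed number of pods per run
        kube-monkey/kill-value: "1"         # ...one pod, in this case
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-service
      template:
        metadata:
          labels:
            app: my-service
            kube-monkey/enabled: enabled
            kube-monkey/identifier: my-service
        spec:
          containers:
          - name: my-service
            image: example/my-service:1.0   # hypothetical image

Because only labeled workloads are ever touched, teams can roll chaos testing out gradually rather than across the whole cluster at once.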

4) Routine Auditing

No matter how many checks and balances you put in place, your Kubernetes production environment will benefit from routine maintenance and auditing. Audits cover topics that normal monitoring will not. Traditionally, auditing has been a manual process, but the automated tooling in this space is improving quickly and dramatically.
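
As one small example of the kind of check that is easy to automate (the “:latest” rule is our own example policy, not a Kubernetes requirement), a scheduled script can dump every container image running in the cluster and flag mutable tags:

    # List namespace, pod and image for every container, then flag
    # anything still running the mutable "latest" tag.
    kubectl get pods --all-namespaces \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
      | grep ':latest'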

5) Autoscaling


For Kubernetes, scaling typically means one of two things:

  • Scaling the pods
  • Scaling the nodes within the cluster

Scaling pods is by far the more common of the two. It adds more instances of your services and readies them to accept traffic. Typically, pod-level scaling is driven by Heapster metrics, which determine whether new instances need to be created. We usually set our minimum pod count fairly low and trust the Kubernetes Horizontal Pod Autoscaler to set the optimal number of replicas, though we always set the minimum above one replica per cluster to avoid a single point of failure.
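
A minimal Horizontal Pod Autoscaler along those lines (the deployment name and thresholds are illustrative) might look like:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-service
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-service
      minReplicas: 2                        # never drop to a single point of failure
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70    # add replicas above 70% average CPU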

Scaling the nodes is rarer, but it can be a very useful mechanism for highly elastic applications. Node scaling requires the backing IaaS (AWS, GCP, etc.) to provision machines and register them with the Kubernetes cluster. This can be done manually, although we do not recommend that approach; we typically use tooling that automates scaling of the individual nodes, such as the Cluster Autoscaler in the Kubernetes repository. A node-level autoscaler performs two main actions: it adds nodes when more capacity is needed, and it removes nodes that are underutilized.
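
As a rough sketch of what that tooling looks like (flags as documented for the Cluster Autoscaler; the node group name is hypothetical), the autoscaler is pointed at an IaaS node group with minimum and maximum bounds:

    # Hypothetical Cluster Autoscaler invocation against an AWS Auto Scaling group
    cluster-autoscaler \
      --cloud-provider=aws \
      --nodes=2:10:my-node-group \
      --scale-down-enabled=true

The 2:10 range here means the autoscaler may grow the group to ten nodes under load and shrink it back toward two as utilization drops.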

6) Resource Quotas

A resource quota lets you limit a namespace within your Kubernetes platform, ensuring that one application cannot consume all the resources and impact other applications. Setting resource quotas can be a bit challenging. In our experience, the most diplomatic way to begin is to break down the namespaces by their expected load and use a ratio to calculate each one’s percentage of the cluster. Running Heapster also enables the kubectl top {node | pod} command, which shows current node or pod resource usage and can help inform quotas as well. From there, use monitoring and auditing to determine whether your partitioning is correct.
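
A sketch of such a quota (the namespace and the ratio-derived numbers are illustrative) might look like:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota
      namespace: team-a       # hypothetical namespace
    spec:
      hard:
        requests.cpu: "4"     # total CPU the namespace may request
        requests.memory: 8Gi
        limits.cpu: "8"       # hard ceiling across all pods in the namespace
        limits.memory: 16Gi
        pods: "20"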

7) Container Resource Constraints

Figuring out how many resources an individual container or pod requires has become something of an art. Historically, development teams have overestimated, requesting far more resources than their services actually need. We try to perform some level of load testing to see at what point a service falls over, and then allocate resources appropriately. Netflix coined the term “squeeze testing” for this method.
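
Once squeeze testing yields a realistic number, it translates directly into per-container requests and limits; the figures below are illustrative:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-service
    spec:
      containers:
      - name: my-service
        image: example/my-service:1.0
        resources:
          requests:           # what the scheduler reserves for the container
            cpu: 250m
            memory: 256Mi
          limits:             # the most the container is allowed to consume
            cpu: 500m
            memory: 512Mi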

For more detailed tools and tips on how to automate Kubernetes in each of these areas, see our chapter on Issues and Challenges with Using Kubernetes in Production in “The State of the Kubernetes Ecosystem” ebook from The New Stack.

Craig Martin, Kenzan
Craig Martin is Kenzan’s SVP of Engineering, where he leads engineering operations with a self-proclaimed mission of finding elegant solutions to complex challenges. In his role, Craig helps to lead the technical direction of the company, ensuring that new and emerging technologies are explored and adopted into the strategic vision. Recently, Craig has been focusing on helping companies make a digital transformation by building large-scale microservice applications. He has a long history of custom software development in the professional services space; prior to Kenzan, Craig was director of engineering at Flatiron Solutions. He received his Bachelor of Science from George Mason University.

TNS owner Insight Partners is an investor in: Kubernetes.