Building a management and observability plane for a multi-cluster Kubernetes setup.

Context.

The client is a large-scale provider of on-prem and cloud network orchestration and assurance software. Their flagship product is deployed as a Kubernetes cluster that can scale to hundreds of nodes, and they were already running more than 50 such clusters for their customers. As the client grew, they needed better control and observability across all these clusters. The clusters were often deployed on-premise, where external connectivity was a challenge.

The One2N team had hands-on experience working directly with data centres and building end-to-end observability for Kubernetes clusters, and was able to solve the problem and help the client scale efficiently and confidently.

Problem Statement.

The client's platform was initially built with a single-tenant design, spinning up a new Kubernetes cluster for each customer they signed up. This worked well until the client hit the One to N growth phase, when it became increasingly difficult to manage operations across all the clusters, and there was little visibility into the system's overall health. The client had a strong engineering team that understood the pain and had started upgrading the product to a truly multi-tenant platform. However, with that being a multi-month project, they still needed to get the existing problem under control. The client had clear requirements for the following:

  • A control plane for managing Kubernetes clusters, both current and new.
  • A single-pane-of-glass view for monitoring the health of each cluster and of the fleet as a whole.

In addition, the client wanted expert advice on handling Disaster Recovery (DR) scenarios and on Kubernetes tooling for achieving SOC 2 Type 2 compliance.

Outcome/Impact.

Easier management of clusters, with the added benefit of simpler RBAC, using a highly available Rancher setup that supports both managed EKS and on-prem Kubernetes clusters.

Centralized observability that gives insight into what is happening inside each individual tenant cluster and across all the clusters.

Increased confidence and trust in the backup and recovery of production clusters in case of a disaster.

Centralized compliance policy enforcement using the existing Rancher setup and open-source technologies.

Solution.

  • The client's precise requirements helped us identify the challenges to work on. We separated the cluster management and observability problems from each other and worked on them in parallel to deliver faster.
  • Our experts evaluated different control plane tools, and with prior experience managing data centres, we quickly ruled out approaches that could not work because of blocked ingress communication. We chose the battle-tested, open-source Rancher as the cluster management solution, since it works with almost all Kubernetes distributions. This mattered because the client also had to manage Kubernetes clusters independently in on-premise setups. Rancher offers rich features for managing K8s clusters, their nodes, and all K8s resources through an intuitive web portal.
  • An added benefit of Rancher is that it streamlines access to the control plane from a single place. Instead of configuring EKS access through AWS IAM for AWS users only, any user can be given access to the Rancher web portal with role-based permissions (see the access-check sketch after this list).
  • For observability, we split the problem further into collecting metrics and collecting logs.
    • For logs, we proposed multiple options that could fit, and the client chose the EFK (Elasticsearch, Fluentd, Kibana) stack as they already had Elasticsearch expertise in-house.
    • For metrics, Prometheus was chosen because the client already had in-cluster monitoring in place in each cluster. We extended this architecture by configuring Prometheus remote write, which allowed us to collect metrics from the individual clusters into a centralized Prometheus (see the query sketch after this list).
    • With logs and metrics collected into a single observability setup, we built dashboards in Grafana for metrics and in Kibana for logs. This gave the client a single-pane-of-glass view.
  • We also presented the client with multiple proofs of concept for cluster backup and recovery, with a strong recommendation for Velero, a cloud-native, CNCF-incubated open-source project. We demonstrated how to perform cluster recovery in case of a disaster using Velero, and created and documented a DR plan (a backup spot-check sketch follows this list).
  • Since the client was preparing for SOC 2 Type 2 compliance, we helped them understand what could be enforced at the Kubernetes cluster level. We built proofs of concept using OPA Gatekeeper and showed how they could be extended with Rancher Fleet to manage Kubernetes compliance policies for all clusters from a single place (a violations-report sketch follows this list).
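
The sketches below are minimal, hypothetical Go examples that illustrate the building blocks above; endpoints, file paths, and resource names in them are assumptions rather than the client's actual configuration. The first one shows how Rancher-managed RBAC can be verified programmatically: using a kubeconfig issued by the Rancher portal for a downstream cluster, it asks the API server (via client-go) whether the current user is allowed to perform a given action.

  package main

  import (
      "context"
      "fmt"
      "log"

      authorizationv1 "k8s.io/api/authorization/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      // Kubeconfig downloaded from the Rancher web portal for a downstream
      // cluster (the path is a placeholder).
      config, err := clientcmd.BuildConfigFromFlags("", "/path/to/rancher-kubeconfig.yaml")
      if err != nil {
          log.Fatalf("loading kubeconfig: %v", err)
      }
      clientset, err := kubernetes.NewForConfig(config)
      if err != nil {
          log.Fatalf("creating clientset: %v", err)
      }

      // Ask the API server whether the current user may list pods in the
      // "default" namespace. Roles granted in the Rancher portal map down
      // to Kubernetes RBAC, so the answer reflects what was set there.
      review := &authorizationv1.SelfSubjectAccessReview{
          Spec: authorizationv1.SelfSubjectAccessReviewSpec{
              ResourceAttributes: &authorizationv1.ResourceAttributes{
                  Namespace: "default",
                  Verb:      "list",
                  Resource:  "pods",
              },
          },
      }
      resp, err := clientset.AuthorizationV1().SelfSubjectAccessReviews().
          Create(context.Background(), review, metav1.CreateOptions{})
      if err != nil {
          log.Fatalf("running access review: %v", err)
      }
      fmt.Printf("allowed to list pods in default: %v\n", resp.Status.Allowed)
  }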
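
Next, the metrics side: once each tenant cluster ships its metrics over Prometheus remote write, the central Prometheus can be queried to confirm which clusters are actually reporting. This sketch uses the official Prometheus Go client; the endpoint address and the cluster label are illustrative assumptions (the label would typically be attached via each tenant's external_labels).

  package main

  import (
      "context"
      "fmt"
      "log"
      "time"

      "github.com/prometheus/client_golang/api"
      v1 "github.com/prometheus/client_golang/api/prometheus/v1"
  )

  func main() {
      // Address of the central Prometheus that receives remote-write traffic
      // from every tenant cluster (placeholder endpoint).
      client, err := api.NewClient(api.Config{
          Address: "http://central-prometheus.example.internal:9090",
      })
      if err != nil {
          log.Fatalf("creating Prometheus client: %v", err)
      }
      promAPI := v1.NewAPI(client)

      ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
      defer cancel()

      // Count healthy scrape targets per cluster. The "cluster" label is
      // assumed to be attached by each tenant Prometheus before the samples
      // are remote-written.
      result, warnings, err := promAPI.Query(ctx, `count by (cluster) (up == 1)`, time.Now())
      if err != nil {
          log.Fatalf("querying central Prometheus: %v", err)
      }
      if len(warnings) > 0 {
          log.Printf("query warnings: %v", warnings)
      }
      fmt.Println(result)
  }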
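
For the DR plan, the state of Velero backups can be spot-checked programmatically as well as with the velero CLI. This sketch lists Velero Backup custom resources through client-go's dynamic client and prints each backup's phase; the kubeconfig path is a placeholder, and the velero namespace is simply Velero's conventional install location.

  package main

  import (
      "context"
      "fmt"
      "log"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
      "k8s.io/apimachinery/pkg/runtime/schema"
      "k8s.io/client-go/dynamic"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
      if err != nil {
          log.Fatalf("loading kubeconfig: %v", err)
      }
      client, err := dynamic.NewForConfig(config)
      if err != nil {
          log.Fatalf("creating dynamic client: %v", err)
      }

      // Velero stores backups as custom resources in the velero.io/v1 group.
      backupGVR := schema.GroupVersionResource{
          Group:    "velero.io",
          Version:  "v1",
          Resource: "backups",
      }

      // Velero is conventionally installed into the "velero" namespace.
      backups, err := client.Resource(backupGVR).Namespace("velero").
          List(context.Background(), metav1.ListOptions{})
      if err != nil {
          log.Fatalf("listing Velero backups: %v", err)
      }

      // Print each backup and its phase (e.g. Completed, PartiallyFailed).
      for _, b := range backups.Items {
          phase, _, _ := unstructured.NestedString(b.Object, "status", "phase")
          fmt.Printf("%s\t%s\n", b.GetName(), phase)
      }
  }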
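
Finally, the compliance side: Gatekeeper's audit controller records the violations it finds on each constraint's status, so a small per-cluster report (or one distributed via Rancher Fleet) can be assembled from those objects. The K8sRequiredLabels constraint kind and the totalViolations status field used below come from the common Gatekeeper policy library and are assumptions about what a given policy set would contain.

  package main

  import (
      "context"
      "fmt"
      "log"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
      "k8s.io/apimachinery/pkg/runtime/schema"
      "k8s.io/client-go/dynamic"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
      if err != nil {
          log.Fatalf("loading kubeconfig: %v", err)
      }
      client, err := dynamic.NewForConfig(config)
      if err != nil {
          log.Fatalf("creating dynamic client: %v", err)
      }

      // Gatekeeper creates one CRD per constraint kind under the
      // constraints.gatekeeper.sh group; K8sRequiredLabels is a common kind
      // from the Gatekeeper policy library (assumed here for illustration).
      gvr := schema.GroupVersionResource{
          Group:    "constraints.gatekeeper.sh",
          Version:  "v1beta1",
          Resource: "k8srequiredlabels",
      }

      // Constraints are cluster-scoped, so no namespace is needed.
      constraints, err := client.Resource(gvr).List(context.Background(), metav1.ListOptions{})
      if err != nil {
          log.Fatalf("listing constraints: %v", err)
      }
      for _, c := range constraints.Items {
          // The audit controller writes the number of violations it found
          // into each constraint's status.
          total, _, _ := unstructured.NestedInt64(c.Object, "status", "totalViolations")
          fmt.Printf("%s\tviolations: %d\n", c.GetName(), total)
      }
  }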

Tech stack used.