Building a management and observability plane for a multi-cluster Kubernetes setup.

TLDR.

The client is a large-scale on-prem and cloud network orchestration and assurance software provider. Their flagship product is deployed as a Kubernetes cluster that can scale to hundreds of nodes, and they already run more than 50 such clusters for their customers. As the client grew, they needed better control and observability across all of these clusters at once. The clusters were often deployed on-premise, where connectivity to the outside world was a challenge.

The One2N team, with hands-on experience of working directly with data centers and building end-to-end observability for Kubernetes clusters, was able to solve the problem and help the customer scale efficiently and confidently.

Business context.

The client’s platform was initially developed as a single-tenant design, which led to spinning up a new Kubernetes cluster for each customer they signed up. This worked well until the client hit the One to N growth phase. Then it became increasingly difficult to manage operations across all the clusters, and there was little visibility into the overall health of the system. The client had a strong engineering team that understood the pain and started working on upgrading their product to be a truly multi-tenant platform. However, since that was a multi-month project, they still needed to get the existing problem under control. The client had two clear requirements:

  • A control plane for managing Kubernetes clusters, both existing and new.
  • A single-pane-of-glass view for monitoring the health of each cluster as well as the overall system.

In addition to this, the client also wanted expert advice on handling Disaster Recovery (DR) scenarios and on Kubernetes tooling for achieving SOC 2 Type 2 compliance.

Outcome/Impact.

Ease of management of clusters, with the added benefit of simpler RBAC, using a Highly Available Rancher setup that supports both managed EKS and on-premise Kubernetes clusters.

Centralized observability giving insights into what’s happening inside individual tenant clusters as well as across all the clusters. 

Increased confidence and trust in the backup and recovery of production clusters in case of a disaster.

Centralized compliance policy enforcement by taking advantage of the existing Rancher setup and open source technologies.

How One2N helped.

The client’s clear requirements helped us identify the challenges to work on. We separated the cluster management and observability problems from each other and worked on them in parallel to deliver fast.

Our experts evaluated different control plane tools, and with prior experience of managing data centers, we were quickly able to rule out approaches that would not work because of blocked ingress communication. We chose the battle-tested, open source Rancher as our cluster management solution, since it works with almost all Kubernetes distributions available. This was needed because the client also had to manage Kubernetes clusters on their own in on-premise setups. Rancher offers rich features for managing K8s clusters, their nodes, and all K8s resources through an intuitive web portal.

An added benefit of Rancher is that it streamlines access to the control plane from a single place. Instead of configuring EKS access through AWS IAM only for AWS users, we can give any user access to the Rancher web portal and set role-based permissions.
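
As a rough sketch, a highly available Rancher install via the official Helm chart boils down to a small values file (the hostname below is a placeholder, and the setup assumes cert-manager is already installed in the management cluster):

```yaml
# values.yaml for the rancher Helm chart (hypothetical hostname)
# Installed with:
#   helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
#   helm install rancher rancher-latest/rancher -n cattle-system --create-namespace -f values.yaml
hostname: rancher.example.internal  # DNS name users will hit for the Rancher web portal
replicas: 3                         # run three Rancher pods for high availability
```

Downstream clusters (EKS or on-premise) are then imported from the Rancher portal; the imported cluster agent only needs outbound connectivity to Rancher, which suits restricted on-premise networks.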

For observability, we divided the problem further into collecting metrics and collecting logs.

- For logs, we proposed multiple options that could fit, and the client chose the EFK (Elasticsearch, Fluentd, Kibana) stack as they already had Elasticsearch expertise in-house.

- For metrics, Prometheus was chosen as the client already had in-cluster monitoring in place. We extended this architecture by configuring Prometheus remote write, which allowed us to collect metrics from the individual clusters into a centralized Prometheus (see the configuration sketch after this list).

- With logs and metrics flowing into a single observability setup, we built dashboards in Grafana for metrics and in Kibana for logs. This offered a single-pane-of-glass view for the client.
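
A minimal sketch of the remote write piece, assuming a central Prometheus reachable at a placeholder URL: each tenant cluster’s Prometheus tags its metrics with a cluster label and ships them to the central instance.

```yaml
# prometheus.yml fragment on each tenant cluster (URL and label values are placeholders)
global:
  external_labels:
    cluster: tenant-a            # identifies which cluster the metrics came from
remote_write:
  - url: https://central-prometheus.example.internal/api/v1/write
    # TLS, authentication, and queue tuning would be added per environment
```

The `cluster` external label is what makes per-tenant filtering and cross-cluster aggregation possible in the central Grafana dashboards.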

We also presented multiple proofs of concept for cluster backup and recovery to the client, with a strong recommendation for Velero, a widely adopted cloud-native open source backup and restore tool for Kubernetes. We demonstrated how to perform cluster recovery in case of a disaster using Velero, and created and documented a DR plan.
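
As an illustration, a recurring cluster backup with Velero can be expressed as a Schedule resource like the sketch below (name, schedule, and retention are placeholders; the DR plan also covers restores via `velero restore create --from-backup <name>`):

```yaml
# Hypothetical Velero Schedule: nightly backup of all namespaces, kept for 30 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # run every day at 02:00
  template:
    includedNamespaces:
      - "*"                      # back up every namespace in the cluster
    ttl: 720h0m0s                # retain each backup for 30 days
```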

Since the client was preparing for SOC 2 Type 2 compliance, we helped them understand what could be done at the Kubernetes cluster level. We built proofs of concept using OPA Gatekeeper and showed how they could be extended with Rancher Fleet to manage Kubernetes compliance policies from a single place.
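
A simplified sketch of how this fits together, assuming the K8sRequiredLabels ConstraintTemplate from the Gatekeeper documentation is installed and using placeholder repository details: a Gatekeeper constraint encodes the policy, and a Fleet GitRepo rolls such policies out to every downstream cluster.

```yaml
# Gatekeeper constraint: every Namespace must carry an "owner" label
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: namespaces-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
---
# Rancher Fleet GitRepo: distribute the policy repo to all managed clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: compliance-policies
  namespace: fleet-default       # targets downstream clusters managed by Rancher
spec:
  repo: https://git.example.internal/platform/compliance-policies
  branch: main
  paths:
    - gatekeeper                 # directory containing the constraint manifests
```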

Tech stack used.

Kubernetes (Amazon EKS and on-premise), Rancher, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana, Velero, OPA Gatekeeper, Rancher Fleet.
