Cost Effectively Scaling Your EKS Cluster with Karpenter
Gabriel Yahav
September 12, 2024

When it comes to running production Kubernetes workloads, scalable infrastructure isn't just a nice-to-have; it's a necessity for any company that's serious about providing a cloud-native SaaS offering. While it's possible to manually scale up your Kubernetes worker nodes in those early days of light traffic, let's be honest: that's a Band-Aid on a bullet wound once real-world, high-demand traffic arrives.

At PointFive, we embarked on our EKS journey with high hopes and a single Managed Node Group, thinking an Amazon Linux 2 AMI node could gracefully juggle both our application pods and Kubernetes add-ons. But, as our pod population exploded and their demands for memory and CPU grew louder, it didn’t take long for us to realize that our data plane needed a serious upgrade — or we’d risk being outpaced by our own success!

We had to figure out the smartest way to scale our Kubernetes cluster. Here’s what we took into account:

  • Scaling with Simplicity: We needed to ensure that our Kubernetes worker nodes could scale efficiently to the best instances for our workloads, without getting bogged down in the nitty-gritty of specific instance classes and types.
  • FinOps Focus: As a startup with a keen eye on FinOps, our mission is to run our workloads on the most cost-effective instances out there, maximizing savings while keeping performance in check.

We found ourselves weighing the options between Kubernetes Cluster Autoscaler and Karpenter for horizontal node scaling. Here are the features we considered when comparing our options.

Karpenter vs. Kubernetes Cluster Autoscaler


| Feature/Capability | Karpenter | Kubernetes Cluster Autoscaler |
| --- | --- | --- |
| Node Provisioning & Instance Selection | Dynamically provisions nodes based on workload requirements, selecting the most cost-effective and performance-appropriate instance types and sizes from a wide range of options using real-time data and capacity requirements. | Scales up by adding nodes from pre-defined node groups (e.g., ASGs in AWS). |
| Startup Time | Faster provisioning of new nodes; optimized to minimize instance launch time because it doesn't rely on traditional node groups. | Slower, as it relies on scaling existing node groups, which might have fixed sizes and configurations. |
| Mixed Instances | Supports mixed instance types within the same provisioning action, optimizing cost and performance. | Limited to predefined node group types; requires multiple node groups for diversity. |
| Node Consolidation | Actively consolidates workloads by terminating underutilized nodes and rescheduling pods to optimize resource use. | Basic functionality through scheduled scaling down of underutilized nodes. |
| Cost Optimization | Automatically selects the lowest-cost instances (including Spot) based on real-time availability and workload requirements. | Limited to the cost structure of predefined node groups; Spot instances require manual configuration. |

While the Kubernetes Cluster Autoscaler is the more established option, its built-in restrictions ultimately steered us toward Karpenter.

Launched by our partner AWS in December 2021, Karpenter is an open-source Kubernetes cluster autoscaler that breaks free from the limitations of its predecessor. It dynamically provisions the ideal compute resources based on workload demands, offering flexibility by selecting from a wide array of EC2 instance types, including budget-friendly Spot instances.
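To make the Spot flexibility concrete, here's a minimal, hypothetical NodePool sketch; the name and values are placeholders for illustration rather than anything from our setup (we run On-Demand, as you'll see below). When a NodePool allows both capacity types, Karpenter generally favors Spot where capacity is available and falls back to On-Demand otherwise:

```yaml
# Hypothetical sketch: a NodePool that allows both Spot and On-Demand.
# "spot-flexible" and the nodeClassRef target are placeholder names.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-flexible
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main            # assumes an EC2NodeClass like the one shown later
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot                # chosen when Spot capacity exists
        - on-demand           # fallback when it doesn't
```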

How We Rolled Out Karpenter

Once we made our decision, we needed to execute. We approached the rollout in two ways, tailored to our needs.

  • Dedicated Scheduling: We continued to schedule our Kubernetes add-on pods on the EKS Managed Node Group, while leaving the scheduling of our application workloads entirely in Karpenter's hands (a sketch of how add-ons stay pinned follows the configuration below).
  • Streamlined Karpenter Setup: Our Karpenter configuration features a single primary NodePool. This NodePool serves as a blueprint for generating nodes for unscheduled pods, defining how Karpenter should dynamically provision them. It specifies criteria such as instance types, capacity types, and other configurations, ensuring unscheduled pods land on the most suitable nodes efficiently. Here is the NodePool, along with the EC2NodeClass it references:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: main
spec:
  disruption:
    # Reclaim nodes that are empty or replaceable by cheaper capacity
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  limits:
    # Hard cap on the total resources this NodePool may provision
    cpu: 1k
    memory: 1000Gi
  template:
    metadata: {}
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main
      # Drain and replace nodes after 30 days so they cycle onto fresh AMIs
      expireAfter: 720h
      requirements:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - us-east-1a
        - us-east-1b
        - us-east-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      # Broad criteria: any compute-, general-, or memory-optimized type
      # from generation 3 onward, leaving Karpenter room to pick the
      # cheapest fit
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: main
spec:
  amiFamily: AL2
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        encrypted: true
        volumeSize: 75Gi
        volumeType: gp3
  role: ${karpenter_node_iam_role_name}
  securityGroupSelectorTerms:
    - id: "${cluster_security_group_id}"
  subnetSelectorTerms:
    - tags:
        Type: Private
  amiSelectorTerms:
    - alias: al2@latest
```
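As for the dedicated-scheduling point above, keeping add-ons on the Managed Node Group needs nothing Karpenter-specific. One straightforward approach is a nodeSelector on each add-on workload, sketched below with an assumed node group name; EKS applies the eks.amazonaws.com/nodegroup label to managed nodes automatically:

```yaml
# A minimal sketch, assuming a Managed Node Group named "system-addons"
# and a placeholder add-on workload. The nodeSelector pins the pod to the
# managed group; pods without it remain schedulable on Karpenter nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-addon
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-addon
  template:
    metadata:
      labels:
        app: example-addon
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: system-addons
      containers:
      - name: addon
        image: public.ecr.aws/docker/library/busybox:stable  # placeholder image
        command: ["sleep", "infinity"]
```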

Key Considerations for Using Karpenter

While Karpenter made the most sense for our setup, we still needed to think through a few things to get the most out of it. Our team put together a list of key considerations to keep in mind as we launched Karpenter and continued to run it.

  1. Minimize NodePools: Typically, a single NodePool can serve multiple teams effectively. Extra NodePools are only necessary if you need to isolate nodes for billing, apply specific constraints (like excluding GPUs for certain teams or using different CPU architectures), or set distinct disruption policies; none of these applied in our case.

  2. Set Node Expiration: Karpenter recommends configuring expireAfter on a NodePool. This setting ensures that nodes are drained and replaced once they reach their expiration (720h, or 30 days, in our NodePool above), boosting security by regularly cycling nodes onto updated AMIs (Amazon Machine Images) that carry the latest security patches.

  3. Use a Broad Instance Type & Size Range: Keep the NodePool’s instance type and size requirements broad. This gives Karpenter the flexibility to choose the most cost-effective and available instances that align with your workload’s demands.

  4. Enable WhenEmptyOrUnderutilized Consolidation: Activate this mode to allow Karpenter to lower cluster costs by identifying and disrupting nodes that are “empty” (no pods running on them) or underutilized (when cheaper alternatives are available).
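If aggressive consolidation makes you nervous, Karpenter's disruption budgets can cap how many nodes it may voluntarily disrupt at once. Here's a minimal sketch, with an illustrative percentage rather than a value from our setup:

```yaml
# Fragment to merge into the NodePool spec shown earlier, not a
# standalone manifest. "10%" is an illustrative value; tune it to your
# tolerance for pod churn.
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
    - nodes: "10%"   # at most 10% of the pool's nodes disrupted at a time
```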

Embracing the Future: Staying Ahead with Karpenter’s Evolution

As we wrap up our exploration of scaling EKS clusters cost-effectively with Karpenter, remember: This journey never truly ends. Karpenter isn’t just a tool; it’s your backstage pass to a leaner, meaner, and more efficient Kubernetes environment. And a more efficient Kubernetes environment enables better cloud cost optimization for your entire infrastructure.

So, dive in, play around with the configurations, and let Karpenter handle the heavy lifting. With the right approach, you'll scale up without scaling out of control. Keep experimenting, stay agile, and let your cluster shine, all while staying tuned to Karpenter’s continuous evolution with new versions and features.

Interested in learning more about cloud cost optimization solutions?

Check out our blog for articles from our team.
