
Migrating GitLab Runners from EKS Fargate to EKS Auto Mode: A 40% Cost Reduction Journey

How we reduced our CI/CD infrastructure costs by switching to EKS Auto Mode with Spot instances, and why Infrastructure as Code was essential for production reliability


The Problem: Fargate Costs Were Adding Up

For over a year, we ran our GitLab Runners on Amazon EKS with AWS Fargate. The serverless approach was appealing—no nodes to manage, automatic scaling, and a simple mental model. Each CI/CD job spun up as a Fargate pod, ran its tests, and disappeared.

But as our engineering team grew and pipeline frequency increased, the monthly bill told a different story:

Component                   Monthly Cost
Fargate vCPU hours          ~$115
Fargate memory hours        ~$28
EKS control plane           ~$74
NAT Gateway                 ~$77
Supporting infrastructure   ~$90
Total                       ~$385/month

The Fargate compute alone was costing us $143/month for what amounted to intermittent CI/CD workloads. Our pipelines ran maybe 4-6 hours of actual compute per day, yet we were paying premium serverless pricing for every second.

Why Fargate Becomes Expensive for CI/CD

Fargate pricing in eu-west-1:

  • $0.04048 per vCPU per hour

  • $0.004445 per GB memory per hour

For a typical CI job requesting 2 vCPU and 4GB memory running for 15 minutes:

  • Fargate cost: ~$0.034 per job

  • Equivalent spot instance (m6a.large): ~$0.007 per job

That's nearly 5x more expensive than spot instances for the same workload.
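As a back-of-envelope check, the raw request-size arithmetic looks like this. The Fargate rates are the published per-hour prices quoted above; the Spot hourly rate is an assumed, illustrative figure (Spot prices fluctuate). Real Fargate bills tend to land somewhat above the raw formula because the runner's helper container adds to the billed request and Fargate rounds to its supported vCPU/memory combinations.

```typescript
// Published Fargate rates quoted in the text (per hour)
const FARGATE_VCPU_HOUR = 0.04048;
const FARGATE_GB_HOUR = 0.004445;
// Assumption: illustrative m6a.large Spot price (per hour); not from the article
const SPOT_M6A_LARGE_HOUR = 0.028;

function fargateJobCost(vcpus: number, memoryGb: number, minutes: number): number {
  const hours = minutes / 60;
  return hours * (vcpus * FARGATE_VCPU_HOUR + memoryGb * FARGATE_GB_HOUR);
}

function spotJobCost(minutes: number): number {
  // Assumes the 2 vCPU / 4 GB job fits on one m6a.large (2 vCPU / 8 GB)
  return (minutes / 60) * SPOT_M6A_LARGE_HOUR;
}

fargateJobCost(2, 4, 15); // ~$0.025 at raw request size
spotJobCost(15);          // ~$0.007
```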


Why Run GitLab Runners on Kubernetes (EKS)?

Before diving into our solution, it's worth understanding why we chose Kubernetes in the first place. There are several ways to run GitLab Runners on AWS:

Option 1: EC2 Instances (Docker/Shell Executor)

The traditional approach—run GitLab Runner directly on EC2 instances using the Docker or Shell executor.

Pros:

  • Simple to set up and understand

  • Full control over the environment

  • Works with existing EC2 knowledge

Cons:

  • Manual scaling: You manage autoscaling groups, launch templates, lifecycle hooks

  • Resource waste: Instances run 24/7 or you build complex scaling logic

  • Docker-in-Docker issues: Security concerns with privileged containers

  • Maintenance burden: OS patching, Docker updates, runner upgrades are your responsibility

  • No bin-packing: Each instance typically runs one job at a time (or complex queue management)

Option 2: ECS with Fargate or EC2

Amazon ECS offers container orchestration without Kubernetes complexity.

Pros:

  • Native AWS integration

  • Fargate provides serverless containers

  • Simpler than Kubernetes for basic use cases

Cons:

  • No native GitLab integration: GitLab Runner doesn't have an ECS executor—you'd need custom solutions

  • Task definition management: More complex than Kubernetes pods for dynamic workloads

  • Limited ecosystem: Fewer community tools and patterns compared to Kubernetes

  • Vendor lock-in: ECS is AWS-specific; Kubernetes skills transfer across clouds

Option 3: EKS with Kubernetes Executor (Our Choice)

GitLab Runner's Kubernetes executor is purpose-built for CI/CD on Kubernetes.

Pros:

  • Native GitLab integration: First-class support in GitLab Runner

  • Automatic pod lifecycle: Each job gets a fresh pod, automatic cleanup

  • Bin-packing: Multiple jobs share nodes efficiently

  • Karpenter/Auto Mode: Intelligent, automatic node provisioning in seconds

  • Ecosystem benefits: Helm charts, operators, monitoring tools

  • Portability: Same configuration works on any Kubernetes cluster

  • Security: Pod security standards, network policies, IRSA for AWS access

Cons:

  • Kubernetes complexity: Learning curve if you're new to K8s

  • More moving parts: Nodes, pods, services vs. just EC2 instances

Why EKS Won for Us

The deciding factors were:

  1. GitLab's Kubernetes executor is mature: Handles job isolation, artifact management, and service containers natively

  2. Karpenter changes everything: No more managing autoscaling groups—Karpenter provisions exactly what you need in seconds

  3. Cost efficiency through bin-packing: Multiple CI jobs share a single node, maximizing utilization

  4. Team skills: We already run production workloads on Kubernetes

  5. Future flexibility: Easy to migrate to another cloud or on-premises if needed


Why EKS Auto Mode with Spot Over Standard EKS with Fargate?

This is the key architectural decision. Both approaches run on EKS, but they have fundamentally different characteristics.

EKS with Fargate: The Serverless Promise

Fargate abstracts away nodes entirely. You define pod specs, and AWS handles the compute.

How Fargate works for GitLab Runners:

Job triggered → GitLab Runner creates pod → Fargate provisions microVM → Job runs → Pod terminates

Fargate Advantages:

  • Zero node management

  • Per-second billing

  • Strong isolation (each pod is a separate microVM)

  • No capacity planning

  • Automatic security patching

Fargate Disadvantages for CI/CD:

  • Premium pricing: 20-40% more expensive than on-demand EC2, 5x more than Spot

  • No Spot support: Can't use Spot instances with Fargate (as of 2024)

  • Cold starts: Every job spins up a new microVM (30-60 seconds)

  • No node reuse: Can't cache Docker layers, npm packages, or Maven artifacts on disk

  • Resource limits: 4 vCPU / 30GB memory maximum per pod

  • No DaemonSets: Can't run node-level agents (though less relevant for CI/CD)

EKS Auto Mode with Spot: Managed Karpenter

Auto Mode gives you the operational simplicity of Fargate with the flexibility and cost of EC2.

How Auto Mode works for GitLab Runners:

Job triggered → GitLab Runner creates pod → Karpenter provisions Spot node (if needed) → Job runs on shared node → Pod terminates → Node consolidates when empty

Auto Mode Advantages:

  • Spot pricing: 60-70% cheaper than on-demand, 80%+ cheaper than Fargate

  • Intelligent provisioning: Karpenter selects optimal instance types in seconds

  • Node reuse: Multiple jobs share warm nodes—no cold starts

  • Diverse instance pools: Specify many instance types for Spot availability

  • Managed Karpenter: AWS handles installation, upgrades, and security patches

  • Consolidation: Automatically terminates unused nodes

Auto Mode Considerations:

  • Spot interruptions: 2-minute warning when instances are reclaimed (mitigated by diverse instance types)

  • Some node awareness needed: You configure NodePools, though it's simpler than managing node groups

  • Shared security model: Multiple pods share a node (use Pod Security Standards)

Cost Comparison: Real Numbers

For our workload (~100 CI jobs/day, average 15 minutes each):

Approach        Monthly Compute Cost   Explanation
Fargate         ~$143                  Premium per-second billing, every job is a new microVM
On-Demand EC2   ~$85                   Better bin-packing, but still paying full price
Spot EC2        ~$45                   60-70% discount, excellent bin-packing with Karpenter
Reserved        ~$55                   Requires commitment, doesn't scale to zero

Spot on Auto Mode delivered 68% compute savings compared to Fargate.

When to Choose Each

Choose Fargate when:

  • Security requires microVM isolation per job

  • Workloads are unpredictable or very low volume

  • Team lacks Kubernetes experience

  • Simplicity is worth the cost premium

Choose Auto Mode with Spot when:

  • Cost optimization is a priority

  • You run enough jobs to benefit from node reuse

  • You can tolerate occasional Spot interruptions (GitLab retries automatically)

  • Team is comfortable with basic Kubernetes concepts
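The "GitLab retries automatically" point can also be made explicit per job in `.gitlab-ci.yml`; a minimal sketch (the job name and script are illustrative):

```yaml
build:                         # illustrative job name
  script:
    - ./gradlew build
  retry:
    max: 2                     # retry up to two times
    when:
      - runner_system_failure  # covers a Spot reclamation killing the runner mid-job
```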


The Solution: EKS Auto Mode with Spot Instances

In late 2024, AWS announced EKS Auto Mode—a fully managed experience that handles node provisioning, scaling, and lifecycle management automatically. Unlike traditional EKS where you manage node groups or install Karpenter yourself, Auto Mode includes:

  • Built-in Karpenter: AWS manages the Karpenter installation and upgrades

  • Managed node classes: Pre-configured, secure node templates

  • Automatic scaling: Nodes spin up in seconds when pods are pending

  • Spot instance support: Native integration with EC2 Spot for massive cost savings

This was exactly what we needed: operational simplicity approaching Fargate's, with the cost efficiency of EC2 Spot.


Why CDK Instead of CLI Commands?

AWS published an excellent blog post titled "Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode" that walks through setting up GitLab Runners on Auto Mode using CLI commands. It's a great tutorial for getting started.

However, for production infrastructure, we chose AWS CDK (Infrastructure as Code) instead. Here's why:

1. Reproducibility

# CLI approach - hope you documented everything
eksctl create cluster --name gitlab-runners --version 1.34 ...
kubectl apply -f nodepool.yaml
helm install gitlab-runner ...

vs.

// CDK approach - the code IS the documentation
const cluster = new eks.Cluster(this, 'GitLabRunners', {
  version: eks.KubernetesVersion.V1_34,
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  // Every configuration decision is captured here
});

With CDK, our entire cluster configuration is version-controlled. Six months from now, when someone asks "why did we configure Karpenter consolidation this way?", the answer is in the Git history.

2. Multi-Environment Consistency

We run separate clusters for development, staging, and production CI/CD. CDK lets us define the infrastructure once and deploy it consistently:

// Same stack, different environments
new GitLabRunnersStack(app, 'Dev', { env: 'development' });
new GitLabRunnersStack(app, 'Prod', { env: 'production' });

With CLI commands, you're copying and pasting between environments, inevitably introducing drift.

3. Dependency Management

Our GitLab Runner stack depends on:

  • An existing VPC with specific subnets

  • IAM roles with IRSA (IAM Roles for Service Accounts)

  • S3 buckets for build cache

  • Secrets Manager for runner tokens

CDK handles these dependencies elegantly:

const cacheBucket = new s3.Bucket(this, 'RunnerCache', {
  lifecycleRules: [{ expiration: cdk.Duration.days(30) }],
});

const runnerRole = new iam.Role(this, 'RunnerRole', {
  assumedBy: new iam.FederatedPrincipal(
    cluster.openIdConnectProvider.openIdConnectProviderArn,
    // IRSA trust policy automatically configured
  ),
});

runnerRole.addToPolicy(new iam.PolicyStatement({
  actions: ['s3:GetObject', 's3:PutObject'],
  resources: [cacheBucket.arnForObjects('*')],
}));

4. Safer Updates

When we needed to change the Karpenter NodePool configuration (more on that later), CDK gave us:

  • cdk diff to preview changes before applying

  • CloudFormation rollback if something went wrong

  • A clear audit trail of what changed and when

5. Integration with Existing Infrastructure

Our CDK codebase already manages VPCs, databases, and other AWS resources. Adding the GitLab Runners stack meant it automatically inherited:

  • Consistent tagging policies

  • Security group rules

  • Monitoring and alerting configuration

  • Cost allocation tags


The Implementation

Cluster Configuration

const cluster = new eks.Cluster(this, 'GitLabRunnersCluster', {
  clusterName: `gitlab-runners-${environment}`,
  version: eks.KubernetesVersion.V1_34,
  kubectlLayer: new KubectlV34Layer(this, 'KubectlLayer'),
  
  // This is the magic - Auto Mode handles everything
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  
  // Use existing VPC
  vpc: ec2.Vpc.fromLookup(this, 'Vpc', { vpcId: config.vpcId }),
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
  
  // Enable cluster logging for debugging
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUDIT,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
});

NodePool for Spot Instances

The key to cost savings is the Karpenter NodePool configuration:

const spotNodePool = new eks.KubernetesManifest(this, 'SpotNodePool', {
  cluster,
  manifest: [{
    apiVersion: 'karpenter.sh/v1',
    kind: 'NodePool',
    metadata: { name: 'gitlab-spot' },
    spec: {
      template: {
        metadata: {
          labels: { 'node-type': 'spot' }
        },
        spec: {
          nodeClassRef: {
            group: 'eks.amazonaws.com',
            kind: 'NodeClass',
            name: 'default'  // Use Auto Mode's managed node class
          },
          requirements: [
            {
              key: 'node.kubernetes.io/instance-type',
              operator: 'In',
              values: [
                // Diverse instance types for spot availability
                'm6a.large', 'm6a.xlarge',
                'm7a.large', 'm7a.xlarge',
                'c6a.large', 'c6a.xlarge',
                'c7a.large', 'c7a.xlarge',
              ]
            },
            {
              key: 'karpenter.sh/capacity-type',
              operator: 'In',
              values: ['spot']  // Spot instances only
            },
          ],
        }
      },
      disruption: {
        consolidationPolicy: 'WhenEmpty',
        consolidateAfter: '5m',
        budgets: [{ nodes: '100%' }]  // Allow consolidation of empty nodes
      }
    }
  }]
});

GitLab Runner Helm Chart

const gitlabRunner = cluster.addHelmChart('GitLabRunner', {
  chart: 'gitlab-runner',
  repository: 'https://charts.gitlab.io',
  namespace: 'gitlab',
  values: {
    gitlabUrl: 'https://gitlab.com',
    runnerToken: runnerToken.secretValue.unsafeUnwrap(), // caution: renders the token into the synthesized template
    concurrent: 20,
    runners: {
      config: `
        [[runners]]
          executor = "kubernetes"
          [runners.kubernetes]
            namespace = "gitlab"
            cpu_request = "2"
            memory_request = "4Gi"
            [runners.kubernetes.node_selector]
              node-type = "spot"
            [runners.kubernetes.pod_annotations]
              karpenter.sh/do-not-disrupt = "true"
      `
    }
  }
});

The Results

After migrating and decommissioning the old Fargate cluster:

Metric            Before (Fargate)   After (Auto Mode)   Change
Monthly compute   $143               ~$45                -68%
Total monthly     $385               ~$230               -40%
Annual savings    -                  ~$1,860             -
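The headline percentages follow directly from the table:

```typescript
// Sanity-check of the figures in the table above (all USD)
const beforeTotal = 385;   // monthly total on Fargate
const afterTotal = 230;    // monthly total on Auto Mode
const beforeCompute = 143; // monthly compute on Fargate
const afterCompute = 45;   // monthly compute on Spot

const annualSavings = (beforeTotal - afterTotal) * 12;                   // 1860
const totalReduction = (beforeTotal - afterTotal) / beforeTotal;         // ~0.40 -> the -40% headline
const computeReduction = (beforeCompute - afterCompute) / beforeCompute; // ~0.685 -> the -68% figure
```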

The dramatic compute savings come from:

  1. Spot pricing: 60-70% cheaper than on-demand, and 80%+ cheaper than Fargate

  2. Efficient bin-packing: Multiple CI jobs share the same node

  3. Right-sized instances: Karpenter picks the optimal instance type for pending pods


Lessons Learned (The Hard Way)

Lesson 1: Understanding Karpenter Disruption Budgets

Our first deployment used settings designed to prevent job disruption:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m
  budgets:
    - nodes: '0'  # WRONG - this blocks ALL consolidation!

We thought nodes: '0' meant "don't evict nodes with running pods." What it actually means is "you can disrupt ZERO nodes at any time"—which completely blocks consolidation, even for empty nodes.

The symptom: After CI jobs completed, nodes with only DaemonSet pods (CloudWatch agent, GuardDuty agent) would never terminate. We ended up with 14 orphaned nodes running indefinitely, burning money.

The fix: Use nodes: '100%' to allow Karpenter to consolidate empty nodes:

disruption:
  consolidationPolicy: WhenEmpty      # Only consolidate when NO workload pods
  consolidateAfter: 5m                # Wait 5 minutes after becoming empty
  budgets:
    - nodes: '100%'                   # Allow consolidation of all empty nodes

How job protection actually works:

  • WhenEmpty policy: Karpenter ignores DaemonSet pods when determining if a node is "empty"—a node with only CloudWatch/GuardDuty agents IS considered empty

  • do-not-disrupt annotation: Pods with this annotation prevent their node from being consolidated

  • Combined effect: Running CI jobs are protected by the annotation, but nodes scale down within 5 minutes of jobs completing
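Those rules can be sketched as a toy predicate. This is a simplified model of the behavior described above, not Karpenter's actual implementation, and the `Pod` type here is made up for illustration:

```typescript
interface Pod {
  ownerKind: string; // e.g. "DaemonSet" for node agents, "Job" for CI pods
  annotations?: Record<string, string>;
}

// A node is "empty" under WhenEmpty when every remaining pod is DaemonSet-owned;
// any pod carrying the do-not-disrupt annotation blocks consolidation.
function isConsolidatable(pods: Pod[]): boolean {
  const blocked = pods.some(
    (p) => p.annotations?.["karpenter.sh/do-not-disrupt"] === "true"
  );
  const empty = pods.every((p) => p.ownerKind === "DaemonSet");
  return empty && !blocked;
}

// Node running only the CloudWatch/GuardDuty agents: eligible for consolidation
isConsolidatable([{ ownerKind: "DaemonSet" }, { ownerKind: "DaemonSet" }]); // true

// Node still running an annotated CI job pod: protected
isConsolidatable([
  { ownerKind: "DaemonSet" },
  { ownerKind: "Job", annotations: { "karpenter.sh/do-not-disrupt": "true" } },
]); // false
```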

Lesson 2: Add the do-not-disrupt Annotation

For explicit job protection, annotate your job pods:

[runners.kubernetes.pod_annotations]
  karpenter.sh/do-not-disrupt = "true"

This tells Karpenter: "Never consolidate a node while this pod is running." When the CI job completes and the pod terminates, the annotation goes with it, and the node becomes eligible for consolidation.

Lesson 3: Avoid Burstable Instances for CI/CD Job Pods

We initially included t3/t3a instances in our NodePool for CI job pods. Bad idea.

CI/CD workloads (especially Maven/Gradle builds) are CPU-intensive and will exhaust the burst credits quickly. Once credits are gone, you're throttled to 20% baseline CPU, and your 10-minute build becomes a 50-minute build.

Use m-series or c-series instances that provide consistent CPU performance for job pods:

requirements:
  - key: 'node.kubernetes.io/instance-type'
    operator: In
    values:
      # Good: Fixed-performance instances for CI job pods
      - 'm6a.large'
      - 'm6a.xlarge'
      - 'c6a.large'
      - 'c6a.xlarge'
      # Bad: Burstable instances for CI jobs (avoid these)
      # - 't3.large'
      # - 't3a.xlarge'  # DON'T USE for job pods

Note: This advice applies to the Spot NodePool where CI jobs run. We do use a t3a.medium for the GitLab Runner manager itself—the lightweight, always-on process that coordinates jobs. The runner manager barely uses any CPU (it's just polling GitLab for jobs and creating pods), so it stays well within its burst credits. In fact, we downsized the runner manager from a c7a.medium to a t3a.medium during the migration, saving an additional ~$40/month on that single instance alone. The key distinction: burstable is fine for control plane workloads, not for compute-heavy CI jobs.

Lesson 4: Node Labels Must Match Job Selectors

The Auto Mode NodePool must include labels that match your job pod's nodeSelector:

# NodePool template
metadata:
  labels:
    node-type: spot  # Must match...

# GitLab Runner config
[runners.kubernetes.node_selector]
  node-type = "spot"  # ...this selector

If these don't match, Karpenter won't provision nodes for your jobs—pods will stay pending forever.
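The matching rule itself is plain equality on every key, which a few lines make concrete (a toy re-implementation for illustration, not the scheduler's code):

```typescript
// A nodeSelector is satisfied only if every key/value pair
// appears exactly in the node's labels.
function selectorMatches(
  nodeLabels: Record<string, string>,
  nodeSelector: Record<string, string>
): boolean {
  return Object.entries(nodeSelector).every(([key, value]) => nodeLabels[key] === value);
}

selectorMatches({ "node-type": "spot" }, { "node-type": "spot" });      // true
selectorMatches({ "node-type": "on-demand" }, { "node-type": "spot" }); // false -> pod stays Pending
```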

Lesson 5: Diverse Instance Types for Spot Availability

Don't just specify one or two instance types. Spot capacity varies by instance type and availability zone. More options = higher chance of getting capacity:

values: [
  'm6a.large', 'm6a.xlarge', 'm6a.2xlarge',
  'm7a.large', 'm7a.xlarge', 'm7a.2xlarge',
  'c6a.large', 'c6a.xlarge', 'c6a.2xlarge',
  'c7a.large', 'c7a.xlarge', 'c7a.2xlarge'
]

Karpenter will automatically select from available capacity at the best price.


Drawbacks and Considerations

No solution is perfect. Here are the trade-offs:

1. Spot Interruptions

Spot instances can be reclaimed with 2 minutes notice. For CI/CD:

  • Mitigated by: Diverse instance types (Karpenter will find alternatives)

  • Mitigated by: Most CI jobs are under 30 minutes

  • Mitigated by: GitLab automatically retries failed jobs

  • Reality: We've seen <5% interruption rate in eu-west-1

2. Cold Start Latency

New nodes take 45-90 seconds to provision vs. 30-60 seconds for Fargate pods. In practice, this is negligible because:

  • Karpenter keeps nodes running for consolidateAfter duration

  • Subsequent jobs reuse warm nodes

  • Only the first job after idle periods sees the delay

3. Increased Complexity

Auto Mode is simpler than self-managed Karpenter, but still more complex than Fargate:

  • You need to understand NodePools and disruption settings

  • Debugging requires node-level visibility occasionally

  • More configuration knobs to tune

4. Cost Visibility

With Fargate, costs are per-pod and easy to attribute. With shared nodes, cost allocation becomes fuzzier. AWS Cost Explorer shows EC2 costs, but not which CI jobs caused them.

5. Shared Node Security

Multiple CI jobs share the same node. If you run untrusted code:

  • Use Pod Security Standards (restricted profile)

  • Consider separate NodePools for different trust levels

  • Or stick with Fargate for strict isolation
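The restricted profile from the first bullet is enforced with namespace labels; a minimal sketch, assuming jobs run in the `gitlab` namespace used earlier:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gitlab
  labels:
    # Pod Security Standards: reject privileged pods in the CI namespace
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```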


When to Stick with Fargate

Despite our migration, Fargate is still the right choice for:

  • Low-volume CI/CD: If you run <10 pipelines/day, Fargate's simplicity wins

  • Strict isolation requirements: Each Fargate pod is a separate microVM

  • Unpredictable workloads: Fargate scales to zero perfectly

  • Teams without Kubernetes expertise: Fargate abstracts more complexity

  • Running untrusted code: MicroVM isolation is stronger than pod isolation


Conclusion

Migrating from EKS Fargate to EKS Auto Mode with Spot instances reduced our CI/CD infrastructure costs by 40%. The key enablers were:

  1. EKS Auto Mode: Managed Karpenter without the operational burden

  2. Spot instances: 70% cheaper than on-demand, 80%+ cheaper than Fargate

  3. CDK Infrastructure as Code: Reproducible, version-controlled, and safe to update

The migration wasn't without challenges—understanding Karpenter's disruption budgets took some debugging. But with the right configuration, we now have a cost-effective, reliable CI/CD infrastructure that scales automatically.

Key takeaways:

  • Use WhenEmpty with budgets: [{ nodes: '100%' }] for proper consolidation

  • Add do-not-disrupt annotations to protect running jobs

  • Avoid burstable instances (t3/t3a) for CPU-intensive CI workloads

  • Specify diverse instance types for Spot availability

If your Fargate bill is growing and you're comfortable with Kubernetes, EKS Auto Mode is worth serious consideration. The 40% savings we achieved compound quickly, and the operational overhead is minimal thanks to AWS managing the hard parts.


Have questions about this migration? Found a better approach? I'd love to hear from you.


Resources