
Migrating GitLab Runners from EKS Fargate to EKS Auto Mode: A 40% Cost Reduction Journey

How we reduced our CI/CD infrastructure costs by switching to EKS Auto Mode with Spot instances, and why Infrastructure as Code was essential for production reliability


The Problem: Fargate Costs Were Adding Up

For over a year, we ran our GitLab Runners on Amazon EKS with AWS Fargate. The serverless approach was appealing—no nodes to manage, automatic scaling, and a simple mental model. Each CI/CD job spun up as a Fargate pod, ran its tests, and disappeared.

But as our engineering team grew and pipeline frequency increased, the monthly bill told a different story:

Component                   Monthly Cost
Fargate vCPU hours          ~$115
Fargate memory hours        ~$28
EKS control plane           ~$74
NAT Gateway                 ~$77
Supporting infrastructure   ~$90
Total                       ~$385/month

The Fargate compute alone was costing us $143/month for what amounted to intermittent CI/CD workloads. Our pipelines ran maybe 4-6 hours of actual compute per day, yet we were paying premium serverless pricing for every second.

Why Fargate Becomes Expensive for CI/CD

Fargate pricing in eu-west-1:

  • $0.04048 per vCPU per hour

  • $0.004445 per GB memory per hour

For a typical CI job requesting 2 vCPU and 4GB memory running for 15 minutes:

  • Fargate cost: ~$0.034 per job

  • Equivalent spot instance (m6a.large): ~$0.007 per job

That's nearly 5x more expensive than spot instances for the same workload.
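As a back-of-envelope check, the raw request-size arithmetic looks like this. The Fargate rates are the published per-hour prices quoted above; the Spot hourly rate is an assumed, illustrative figure (Spot prices fluctuate). Real Fargate bills tend to land somewhat above the raw formula because the runner's helper container adds to the billed request and Fargate rounds to its supported vCPU/memory combinations.

```typescript
// Published Fargate rates quoted in the text (per hour)
const FARGATE_VCPU_HOUR = 0.04048;
const FARGATE_GB_HOUR = 0.004445;
// Assumption: illustrative m6a.large Spot price (per hour); not from the article
const SPOT_M6A_LARGE_HOUR = 0.028;

function fargateJobCost(vcpus: number, memoryGb: number, minutes: number): number {
  const hours = minutes / 60;
  return hours * (vcpus * FARGATE_VCPU_HOUR + memoryGb * FARGATE_GB_HOUR);
}

function spotJobCost(minutes: number): number {
  // Assumes the 2 vCPU / 4 GB job fits on one m6a.large (2 vCPU / 8 GB)
  return (minutes / 60) * SPOT_M6A_LARGE_HOUR;
}

fargateJobCost(2, 4, 15); // ~$0.025 at raw request size
spotJobCost(15);          // ~$0.007
```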


Why Run GitLab Runners on Kubernetes (EKS)?

Before diving into our solution, it's worth understanding why we chose Kubernetes in the first place. There are several ways to run GitLab Runners on AWS:

Option 1: EC2 Instances (Docker/Shell Executor)

The traditional approach—run GitLab Runner directly on EC2 instances using the Docker or Shell executor.

Pros:

  • Simple to set up and understand

  • Full control over the environment

  • Works with existing EC2 knowledge

Cons:

  • Manual scaling: You manage autoscaling groups, launch templates, lifecycle hooks

  • Resource waste: Instances run 24/7 or you build complex scaling logic

  • Docker-in-Docker issues: Security concerns with privileged containers

  • Maintenance burden: OS patching, Docker updates, runner upgrades are your responsibility

  • No bin-packing: Each instance typically runs one job at a time (or complex queue management)

Option 2: ECS with Fargate or EC2

Amazon ECS offers container orchestration without Kubernetes complexity.

Pros:

  • Native AWS integration

  • Fargate provides serverless containers

  • Simpler than Kubernetes for basic use cases

Cons:

  • No native GitLab integration: GitLab Runner doesn't have an ECS executor—you'd need custom solutions

  • Task definition management: More complex than Kubernetes pods for dynamic workloads

  • Limited ecosystem: Fewer community tools and patterns compared to Kubernetes

  • Vendor lock-in: ECS is AWS-specific; Kubernetes skills transfer across clouds

Option 3: EKS with Kubernetes Executor (Our Choice)

GitLab Runner's Kubernetes executor is purpose-built for CI/CD on Kubernetes.

Pros:

  • Native GitLab integration: First-class support in GitLab Runner

  • Automatic pod lifecycle: Each job gets a fresh pod, automatic cleanup

  • Bin-packing: Multiple jobs share nodes efficiently

  • Karpenter/Auto Mode: Intelligent, automatic node provisioning in seconds

  • Ecosystem benefits: Helm charts, operators, monitoring tools

  • Portability: Same configuration works on any Kubernetes cluster

  • Security: Pod security standards, network policies, IRSA for AWS access

Cons:

  • Kubernetes complexity: Learning curve if you're new to K8s

  • More moving parts: Nodes, pods, services vs. just EC2 instances

Why EKS Won for Us

The deciding factors were:

  1. GitLab's Kubernetes executor is mature: Handles job isolation, artifact management, and service containers natively

  2. Karpenter changes everything: No more managing autoscaling groups—Karpenter provisions exactly what you need in seconds

  3. Cost efficiency through bin-packing: Multiple CI jobs share a single node, maximizing utilization

  4. Team skills: We already run production workloads on Kubernetes

  5. Future flexibility: Easy to migrate to another cloud or on-premises if needed


Why EKS Auto Mode with Spot Over Standard EKS with Fargate?

This is the key architectural decision. Both approaches run on EKS, but they have fundamentally different characteristics.

EKS with Fargate: The Serverless Promise

Fargate abstracts away nodes entirely. You define pod specs, and AWS handles the compute.

How Fargate works for GitLab Runners:

Job triggered → GitLab Runner creates pod → Fargate provisions microVM → Job runs → Pod terminates

Fargate Advantages:

  • Zero node management

  • Per-second billing

  • Strong isolation (each pod is a separate microVM)

  • No capacity planning

  • Automatic security patching

Fargate Disadvantages for CI/CD:

  • Premium pricing: 20-40% more expensive than on-demand EC2, 5x more than Spot

  • No Spot support: Can't use Spot instances with Fargate (as of 2024)

  • Cold starts: Every job spins up a new microVM (30-60 seconds)

  • No node reuse: Can't cache Docker layers, npm packages, or Maven artifacts on disk

  • Resource limits: 4 vCPU / 30GB memory maximum per pod

  • No DaemonSets: Can't run node-level agents (though less relevant for CI/CD)

EKS Auto Mode with Spot: Managed Karpenter

Auto Mode gives you the operational simplicity of Fargate with the flexibility and cost of EC2.

How Auto Mode works for GitLab Runners:

Job triggered → GitLab Runner creates pod → Karpenter provisions Spot node (if needed) → Job runs on shared node → Pod terminates → Node consolidates when empty

Auto Mode Advantages:

  • Spot pricing: 60-70% cheaper than on-demand, 80%+ cheaper than Fargate

  • Intelligent provisioning: Karpenter selects optimal instance types in seconds

  • Node reuse: Multiple jobs share warm nodes—no cold starts

  • Diverse instance pools: Specify many instance types for Spot availability

  • Managed Karpenter: AWS handles installation, upgrades, and security patches

  • Consolidation: Automatically terminates unused nodes

Auto Mode Considerations:

  • Spot interruptions: 2-minute warning when instances are reclaimed (mitigated by diverse instance types)

  • Some node awareness needed: You configure NodePools, though it's simpler than managing node groups

  • Shared security model: Multiple pods share a node (use Pod Security Standards)

Cost Comparison: Real Numbers

For our workload (~100 CI jobs/day, average 15 minutes each):

Approach        Monthly Compute Cost   Explanation
Fargate         ~$143                  Premium per-second billing, every job is a new microVM
On-Demand EC2   ~$85                   Better bin-packing, but still paying full price
Spot EC2        ~$45                   60-70% discount, excellent bin-packing with Karpenter
Reserved        ~$55                   Requires commitment, doesn't scale to zero

Spot on Auto Mode delivered 68% compute savings compared to Fargate.

When to Choose Each

Choose Fargate when:

  • Security requires microVM isolation per job

  • Workloads are unpredictable or very low volume

  • Team lacks Kubernetes experience

  • Simplicity is worth the cost premium

Choose Auto Mode with Spot when:

  • Cost optimization is a priority

  • You run enough jobs to benefit from node reuse

  • You can tolerate occasional Spot interruptions (GitLab retries automatically)

  • Team is comfortable with basic Kubernetes concepts
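The "GitLab retries automatically" point can also be made explicit per job in `.gitlab-ci.yml`; a minimal sketch (the job name and script are illustrative):

```yaml
build:                         # illustrative job name
  script:
    - ./gradlew build
  retry:
    max: 2                     # retry up to two times
    when:
      - runner_system_failure  # covers a Spot reclamation killing the runner mid-job
```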


The Solution: EKS Auto Mode with Spot Instances

In late 2024, AWS announced EKS Auto Mode—a fully managed experience that handles node provisioning, scaling, and lifecycle management automatically. Unlike traditional EKS where you manage node groups or install Karpenter yourself, Auto Mode includes:

  • Built-in Karpenter: AWS manages the Karpenter installation and upgrades

  • Managed node classes: Pre-configured, secure node templates

  • Automatic scaling: Nodes spin up in seconds when pods are pending

  • Spot instance support: Native integration with EC2 Spot for massive cost savings

This was exactly what we needed: operational simplicity approaching Fargate's, with the cost efficiency of EC2 Spot.


Why CDK Instead of CLI Commands?

AWS published an excellent blog post titled "Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode" that walks through setting up GitLab Runners on Auto Mode using CLI commands. It's a great tutorial for getting started.

However, for production infrastructure, we chose AWS CDK (Infrastructure as Code) instead. Here's why:

1. Reproducibility

# CLI approach - hope you documented everything
eksctl create cluster --name gitlab-runners --version 1.34 ...
kubectl apply -f nodepool.yaml
helm install gitlab-runner ...

vs.

// CDK approach - the code IS the documentation
const cluster = new eks.Cluster(this, 'GitLabRunners', {
  version: eks.KubernetesVersion.V1_34,
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  // Every configuration decision is captured here
});

With CDK, our entire cluster configuration is version-controlled. Six months from now, when someone asks "why did we configure Karpenter consolidation this way?", the answer is in the Git history.

2. Multi-Environment Consistency

We run separate clusters for development, staging, and production CI/CD. CDK lets us define the infrastructure once and deploy it consistently:

// Same stack, different environments
new GitLabRunnersStack(app, 'Dev', { env: 'development' });
new GitLabRunnersStack(app, 'Prod', { env: 'production' });

With CLI commands, you're copying and pasting between environments, inevitably introducing drift.

3. Dependency Management

Our GitLab Runner stack depends on:

  • An existing VPC with specific subnets

  • IAM roles with IRSA (IAM Roles for Service Accounts)

  • S3 buckets for build cache

  • Secrets Manager for runner tokens

CDK handles these dependencies elegantly:

const cacheBucket = new s3.Bucket(this, 'RunnerCache', {
  lifecycleRules: [{ expiration: cdk.Duration.days(30) }],
});

const runnerRole = new iam.Role(this, 'RunnerRole', {
  assumedBy: new iam.FederatedPrincipal(
    cluster.openIdConnectProvider.openIdConnectProviderArn,
    // IRSA trust policy automatically configured
  ),
});

runnerRole.addToPolicy(new iam.PolicyStatement({
  actions: ['s3:GetObject', 's3:PutObject'],
  resources: [cacheBucket.arnForObjects('*')],
}));

4. Safer Updates

When we needed to change the Karpenter NodePool configuration (more on that later), CDK gave us:

  • cdk diff to preview changes before applying

  • CloudFormation rollback if something went wrong

  • A clear audit trail of what changed and when

5. Integration with Existing Infrastructure

Our CDK codebase already manages VPCs, databases, and other AWS resources. Adding the GitLab Runners stack meant it automatically inherited:

  • Consistent tagging policies

  • Security group rules

  • Monitoring and alerting configuration

  • Cost allocation tags


The Implementation

Cluster Configuration

const cluster = new eks.Cluster(this, 'GitLabRunnersCluster', {
  clusterName: `gitlab-runners-${environment}`,
  version: eks.KubernetesVersion.V1_34,
  kubectlLayer: new KubectlV34Layer(this, 'KubectlLayer'),
  
  // This is the magic - Auto Mode handles everything
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  
  // Use existing VPC
  vpc: ec2.Vpc.fromLookup(this, 'Vpc', { vpcId: config.vpcId }),
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
  
  // Enable cluster logging for debugging
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUDIT,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
});

NodePool for Spot Instances

The key to cost savings is the Karpenter NodePool configuration:

const spotNodePool = new eks.KubernetesManifest(this, 'SpotNodePool', {
  cluster,
  manifest: [{
    apiVersion: 'karpenter.sh/v1',
    kind: 'NodePool',
    metadata: { name: 'gitlab-spot' },
    spec: {
      template: {
        metadata: {
          labels: { 'node-type': 'spot' }
        },
        spec: {
          nodeClassRef: {
            group: 'eks.amazonaws.com',
            kind: 'NodeClass',
            name: 'default'  // Use Auto Mode's managed node class
          },
          requirements: [
            {
              key: 'node.kubernetes.io/instance-type',
              operator: 'In',
              values: [
                // Diverse instance types for spot availability
                'm6a.large', 'm6a.xlarge',
                'm7a.large', 'm7a.xlarge',
                'c6a.large', 'c6a.xlarge',
                'c7a.large', 'c7a.xlarge',
              ]
            },
            {
              key: 'karpenter.sh/capacity-type',
              operator: 'In',
              values: ['spot']  // Spot instances only
            },
          ],
        }
      },
      disruption: {
        consolidationPolicy: 'WhenEmpty',
        consolidateAfter: '5m',
        budgets: [{ nodes: '100%' }]  // Allow consolidation of empty nodes
      }
    }
  }]
});

GitLab Runner Helm Chart

const gitlabRunner = cluster.addHelmChart('GitLabRunner', {
  chart: 'gitlab-runner',
  repository: 'https://charts.gitlab.io',
  namespace: 'gitlab',
  values: {
    gitlabUrl: 'https://gitlab.com',
    runnerToken: runnerToken.secretValue.unsafeUnwrap(), // caution: renders the token into the synthesized template
    concurrent: 20,
    runners: {
      config: `
        [[runners]]
          executor = "kubernetes"
          [runners.kubernetes]
            namespace = "gitlab"
            cpu_request = "2"
            memory_request = "4Gi"
            [runners.kubernetes.node_selector]
              node-type = "spot"
            [runners.kubernetes.pod_annotations]
              karpenter.sh/do-not-disrupt = "true"
      `
    }
  }
});

The Results

After migrating and decommissioning the old Fargate cluster:

Metric            Before (Fargate)   After (Auto Mode)   Change
Monthly compute   $143               ~$45                -68%
Total monthly     $385               ~$230               -40%
Annual savings    -                  ~$1,860             -
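The headline percentages follow directly from the table:

```typescript
// Sanity-check of the figures in the table above (all USD)
const beforeTotal = 385;   // monthly total on Fargate
const afterTotal = 230;    // monthly total on Auto Mode
const beforeCompute = 143; // monthly compute on Fargate
const afterCompute = 45;   // monthly compute on Spot

const annualSavings = (beforeTotal - afterTotal) * 12;                   // 1860
const totalReduction = (beforeTotal - afterTotal) / beforeTotal;         // ~0.40 -> the -40% headline
const computeReduction = (beforeCompute - afterCompute) / beforeCompute; // ~0.685 -> the -68% figure
```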

The dramatic compute savings come from:

  1. Spot pricing: 60-70% cheaper than on-demand, and 80%+ cheaper than Fargate

  2. Efficient bin-packing: Multiple CI jobs share the same node

  3. Right-sized instances: Karpenter picks the optimal instance type for pending pods


Lessons Learned (The Hard Way)

Lesson 1: Understanding Karpenter Disruption Budgets

Our first deployment used settings designed to prevent job disruption:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m
  budgets:
    - nodes: '0'  # WRONG - this blocks ALL consolidation!

We thought nodes: '0' meant "don't evict nodes with running pods." What it actually means is "you can disrupt ZERO nodes at any time"—which completely blocks consolidation, even for empty nodes.

The symptom: After CI jobs completed, nodes with only DaemonSet pods (CloudWatch agent, GuardDuty agent) would never terminate. We ended up with 14 orphaned nodes running indefinitely, burning money.

The fix: Use nodes: '100%' to allow Karpenter to consolidate empty nodes:

disruption:
  consolidationPolicy: WhenEmpty      # Only consolidate when NO workload pods
  consolidateAfter: 5m                # Wait 5 minutes after becoming empty
  budgets:
    - nodes: '100%'                   # Allow consolidation of all empty nodes

How job protection actually works:

  • WhenEmpty policy: Karpenter ignores DaemonSet pods when determining if a node is "empty"—a node with only CloudWatch/GuardDuty agents IS considered empty

  • do-not-disrupt annotation: Pods with this annotation prevent their node from being consolidated

  • Combined effect: Running CI jobs are protected by the annotation, but nodes scale down within 5 minutes of jobs completing
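Those rules can be sketched as a toy predicate. This is a simplified model of the behavior described above, not Karpenter's actual implementation, and the `Pod` type here is made up for illustration:

```typescript
interface Pod {
  ownerKind: string; // e.g. "DaemonSet" for node agents, "Job" for CI pods
  annotations?: Record<string, string>;
}

// A node is "empty" under WhenEmpty when every remaining pod is DaemonSet-owned;
// any pod carrying the do-not-disrupt annotation blocks consolidation.
function isConsolidatable(pods: Pod[]): boolean {
  const blocked = pods.some(
    (p) => p.annotations?.["karpenter.sh/do-not-disrupt"] === "true"
  );
  const empty = pods.every((p) => p.ownerKind === "DaemonSet");
  return empty && !blocked;
}

// Node running only the CloudWatch/GuardDuty agents: eligible for consolidation
isConsolidatable([{ ownerKind: "DaemonSet" }, { ownerKind: "DaemonSet" }]); // true

// Node still running an annotated CI job pod: protected
isConsolidatable([
  { ownerKind: "DaemonSet" },
  { ownerKind: "Job", annotations: { "karpenter.sh/do-not-disrupt": "true" } },
]); // false
```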

Lesson 2: Add the do-not-disrupt Annotation

For explicit job protection, annotate your job pods:

[runners.kubernetes.pod_annotations]
  karpenter.sh/do-not-disrupt = "true"

This tells Karpenter: "Never consolidate a node while this pod is running." When the CI job completes and the pod terminates, the annotation goes with it, and the node becomes eligible for consolidation.

Lesson 3: Avoid Burstable Instances for CI/CD Job Pods

We initially included t3/t3a instances in our NodePool for CI job pods. Bad idea.

CI/CD workloads (especially Maven/Gradle builds) are CPU-intensive and will exhaust the burst credits quickly. Once credits are gone, you're throttled to 20% baseline CPU, and your 10-minute build becomes a 50-minute build.

Use m-series or c-series instances that provide consistent CPU performance for job pods:

requirements:
  - key: 'node.kubernetes.io/instance-type'
    operator: In
    values:
      # Good: Fixed-performance instances for CI job pods
      - 'm6a.large'
      - 'm6a.xlarge'
      - 'c6a.large'
      - 'c6a.xlarge'
      # Bad: Burstable instances for CI jobs (avoid these)
      # - 't3.large'
      # - 't3a.xlarge'  # DON'T USE for job pods

Note: This advice applies to the Spot NodePool where CI jobs run. We do use a t3a.medium for the GitLab Runner manager itself—the lightweight, always-on process that coordinates jobs. The runner manager barely uses any CPU (it's just polling GitLab for jobs and creating pods), so it stays well within its burst credits. In fact, we downsized the runner manager from a c7a.medium to a t3a.medium during the migration, saving an additional ~$40/month on that single instance alone. The key distinction: burstable is fine for control plane workloads, not for compute-heavy CI jobs.

Lesson 4: Node Labels Must Match Job Selectors

The Auto Mode NodePool must include labels that match your job pod's nodeSelector:

# NodePool template
metadata:
  labels:
    node-type: spot  # Must match...

# GitLab Runner config
[runners.kubernetes.node_selector]
  node-type = "spot"  # ...this selector

If these don't match, Karpenter won't provision nodes for your jobs—pods will stay pending forever.
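The matching rule itself is plain equality on every key, which a few lines make concrete (a toy re-implementation for illustration, not the scheduler's code):

```typescript
// A nodeSelector is satisfied only if every key/value pair
// appears exactly in the node's labels.
function selectorMatches(
  nodeLabels: Record<string, string>,
  nodeSelector: Record<string, string>
): boolean {
  return Object.entries(nodeSelector).every(([key, value]) => nodeLabels[key] === value);
}

selectorMatches({ "node-type": "spot" }, { "node-type": "spot" });      // true
selectorMatches({ "node-type": "on-demand" }, { "node-type": "spot" }); // false -> pod stays Pending
```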

Lesson 5: Diverse Instance Types for Spot Availability

Don't just specify one or two instance types. Spot capacity varies by instance type and availability zone. More options = higher chance of getting capacity:

values: [
  'm6a.large', 'm6a.xlarge', 'm6a.2xlarge',
  'm7a.large', 'm7a.xlarge', 'm7a.2xlarge',
  'c6a.large', 'c6a.xlarge', 'c6a.2xlarge',
  'c7a.large', 'c7a.xlarge', 'c7a.2xlarge'
]

Karpenter will automatically select from available capacity at the best price.


Drawbacks and Considerations

No solution is perfect. Here are the trade-offs:

1. Spot Interruptions

Spot instances can be reclaimed with 2 minutes notice. For CI/CD:

  • Mitigated by: Diverse instance types (Karpenter will find alternatives)

  • Mitigated by: Most CI jobs are under 30 minutes

  • Mitigated by: GitLab automatically retries failed jobs

  • Reality: We've seen <5% interruption rate in eu-west-1

2. Cold Start Latency

New nodes take 45-90 seconds to provision vs. 30-60 seconds for Fargate pods. In practice, this is negligible because:

  • Karpenter keeps nodes running for consolidateAfter duration

  • Subsequent jobs reuse warm nodes

  • Only the first job after idle periods sees the delay

3. Increased Complexity

Auto Mode is simpler than self-managed Karpenter, but still more complex than Fargate:

  • You need to understand NodePools and disruption settings

  • Debugging requires node-level visibility occasionally

  • More configuration knobs to tune

4. Cost Visibility

With Fargate, costs are per-pod and easy to attribute. With shared nodes, cost allocation becomes fuzzier. AWS Cost Explorer shows EC2 costs, but not which CI jobs caused them.

5. Shared Node Security

Multiple CI jobs share the same node. If you run untrusted code:

  • Use Pod Security Standards (restricted profile)

  • Consider separate NodePools for different trust levels

  • Or stick with Fargate for strict isolation
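The restricted profile from the first bullet is enforced with namespace labels; a minimal sketch, assuming jobs run in the `gitlab` namespace used earlier:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gitlab
  labels:
    # Pod Security Standards: reject privileged pods in the CI namespace
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```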


When to Stick with Fargate

Despite our migration, Fargate is still the right choice for:

  • Low-volume CI/CD: If you run <10 pipelines/day, Fargate's simplicity wins

  • Strict isolation requirements: Each Fargate pod is a separate microVM

  • Unpredictable workloads: Fargate scales to zero perfectly

  • Teams without Kubernetes expertise: Fargate abstracts more complexity

  • Running untrusted code: MicroVM isolation is stronger than pod isolation


Conclusion

Migrating from EKS Fargate to EKS Auto Mode with Spot instances reduced our CI/CD infrastructure costs by 40%. The key enablers were:

  1. EKS Auto Mode: Managed Karpenter without the operational burden

  2. Spot instances: 70% cheaper than on-demand, 80%+ cheaper than Fargate

  3. CDK Infrastructure as Code: Reproducible, version-controlled, and safe to update

The migration wasn't without challenges—understanding Karpenter's disruption budgets took some debugging. But with the right configuration, we now have a cost-effective, reliable CI/CD infrastructure that scales automatically.

Key takeaways:

  • Use WhenEmpty with budgets: [{ nodes: '100%' }] for proper consolidation

  • Add do-not-disrupt annotations to protect running jobs

  • Avoid burstable instances (t3/t3a) for CPU-intensive CI workloads

  • Specify diverse instance types for Spot availability

If your Fargate bill is growing and you're comfortable with Kubernetes, EKS Auto Mode is worth serious consideration. The 40% savings we achieved compound quickly, and the operational overhead is minimal thanks to AWS managing the hard parts.


Have questions about this migration? Found a better approach? I'd love to hear from you.


Resources