<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Pawan's Tech Blog]]></title><description><![CDATA[Pawan's Tech Blog]]></description><link>https://thepawan.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1745228800070/33cf05f2-cb8e-49aa-ad46-dab320e3ec2e.png</url><title>Pawan&apos;s Tech Blog</title><link>https://thepawan.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 10:23:36 GMT</lastBuildDate><atom:link href="https://thepawan.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Migrating GitLab Runners from EKS Fargate to EKS Auto Mode: A 40% Cost Reduction Journey]]></title><description><![CDATA[The Problem: Fargate Costs Were Adding Up
For over a year, we ran our GitLab Runners on Amazon EKS with AWS Fargate. The serverless approach was appealing—no nodes to manage, automatic scaling, and a ]]></description><link>https://thepawan.dev/migrating-gitlab-runners-from-eks-fargate-to-eks-auto-mode-a-40-cost-reduction-journey</link><guid isPermaLink="true">https://thepawan.dev/migrating-gitlab-runners-from-eks-fargate-to-eks-auto-mode-a-40-cost-reduction-journey</guid><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Tue, 24 Mar 2026 11:50:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767949597033/18bcbe7d-fbff-4918-88c9-c8627cf9d64b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<img src="https://cdn.hashnode.com/uploads/covers/67db3328954e01d80347c10f/ca68bea3-7b4f-4569-91ef-bf49db8b2c3e.png" alt="" style="display:block;margin:0 auto" />

<h2>The Problem: Fargate Costs Were Adding Up</h2>
<p>For over a year, we ran our GitLab Runners on Amazon EKS with AWS Fargate. The serverless approach was appealing—no nodes to manage, automatic scaling, and a simple mental model. Each CI/CD job spun up as a Fargate pod, ran its tests, and disappeared.</p>
<p>But as our engineering team grew and pipeline frequency increased, the monthly bill told a different story:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Monthly Cost</th>
</tr>
</thead>
<tbody><tr>
<td>Fargate vCPU hours</td>
<td>~$115</td>
</tr>
<tr>
<td>Fargate memory hours</td>
<td>~$28</td>
</tr>
<tr>
<td>EKS control plane</td>
<td>~$74</td>
</tr>
<tr>
<td>NAT Gateway</td>
<td>~$77</td>
</tr>
<tr>
<td>Supporting infrastructure</td>
<td>~$90</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>~$385/month</strong></td>
</tr>
</tbody></table>
<p>The Fargate compute alone was costing us <strong>$143/month</strong> for what amounted to intermittent CI/CD workloads. Our pipelines ran maybe 4-6 hours of actual compute per day, yet we were paying premium serverless pricing for every second.</p>
<h3>Why Fargate Becomes Expensive for CI/CD</h3>
<p>Fargate pricing in eu-west-1:</p>
<ul>
<li><p><strong>$0.04048</strong> per vCPU per hour</p>
</li>
<li><p><strong>$0.004445</strong> per GB memory per hour</p>
</li>
</ul>
<p>For a typical CI job requesting 2 vCPU and 4GB memory running for 15 minutes:</p>
<ul>
<li><p>Fargate cost: ~$0.034 per job</p>
</li>
<li><p>Equivalent spot instance (m6a.large): ~$0.007 per job</p>
</li>
</ul>
<p>That's nearly <strong>5x more expensive</strong> than spot instances for the same workload.</p>
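<p>As a rough sanity check, the quoted rates can be combined like this. This is a sketch covering vCPU and memory only, so it understates the full per-job figure, which also reflects storage and other charges:</p>
<pre><code class="language-typescript">// Fargate eu-west-1 rates quoted above (USD)
const VCPU_PER_HOUR = 0.04048;
const GB_PER_HOUR = 0.004445;

// Raw compute cost of a single CI job
function fargateComputeCost(vcpu: number, memoryGb: number, minutes: number): number {
  const hours = minutes / 60;
  return vcpu * VCPU_PER_HOUR * hours + memoryGb * GB_PER_HOUR * hours;
}

// 2 vCPU / 4 GB for 15 minutes: roughly $0.025 of raw compute per job
const perJob = fargateComputeCost(2, 4, 15);
</code></pre>
<p>A few cents per job sounds harmless, but it compounds quickly across a hundred pipelines a day.</p>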
<hr />
<h2>Why Run GitLab Runners on Kubernetes (EKS)?</h2>
<p>Before diving into our solution, it's worth understanding why we chose Kubernetes in the first place. There are several ways to run GitLab Runners on AWS:</p>
<h3>Option 1: EC2 Instances (Docker/Shell Executor)</h3>
<p>The traditional approach—run GitLab Runner directly on EC2 instances using the Docker or Shell executor.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Simple to set up and understand</p>
</li>
<li><p>Full control over the environment</p>
</li>
<li><p>Works with existing EC2 knowledge</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p><strong>Manual scaling</strong>: You manage autoscaling groups, launch templates, lifecycle hooks</p>
</li>
<li><p><strong>Resource waste</strong>: Instances run 24/7 or you build complex scaling logic</p>
</li>
<li><p><strong>Docker-in-Docker issues</strong>: Security concerns with privileged containers</p>
</li>
<li><p><strong>Maintenance burden</strong>: OS patching, Docker updates, runner upgrades are your responsibility</p>
</li>
<li><p><strong>No bin-packing</strong>: Each instance typically runs one job at a time (or complex queue management)</p>
</li>
</ul>
<h3>Option 2: ECS with Fargate or EC2</h3>
<p>Amazon ECS offers container orchestration without Kubernetes complexity.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p>Native AWS integration</p>
</li>
<li><p>Fargate provides serverless containers</p>
</li>
<li><p>Simpler than Kubernetes for basic use cases</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p><strong>No native GitLab integration</strong>: GitLab Runner doesn't have an ECS executor—you'd need custom solutions</p>
</li>
<li><p><strong>Task definition management</strong>: More complex than Kubernetes pods for dynamic workloads</p>
</li>
<li><p><strong>Limited ecosystem</strong>: Fewer community tools and patterns compared to Kubernetes</p>
</li>
<li><p><strong>Vendor lock-in</strong>: ECS is AWS-specific; Kubernetes skills transfer across clouds</p>
</li>
</ul>
<h3>Option 3: EKS with Kubernetes Executor (Our Choice)</h3>
<p>GitLab Runner's Kubernetes executor is purpose-built for CI/CD on Kubernetes.</p>
<p><strong>Pros:</strong></p>
<ul>
<li><p><strong>Native GitLab integration</strong>: First-class support in GitLab Runner</p>
</li>
<li><p><strong>Automatic pod lifecycle</strong>: Each job gets a fresh pod, automatic cleanup</p>
</li>
<li><p><strong>Bin-packing</strong>: Multiple jobs share nodes efficiently</p>
</li>
<li><p><strong>Karpenter/Auto Mode</strong>: Intelligent, automatic node provisioning in seconds</p>
</li>
<li><p><strong>Ecosystem benefits</strong>: Helm charts, operators, monitoring tools</p>
</li>
<li><p><strong>Portability</strong>: Same configuration works on any Kubernetes cluster</p>
</li>
<li><p><strong>Security</strong>: Pod security standards, network policies, IRSA for AWS access</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li><p><strong>Kubernetes complexity</strong>: Learning curve if you're new to K8s</p>
</li>
<li><p><strong>More moving parts</strong>: Nodes, pods, services vs. just EC2 instances</p>
</li>
</ul>
<h3>Why EKS Won for Us</h3>
<p>The deciding factors were:</p>
<ol>
<li><p><strong>GitLab's Kubernetes executor is mature</strong>: Handles job isolation, artifact management, and service containers natively</p>
</li>
<li><p><strong>Karpenter changes everything</strong>: No more managing autoscaling groups—Karpenter provisions exactly what you need in seconds</p>
</li>
<li><p><strong>Cost efficiency through bin-packing</strong>: Multiple CI jobs share a single node, maximizing utilization</p>
</li>
<li><p><strong>Team skills</strong>: We already run production workloads on Kubernetes</p>
</li>
<li><p><strong>Future flexibility</strong>: Easy to migrate to another cloud or on-premises if needed</p>
</li>
</ol>
<hr />
<h2>Why EKS Auto Mode with Spot Over Standard EKS with Fargate?</h2>
<p>This is the key architectural decision. Both approaches run on EKS, but they have fundamentally different characteristics.</p>
<h3>EKS with Fargate: The Serverless Promise</h3>
<p>Fargate abstracts away nodes entirely. You define pod specs, and AWS handles the compute.</p>
<p><strong>How Fargate works for GitLab Runners:</strong></p>
<pre><code class="language-plaintext">Job triggered → GitLab Runner creates pod → Fargate provisions microVM → Job runs → Pod terminates
</code></pre>
<p><strong>Fargate Advantages:</strong></p>
<ul>
<li><p>Zero node management</p>
</li>
<li><p>Per-second billing</p>
</li>
<li><p>Strong isolation (each pod is a separate microVM)</p>
</li>
<li><p>No capacity planning</p>
</li>
<li><p>Automatic security patching</p>
</li>
</ul>
<p><strong>Fargate Disadvantages for CI/CD:</strong></p>
<ul>
<li><p><strong>Premium pricing</strong>: 20-40% more expensive than on-demand EC2, 5x more than Spot</p>
</li>
<li><p><strong>No Spot support</strong>: Can't use Spot instances with Fargate (as of 2024)</p>
</li>
<li><p><strong>Cold starts</strong>: Every job spins up a new microVM (30-60 seconds)</p>
</li>
<li><p><strong>No node reuse</strong>: Can't cache Docker layers, npm packages, or Maven artifacts on disk</p>
</li>
<li><p><strong>Resource limits</strong>: 4 vCPU / 30GB memory maximum per pod</p>
</li>
<li><p><strong>No DaemonSets</strong>: Can't run node-level agents (though less relevant for CI/CD)</p>
</li>
</ul>
<h3>EKS Auto Mode with Spot: Managed Karpenter</h3>
<p>Auto Mode gives you the operational simplicity of Fargate with the flexibility and cost of EC2.</p>
<p><strong>How Auto Mode works for GitLab Runners:</strong></p>
<pre><code class="language-plaintext">Job triggered → GitLab Runner creates pod → Karpenter provisions Spot node (if needed) → Job runs on shared node → Pod terminates → Node consolidates when empty
</code></pre>
<p><strong>Auto Mode Advantages:</strong></p>
<ul>
<li><p><strong>Spot pricing</strong>: 60-70% cheaper than on-demand, 80%+ cheaper than Fargate</p>
</li>
<li><p><strong>Intelligent provisioning</strong>: Karpenter selects optimal instance types in seconds</p>
</li>
<li><p><strong>Node reuse</strong>: Multiple jobs share warm nodes—no cold starts</p>
</li>
<li><p><strong>Diverse instance pools</strong>: Specify many instance types for Spot availability</p>
</li>
<li><p><strong>Managed Karpenter</strong>: AWS handles installation, upgrades, and security patches</p>
</li>
<li><p><strong>Consolidation</strong>: Automatically terminates unused nodes</p>
</li>
</ul>
<p><strong>Auto Mode Considerations:</strong></p>
<ul>
<li><p><strong>Spot interruptions</strong>: 2-minute warning when instances are reclaimed (mitigated by diverse instance types)</p>
</li>
<li><p><strong>Some node awareness needed</strong>: You configure NodePools, though it's simpler than managing node groups</p>
</li>
<li><p><strong>Shared security model</strong>: Multiple pods share a node (use Pod Security Standards)</p>
</li>
</ul>
<h3>Cost Comparison: Real Numbers</h3>
<p>For our workload (~100 CI jobs/day, average 15 minutes each):</p>
<table>
<thead>
<tr>
<th>Approach</th>
<th>Monthly Compute Cost</th>
<th>Explanation</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Fargate</strong></td>
<td>~$143</td>
<td>Premium per-second billing, every job is a new microVM</td>
</tr>
<tr>
<td><strong>On-Demand EC2</strong></td>
<td>~$85</td>
<td>Better bin-packing, but still paying full price</td>
</tr>
<tr>
<td><strong>Spot EC2</strong></td>
<td>~$45</td>
<td>60-70% discount, excellent bin-packing with Karpenter</td>
</tr>
<tr>
<td><strong>Reserved</strong></td>
<td>~$55</td>
<td>Requires commitment, doesn't scale to zero</td>
</tr>
</tbody></table>
<p>Spot on Auto Mode delivered <strong>68% compute savings</strong> compared to Fargate.</p>
<h3>When to Choose Each</h3>
<p><strong>Choose Fargate when:</strong></p>
<ul>
<li><p>Security requires microVM isolation per job</p>
</li>
<li><p>Workloads are unpredictable or very low volume</p>
</li>
<li><p>Team lacks Kubernetes experience</p>
</li>
<li><p>Simplicity is worth the cost premium</p>
</li>
</ul>
<p><strong>Choose Auto Mode with Spot when:</strong></p>
<ul>
<li><p>Cost optimization is a priority</p>
</li>
<li><p>You run enough jobs to benefit from node reuse</p>
</li>
<li><p>You can tolerate occasional Spot interruptions (GitLab retries automatically)</p>
</li>
<li><p>Team is comfortable with basic Kubernetes concepts</p>
</li>
</ul>
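<p>The automatic retry mentioned above is configured per job in <code>.gitlab-ci.yml</code>. A minimal sketch, with an illustrative job name and script:</p>
<pre><code class="language-yaml"># Retry jobs that die because the runner's Spot node was reclaimed
test:
  script:
    - make test
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
</code></pre>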
<hr />
<h2>The Solution: EKS Auto Mode with Spot Instances</h2>
<p>In late 2024, AWS announced EKS Auto Mode—a fully managed experience that handles node provisioning, scaling, and lifecycle management automatically. Unlike traditional EKS where you manage node groups or install Karpenter yourself, Auto Mode includes:</p>
<ul>
<li><p><strong>Built-in Karpenter</strong>: AWS manages the Karpenter installation and upgrades</p>
</li>
<li><p><strong>Managed node classes</strong>: Pre-configured, secure node templates</p>
</li>
<li><p><strong>Automatic scaling</strong>: Nodes spin up in seconds when pods are pending</p>
</li>
<li><p><strong>Spot instance support</strong>: Native integration with EC2 Spot for massive cost savings</p>
</li>
</ul>
<p>This was exactly what we needed: operational simplicity approaching Fargate's, with the cost efficiency of EC2 Spot.</p>
<hr />
<h2>Why CDK Instead of CLI Commands?</h2>
<p>AWS published an excellent blog post titled <em>"Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode"</em> that walks through setting up GitLab Runners on Auto Mode using CLI commands. It's a great tutorial for getting started.</p>
<p>However, for production infrastructure, we chose AWS CDK (Infrastructure as Code) instead. Here's why:</p>
<h3>1. Reproducibility</h3>
<pre><code class="language-bash"># CLI approach - hope you documented everything
eksctl create cluster --name gitlab-runners --version 1.34 ...
kubectl apply -f nodepool.yaml
helm install gitlab-runner ...
</code></pre>
<p>vs.</p>
<pre><code class="language-typescript">// CDK approach - the code IS the documentation
const cluster = new eks.Cluster(this, 'GitLabRunners', {
  version: eks.KubernetesVersion.V1_34,
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  // Every configuration decision is captured here
});
</code></pre>
<p>With CDK, our entire cluster configuration is version-controlled. Six months from now, when someone asks "why did we configure Karpenter consolidation this way?", the answer is in the Git history.</p>
<h3>2. Multi-Environment Consistency</h3>
<p>We run separate clusters for development, staging, and production CI/CD. CDK lets us define the infrastructure once and deploy it consistently:</p>
<pre><code class="language-typescript">// Same stack, different environments
// (envName is our own prop; CDK's built-in `env` is reserved for account/region)
new GitLabRunnersStack(app, 'Dev', { envName: 'development' });
new GitLabRunnersStack(app, 'Prod', { envName: 'production' });
</code></pre>
<p>With CLI commands, you're copying and pasting between environments, inevitably introducing drift.</p>
<h3>3. Dependency Management</h3>
<p>Our GitLab Runner stack depends on:</p>
<ul>
<li><p>An existing VPC with specific subnets</p>
</li>
<li><p>IAM roles with IRSA (IAM Roles for Service Accounts)</p>
</li>
<li><p>S3 buckets for build cache</p>
</li>
<li><p>Secrets Manager for runner tokens</p>
</li>
</ul>
<p>CDK handles these dependencies elegantly:</p>
<pre><code class="language-typescript">const cacheBucket = new s3.Bucket(this, 'RunnerCache', {
  lifecycleRules: [{ expiration: cdk.Duration.days(30) }],
});

const oidc = cluster.openIdConnectProvider;
const runnerRole = new iam.Role(this, 'RunnerRole', {
  assumedBy: new iam.FederatedPrincipal(
    oidc.openIdConnectProviderArn,
    // IRSA trust: only the gitlab-runner service account may assume this role
    {
      StringEquals: {
        [`${oidc.openIdConnectProviderIssuer}:sub`]:
          'system:serviceaccount:gitlab:gitlab-runner',
      },
    },
    'sts:AssumeRoleWithWebIdentity',
  ),
});

runnerRole.addToPolicy(new iam.PolicyStatement({
  actions: ['s3:GetObject', 's3:PutObject'],
  resources: [cacheBucket.arnForObjects('*')],
}));
</code></pre>
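<p>On the runner side, the cache bucket and IRSA role come together in the runner's <code>config.toml</code>. A sketch with an illustrative bucket name and region; <code>AuthenticationType = "iam"</code> makes the runner use the pod's IRSA credentials instead of static keys:</p>
<pre><code class="language-toml">[runners.cache]
  Type = "s3"
  Shared = true
  [runners.cache.s3]
    BucketName = "gitlab-runner-cache"  # illustrative name
    BucketLocation = "eu-west-1"
    AuthenticationType = "iam"          # credentials come from IRSA
</code></pre>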
<h3>4. Safer Updates</h3>
<p>When we needed to change the Karpenter NodePool configuration (more on that later), CDK gave us:</p>
<ul>
<li><p><code>cdk diff</code> to preview changes before applying</p>
</li>
<li><p>CloudFormation rollback if something went wrong</p>
</li>
<li><p>A clear audit trail of what changed and when</p>
</li>
</ul>
<h3>5. Integration with Existing Infrastructure</h3>
<p>Our CDK codebase already manages VPCs, databases, and other AWS resources. Adding the GitLab Runners stack meant it automatically inherited:</p>
<ul>
<li><p>Consistent tagging policies</p>
</li>
<li><p>Security group rules</p>
</li>
<li><p>Monitoring and alerting configuration</p>
</li>
<li><p>Cost allocation tags</p>
</li>
</ul>
<hr />
<h2>The Implementation</h2>
<h3>Cluster Configuration</h3>
<pre><code class="language-typescript">const cluster = new eks.Cluster(this, 'GitLabRunnersCluster', {
  clusterName: `gitlab-runners-${environment}`,
  version: eks.KubernetesVersion.V1_34,
  kubectlLayer: new KubectlV34Layer(this, 'KubectlLayer'),
  
  // This is the magic - Auto Mode handles everything
  defaultCapacityType: eks.DefaultCapacityType.AUTOMODE,
  
  // Use existing VPC
  vpc: ec2.Vpc.fromLookup(this, 'Vpc', { vpcId: config.vpcId }),
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
  
  // Enable cluster logging for debugging
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUDIT,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
});
</code></pre>
<h3>NodePool for Spot Instances</h3>
<p>The key to cost savings is the Karpenter NodePool configuration:</p>
<pre><code class="language-typescript">const spotNodePool = new eks.KubernetesManifest(this, 'SpotNodePool', {
  cluster,
  manifest: [{
    apiVersion: 'karpenter.sh/v1',
    kind: 'NodePool',
    metadata: { name: 'gitlab-spot' },
    spec: {
      template: {
        metadata: {
          labels: { 'node-type': 'spot' }
        },
        spec: {
          nodeClassRef: {
            group: 'eks.amazonaws.com',
            kind: 'NodeClass',
            name: 'default'  // Use Auto Mode's managed node class
          },
          requirements: [
            {
              key: 'node.kubernetes.io/instance-type',
              operator: 'In',
              values: [
                // Diverse instance types for spot availability
                'm6a.large', 'm6a.xlarge',
                'm7a.large', 'm7a.xlarge',
                'c6a.large', 'c6a.xlarge',
                'c7a.large', 'c7a.xlarge',
              ]
            },
            {
              key: 'karpenter.sh/capacity-type',
              operator: 'In',
              values: ['spot']  // Spot instances only
            },
          ],
        }
      },
      disruption: {
        consolidationPolicy: 'WhenEmpty',
        consolidateAfter: '5m',
        budgets: [{ nodes: '100%' }]  // Allow consolidation of empty nodes
      }
    }
  }]
});
</code></pre>
<h3>GitLab Runner Helm Chart</h3>
<pre><code class="language-typescript">const gitlabRunner = cluster.addHelmChart('GitLabRunner', {
  chart: 'gitlab-runner',
  repository: 'https://charts.gitlab.io',
  namespace: 'gitlab',
  values: {
    gitlabUrl: 'https://gitlab.com',
    runnerToken: runnerToken.secretValue.unsafeUnwrap(),
    concurrent: 20,
    runners: {
      config: `
        [[runners]]
          executor = "kubernetes"
          [runners.kubernetes]
            namespace = "gitlab"
            cpu_request = "2"
            memory_request = "4Gi"
            [runners.kubernetes.node_selector]
              node-type = "spot"
            [runners.kubernetes.pod_annotations]
              karpenter.sh/do-not-disrupt = "true"
      `
    }
  }
});
</code></pre>
<hr />
<h2>The Results</h2>
<p>After migrating and decommissioning the old Fargate cluster:</p>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Before (Fargate)</th>
<th>After (Auto Mode)</th>
<th>Change</th>
</tr>
</thead>
<tbody><tr>
<td>Monthly compute</td>
<td>$143</td>
<td>~$45</td>
<td><strong>-68%</strong></td>
</tr>
<tr>
<td>Total monthly</td>
<td>$385</td>
<td>~$230</td>
<td><strong>-40%</strong></td>
</tr>
<tr>
<td>Annual savings</td>
<td>-</td>
<td>~$1,860</td>
<td>-</td>
</tr>
</tbody></table>
<p>The dramatic compute savings come from:</p>
<ol>
<li><p><strong>Spot pricing</strong>: 60-70% cheaper than on-demand, and 80%+ cheaper than Fargate</p>
</li>
<li><p><strong>Efficient bin-packing</strong>: Multiple CI jobs share the same node</p>
</li>
<li><p><strong>Right-sized instances</strong>: Karpenter picks the optimal instance type for pending pods</p>
</li>
</ol>
<hr />
<h2>Lessons Learned (The Hard Way)</h2>
<h3>Lesson 1: Understanding Karpenter Disruption Budgets</h3>
<p>Our first deployment used settings designed to prevent job disruption:</p>
<pre><code class="language-yaml">disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 5m
  budgets:
    - nodes: '0'  # WRONG - this blocks ALL consolidation!
</code></pre>
<p>We thought <code>nodes: '0'</code> meant "don't evict nodes with running pods." What it actually means is "you can disrupt ZERO nodes at any time"—which completely blocks consolidation, even for empty nodes.</p>
<p><strong>The symptom</strong>: After CI jobs completed, nodes with only DaemonSet pods (CloudWatch agent, GuardDuty agent) would never terminate. We ended up with 14 orphaned nodes running indefinitely, burning money.</p>
<p><strong>The fix</strong>: Use <code>nodes: '100%'</code> to allow Karpenter to consolidate empty nodes:</p>
<pre><code class="language-yaml">disruption:
  consolidationPolicy: WhenEmpty      # Only consolidate when NO workload pods
  consolidateAfter: 5m                # Wait 5 minutes after becoming empty
  budgets:
    - nodes: '100%'                   # Allow consolidation of all empty nodes
</code></pre>
<p><strong>How job protection actually works:</strong></p>
<ul>
<li><p><code>WhenEmpty</code> policy: Karpenter ignores DaemonSet pods when determining if a node is "empty"—a node with only CloudWatch/GuardDuty agents IS considered empty</p>
</li>
<li><p><code>do-not-disrupt</code> annotation: Pods with this annotation prevent their node from being consolidated</p>
</li>
<li><p>Combined effect: Running CI jobs are protected by the annotation, but nodes scale down within 5 minutes of jobs completing</p>
</li>
</ul>
<h3>Lesson 2: Add the <code>do-not-disrupt</code> Annotation</h3>
<p>For explicit job protection, annotate your job pods:</p>
<pre><code class="language-yaml">[runners.kubernetes.pod_annotations]
  karpenter.sh/do-not-disrupt = "true"
</code></pre>
<p>This tells Karpenter: "Never consolidate a node while this pod is running." When the CI job completes and the pod terminates, the annotation goes with it, and the node becomes eligible for consolidation.</p>
<h3>Lesson 3: Avoid Burstable Instances for CI/CD Job Pods</h3>
<p>We initially included t3/t3a instances in our NodePool for <strong>CI job pods</strong>. Bad idea.</p>
<p>CI/CD workloads (especially Maven/Gradle builds) are CPU-intensive and will exhaust burst credits quickly. Once the credits are gone, you're throttled to the instance's baseline CPU (20% per vCPU on a t3.medium, for example), and your 10-minute build becomes a 50-minute build.</p>
<p><strong>Use m-series or c-series instances</strong> that provide consistent CPU performance for job pods:</p>
<pre><code class="language-yaml">requirements:
  - key: 'node.kubernetes.io/instance-type'
    operator: In
    values:
      # Good: fixed-performance instances for CI job pods
      - 'm6a.large'
      - 'm6a.xlarge'
      - 'c6a.large'
      - 'c6a.xlarge'
      # Bad: burstable instances for CI jobs (avoid these)
      # - 't3.large'   # DON'T USE for job pods
      # - 't3a.xlarge' # DON'T USE for job pods
</code></pre>
<blockquote>
<p><strong>Note</strong>: This advice applies to the <strong>Spot NodePool</strong> where CI jobs run. We <em>do</em> use a <code>t3a.medium</code> for the GitLab Runner manager itself—the lightweight, always-on process that coordinates jobs. The runner manager barely uses any CPU (it's just polling GitLab for jobs and creating pods), so it stays well within its burst credits. In fact, we downsized the runner manager from a <code>c7a.medium</code> to a <code>t3a.medium</code> during the migration, saving an additional ~$40/month on that single instance alone. The key distinction: <strong>burstable is fine for control plane workloads, not for compute-heavy CI jobs.</strong></p>
</blockquote>
<h3>Lesson 4: Node Labels Must Match Job Selectors</h3>
<p>The Auto Mode NodePool must include labels that match your job pod's <code>nodeSelector</code>:</p>
<pre><code class="language-yaml"># NodePool template
metadata:
  labels:
    node-type: spot  # Must match...

# GitLab Runner config
[runners.kubernetes.node_selector]
  node-type = "spot"  # ...this selector
</code></pre>
<p>If these don't match, Karpenter won't provision nodes for your jobs—pods will stay pending forever.</p>
<h3>Lesson 5: Diverse Instance Types for Spot Availability</h3>
<p>Don't just specify one or two instance types. Spot capacity varies by instance type and availability zone. More options = higher chance of getting capacity:</p>
<pre><code class="language-yaml">values: [
  'm6a.large', 'm6a.xlarge', 'm6a.2xlarge',
  'm7a.large', 'm7a.xlarge', 'm7a.2xlarge',
  'c6a.large', 'c6a.xlarge', 'c6a.2xlarge',
  'c7a.large', 'c7a.xlarge', 'c7a.2xlarge',
]
</code></pre>
<p>Karpenter will automatically select from available capacity at the best price.</p>
<hr />
<h2>Drawbacks and Considerations</h2>
<p>No solution is perfect. Here are the trade-offs:</p>
<h3>1. Spot Interruptions</h3>
<p>Spot instances can be reclaimed with 2 minutes notice. For CI/CD:</p>
<ul>
<li><p><strong>Mitigated by</strong>: Diverse instance types (Karpenter will find alternatives)</p>
</li>
<li><p><strong>Mitigated by</strong>: Most CI jobs are under 30 minutes</p>
</li>
<li><p><strong>Mitigated by</strong>: GitLab automatically retries failed jobs</p>
</li>
<li><p><strong>Reality</strong>: We've seen &lt;5% interruption rate in eu-west-1</p>
</li>
</ul>
<h3>2. Cold Start Latency</h3>
<p>New nodes take 45-90 seconds to provision vs. 30-60 seconds for Fargate pods. In practice, this is negligible because:</p>
<ul>
<li><p>Karpenter keeps nodes running for <code>consolidateAfter</code> duration</p>
</li>
<li><p>Subsequent jobs reuse warm nodes</p>
</li>
<li><p>Only the first job after idle periods sees the delay</p>
</li>
</ul>
<h3>3. Increased Complexity</h3>
<p>Auto Mode is simpler than self-managed Karpenter, but still more complex than Fargate:</p>
<ul>
<li><p>You need to understand NodePools and disruption settings</p>
</li>
<li><p>Debugging requires node-level visibility occasionally</p>
</li>
<li><p>More configuration knobs to tune</p>
</li>
</ul>
<h3>4. Cost Visibility</h3>
<p>With Fargate, costs are per-pod and easy to attribute. With shared nodes, cost allocation becomes fuzzier. AWS Cost Explorer shows EC2 costs, but not which CI jobs caused them.</p>
<h3>5. Shared Node Security</h3>
<p>Multiple CI jobs share the same node. If you run untrusted code:</p>
<ul>
<li><p>Use Pod Security Standards (restricted profile)</p>
</li>
<li><p>Consider separate NodePools for different trust levels</p>
</li>
<li><p>Or stick with Fargate for strict isolation</p>
</li>
</ul>
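<p>Pod Security Standards are applied per namespace through labels. A sketch for the runner namespace, assuming your CI jobs don't need privileged containers (Docker-in-Docker jobs would not pass the restricted profile):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Namespace
metadata:
  name: gitlab
  labels:
    # Reject pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    # Also surface warnings so violations are easy to spot during rollout
    pod-security.kubernetes.io/warn: restricted
</code></pre>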
<hr />
<h2>When to Stick with Fargate</h2>
<p>Despite our migration, Fargate is still the right choice for:</p>
<ul>
<li><p><strong>Low-volume CI/CD</strong>: If you run &lt;10 pipelines/day, Fargate's simplicity wins</p>
</li>
<li><p><strong>Strict isolation requirements</strong>: Each Fargate pod is a separate microVM</p>
</li>
<li><p><strong>Unpredictable workloads</strong>: Fargate scales to zero perfectly</p>
</li>
<li><p><strong>Teams without Kubernetes expertise</strong>: Fargate abstracts more complexity</p>
</li>
<li><p><strong>Running untrusted code</strong>: MicroVM isolation is stronger than pod isolation</p>
</li>
</ul>
<hr />
<h2>Conclusion</h2>
<p>Migrating from EKS Fargate to EKS Auto Mode with Spot instances reduced our CI/CD infrastructure costs by 40%. The key enablers were:</p>
<ol>
<li><p><strong>EKS Auto Mode</strong>: Managed Karpenter without the operational burden</p>
</li>
<li><p><strong>Spot instances</strong>: 70% cheaper than on-demand, 80%+ cheaper than Fargate</p>
</li>
<li><p><strong>CDK Infrastructure as Code</strong>: Reproducible, version-controlled, and safe to update</p>
</li>
</ol>
<p>The migration wasn't without challenges—understanding Karpenter's disruption budgets took some debugging. But with the right configuration, we now have a cost-effective, reliable CI/CD infrastructure that scales automatically.</p>
<p><strong>Key takeaways:</strong></p>
<ul>
<li><p>Use <code>WhenEmpty</code> with <code>budgets: [{ nodes: '100%' }]</code> for proper consolidation</p>
</li>
<li><p>Add <code>do-not-disrupt</code> annotations to protect running jobs</p>
</li>
<li><p>Avoid burstable instances (t3/t3a) for CPU-intensive CI workloads</p>
</li>
<li><p>Specify diverse instance types for Spot availability</p>
</li>
</ul>
<p>If your Fargate bill is growing and you're comfortable with Kubernetes, EKS Auto Mode is worth serious consideration. The 40% savings we achieved compound quickly, and the operational overhead is minimal thanks to AWS managing the hard parts.</p>
<hr />
<p><em>Have questions about this migration? Found a better approach? I'd love to hear from you.</em></p>
<hr />
<h2>Resources</h2>
<ul>
<li><p><a href="https://aws.amazon.com/blogs/containers/streamline-your-containerized-ci-cd-with-gitlab-runners-and-amazon-eks-auto-mode/">AWS Blog: Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/eks/latest/userguide/automode.html">EKS Auto Mode Documentation</a></p>
</li>
<li><p><a href="https://karpenter.sh/docs/concepts/disruption/">Karpenter Disruption Documentation</a></p>
</li>
<li><p><a href="https://docs.gitlab.com/runner/executors/kubernetes/">GitLab Runner Kubernetes Executor</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_eks-readme.html">AWS CDK EKS Module</a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html">EC2 Spot Best Practices</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[I'm Now an AWS Community Builder — Here's What That Means]]></title><description><![CDATA[A few days ago, I received an email I'd been hoping for: I've been accepted into the AWS Community Builders program under the Dev Tools category.
In this post, I want to share what the program is, why]]></description><link>https://thepawan.dev/i-m-now-an-aws-community-builder-here-s-what-that-means</link><guid isPermaLink="true">https://thepawan.dev/i-m-now-an-aws-community-builder-here-s-what-that-means</guid><category><![CDATA[AWS]]></category><category><![CDATA[AWS Community Builder]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[PCI DSS]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Fri, 06 Mar 2026 10:36:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/67db3328954e01d80347c10f/5a975dcf-1156-40aa-a08e-99642b67a72b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few days ago, I received an email I'd been hoping for: I've been accepted into the <strong>AWS Community Builders</strong> program under the <strong>Dev Tools</strong> category.</p>
<p>In this post, I want to share what the program is, why I applied, and what I plan to contribute over the coming year.</p>
<h2>What is the AWS Community Builders Program?</h2>
<p>For those unfamiliar, AWS Community Builders is an invite-only program by Amazon Web Services that connects cloud enthusiasts, content creators, and practitioners with AWS product teams and other community members. It's designed for people who are actively building on AWS and sharing their knowledge — through blogs, talks, open source contributions, or community engagement.</p>
<p>As a member, you get access to private Slack channels with AWS engineers and product teams, early visibility into upcoming services and features (under NDA), AWS credits, mentorship, and a community of like-minded builders from around the world.</p>
<h2>Why I Applied</h2>
<p>I've been building on AWS for nearly a decade now. What started as setting up basic EC2 instances has evolved into designing and managing multi-account AWS organisations — complete with PCI DSS/NIST/ISO/SOC2 compliance, and a security-first DevSecOps culture.</p>
<p>For most of that journey, I was heads-down building. But over the past year or two, I've started writing more — documenting the things I've learned, the decisions I've made, and the problems I've solved. This blog has been a big part of that shift.</p>
<p>When I came across the Community Builders application, it felt like a natural next step. I wanted to move from building in isolation to building in the open — learning from others, getting feedback, and contributing back to a community that's given me so much through the years (blog posts, re:Invent talks, open source tools, and countless Stack Overflow answers).</p>
<h2>What I Plan to Share</h2>
<p>Being part of the Community Builders program is a commitment to keep sharing and learning. Here are the areas I'm most excited to write and talk about:</p>
<h3>Containerisation Journeys</h3>
<p>I've been deeply involved in migrating workloads from EC2 to containers. This involves evaluating ECS vs. EKS, designing Helm charts, building reusable CI/CD components for container workflows, setting up preview environments, and figuring out observability with CloudWatch Application Signals and Container Insights. There's no shortage of real-world lessons to share here.</p>
<h3>Security-First Cloud Architecture</h3>
<p>Building for regulated industries means every architectural decision has a compliance dimension. I want to share practical patterns for achieving PCI DSS/ISO/NIST/SOC2 compliance on AWS — not just the theory, but the actual implementation details that are hard to find in documentation.</p>
<h3>AI-Powered DevSecOps</h3>
<p>This is where things get really interesting. I've been building AI agents that automate parts of the DevSecOps workflow — from vulnerability triage and remediation suggestions, to compliance monitoring, to intelligent pipeline analysis. The combination of AWS Bedrock, LangGraph, and MCP servers opens up powerful possibilities.</p>
<h3>Kubernetes in Production</h3>
<p>I have operated EKS clusters across multiple environments. From cluster upgrades and Karpenter autoscaling to running CI/CD runners on EKS with Spot instances (achieving significant cost savings), there's a lot of practical Kubernetes content I plan to share.</p>
<h2>Looking Ahead</h2>
<p>I'm genuinely excited about this. The AWS Community Builders program isn't just a badge — it's an opportunity to connect with people who care about the same things I do: building reliable, secure, well-architected systems on the cloud.</p>
<p>If any of these topics resonate with you, I'd love to connect. You can find me here on the blog, on <a href="https://www.linkedin.com/in/pawan-sawalani-b59129a4/">LinkedIn</a>, or reach out directly.</p>
<p>Here's to a great year of building, sharing, and learning together.</p>
<p><em>— Pawan</em></p>
<hr />
]]></content:encoded></item><item><title><![CDATA[Level Up Your Cloud Security: My Playbook for DevSecOps Acceleration with AWS LZA]]></title><description><![CDATA[Introduction: The Quest for Secure and Agile Cloud Operations
Let's be honest, scaling cloud operations is exciting, but keeping everything secure and agile as you grow? That’s where the real challenge begins. Juggling multiple AWS accounts, ensuring...]]></description><link>https://thepawan.dev/level-up-your-cloud-security-my-playbook-for-devsecops-acceleration-with-aws-lza</link><guid isPermaLink="true">https://thepawan.dev/level-up-your-cloud-security-my-playbook-for-devsecops-acceleration-with-aws-lza</guid><category><![CDATA[AWS]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[cloud security]]></category><category><![CDATA[automation]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Tue, 13 May 2025 19:53:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747165724582/a495e37a-cad7-4288-ae3b-6b82777be991.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction-the-quest-for-secure-and-agile-cloud-operations">Introduction: The Quest for Secure and Agile Cloud Operations</h2>
<p>Let's be honest, scaling cloud operations is exciting, but keeping everything secure and agile as you grow? That’s where the real challenge begins. Juggling multiple AWS accounts, ensuring consistent security policies, and empowering developers without opening Pandora's Box – it’s a familiar story for many of us in the tech trenches.</p>
<p>In our organization, we hit a point where the sheer complexity of managing our expanding AWS footprint was becoming a bottleneck. We were grappling with ensuring consistent security baselines across new projects and maintaining governance without stifling the very innovation the cloud promises. We needed a better way to establish a secure foundation, one that could keep pace with our DevSecOps ambitions. This wasn't just about adding more tools; it was about fundamentally rethinking our approach to cloud platform management. The ad-hoc solutions and manual interventions that worked for a handful of accounts were clearly not sustainable as we scaled. This realization pushed us to look for a more structured, automated, and inherently secure way to manage our AWS estate.</p>
<p>That's when we discovered the AWS Landing Zone Accelerator (LZA). And let me tell you, it wasn't just another tool; it was a pivotal shift in how we approached cloud governance and security. This blog post is my story – our story – of how LZA didn't just help us build a secure baseline, but how it became a powerful accelerator for our DevSecOps practices. We'll dive into what LZA is, the tangible benefits we've seen, and why I believe it's a critical enabler for any organization serious about secure, scalable cloud operations. The journey to LZA was driven by a clear need to move beyond reactive firefighting to proactive, strategic platform building.</p>
<p>Whether you're a cloud architect designing resilient infrastructures, a security engineer fortifying defenses, a developer aiming for faster, secure deployments, or just curious about taming cloud complexity, I think you'll find some valuable takeaways here. The challenges we faced are common, and the solutions LZA offers address fundamental aspects of cloud maturity.</p>
<h2 id="heading-the-multi-account-tightrope-why-managing-aws-at-scale-needs-a-safety-net">The Multi-Account Tightrope: Why Managing AWS at Scale Needs a Safety Net</h2>
<p>As your AWS footprint grows, so does the complexity. What starts as a manageable handful of accounts can quickly morph into a sprawling ecosystem. Without a robust strategy, you're walking a tightrope. The allure of agility and innovation that draws us to the cloud can be quickly hampered if the underlying management of that environment doesn't keep pace.</p>
<p>We certainly felt this pressure. One of the first major hurdles we encountered was <strong>inconsistent security postures</strong>. Each new account or project, often spun up with the best intentions to meet urgent business needs, risked becoming an island, potentially drifting from our organization's core security standards. Ensuring every team adhered to the same critical security configurations, like encryption standards or network access controls, became a constant, manual battle. This inconsistency wasn't just a theoretical risk; it translated into real vulnerabilities and increased our audit burden.  </p>
<p>Then there was the <strong>governance overhead</strong>. Manually enforcing governance policies, managing Identity and Access Management (IAM) at scale, and ensuring compliance across dozens of accounts? It’s a recipe for burnout and, worse, security gaps. Our central security and operations teams were stretched thin, trying to keep up with the demands of a rapidly expanding environment. The complexity of IAM, in particular, became a significant challenge, with the potential for over-privileged roles or inconsistent access patterns across accounts.  </p>
<p>This operational burden directly led to <strong>slow provisioning and innovation drag</strong>. The time it took to provision new, secure environments for development teams started to hinder our agility. What should have been a quick turnaround to support a new initiative often involved lengthy manual setup and verification processes. Instead of accelerating innovation, our foundational setup was becoming a drag, a source of frustration for developers eager to build and deploy.  </p>
<p>Now, don't get me wrong, a multi-account strategy is an AWS best practice for good reasons – resource isolation, security boundaries, simplified billing, and limiting the blast radius of any potential security incident are all crucial. We understood these benefits and were committed to them. But the advantages can quickly be overshadowed by the operational nightmare of managing it all without the right framework. The very structure designed to enhance security and organization can, ironically, introduce new complexities if not managed properly.  </p>
<p>Before LZA, we were investing significant engineering effort into simply maintaining the status quo, building custom scripts, and performing manual checks to keep our multi-account environment somewhat consistent. It felt like we were constantly playing catch-up, reacting to issues rather than proactively building a secure and scalable platform. This reactive mode is antithetical to a DevSecOps culture, which thrives on proactivity and automation. The time spent on these manual, foundational tasks was time not spent on embedding security deeper into our development lifecycles or exploring new ways to innovate securely. This realization was a key driver in our search for a more comprehensive solution.</p>
<h2 id="heading-enter-aws-landing-zone-accelerator-our-foundation-for-secure-innovation">Enter AWS Landing Zone Accelerator: Our Foundation for Secure Innovation</h2>
<p>So, what exactly is this AWS Landing Zone Accelerator or LZA? Think of it as an architectural blueprint and an automation engine, designed by AWS, to help you deploy a secure, resilient, and scalable multi-account AWS environment, fast. It’s not just about creating accounts; it’s about establishing a comprehensive cloud foundation aligned with AWS best practices and numerous global compliance frameworks, such as NIST, CMMC, and HIPAA, depending on the configuration. This alignment provides a significant head start for organizations in regulated industries.  </p>
<p>Several key characteristics define LZA and how it operates. Crucially, LZA is provided as an open-source project built using the AWS Cloud Development Kit (AWS CDK). This is a massive win because it means your entire foundational environment – networking, security services, account structures – is defined as code. This Infrastructure as Code (IaC) approach is fundamental to achieving automation, version control, and repeatability, which are cornerstones of modern cloud management and DevOps practices.  </p>
<p>It's often recommended to deploy AWS Control Tower as your foundational landing zone and then enhance it with LZA. AWS Control Tower provides an easy way to set up and govern a new, secure, multi-account AWS environment with baseline guardrails. LZA then builds upon this, offering a powerful, highly customizable solution across a vast array of AWS services (over 35, in fact!) for managing more complex environments and specific compliance needs. This layered approach allows organizations to start with Control Tower's simplicity and then graduate to LZA's advanced capabilities as their requirements evolve.  </p>
<p>You manage LZA through a simplified set of configuration files, typically written in YAML. These files allow you to define and manage various aspects of your multi-account environment, including foundational networking topology with Amazon Virtual Private Clouds (VPCs), AWS Transit Gateways, and AWS Network Firewall, as well as security services like AWS Config Managed Rules and AWS Security Hub. This configuration-driven approach abstracts away much of the underlying complexity, allowing for powerful customizations without necessarily requiring deep coding expertise for every adjustment.  </p>
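<p>To make this more concrete, here is a small illustrative fragment of the kind of YAML LZA uses. Treat the field names below as an approximation of the <code>security-config.yaml</code> schema (the exact keys vary by LZA version), not a copy-paste-ready file:</p>

```yaml
# security-config.yaml (illustrative sketch; field names are approximate)
centralSecurityServices:
  delegatedAdminAccount: Audit     # account that administers security services
  ebsDefaultVolumeEncryption:
    enable: true                   # encrypt new EBS volumes by default
  guardduty:
    enable: true                   # organization-wide threat detection
  securityHub:
    enable: true                   # aggregated view of security findings
awsConfig:
  enableConfigurationRecorder: true  # record resource configuration changes
```

<p>Changes to a file like this flow through the LZA pipeline, so enabling a new control across every account becomes a reviewed, versioned commit rather than a series of console clicks.</p>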
<p>When we first deployed LZA in our organization, the immediate impact was profound. Suddenly, we had a robust, secure baseline established across our accounts, almost out-of-the-box. This wasn't just a minor improvement; it was a turning point for us. The consistency and pre-configured security controls, such as centralized logging, identity and access management configurations, and network security setups, gave us a level of confidence we hadn't had before. The ability to manage this foundation as code, using the AWS CDK, was the real 'aha!' moment for our engineering teams. It aligned perfectly with our DevOps mindset and immediately clicked. The transparency and control offered by an open-source, CDK-based solution meant we could understand, customize, and truly <em>own</em> our cloud foundation, rather than treating it as an opaque managed service.  </p>
<p>Beyond the technical achievements, we saw several key benefits:</p>
<ul>
<li><p><strong>Speed and Efficiency:</strong> Setting up new, secure accounts and environments went from weeks of manual toil to a streamlined, automated process. This dramatically reduced the labor overhead and lead time associated with onboarding new projects or teams.  </p>
</li>
<li><p><strong>Built-in Security &amp; Compliance:</strong> Knowing that our foundation was aligned with AWS Well-Architected principles and designed to support various compliance frameworks gave our security and Governance, Risk, and Compliance (GRC) teams immense peace of mind. LZA provides the foundational infrastructure from which additional complementary solutions can be integrated to meet specific compliance goals.  </p>
</li>
<li><p><strong>Scalability:</strong> LZA is built for scale. We knew that as we grew, our foundational governance and security would scale with us, not become a bottleneck. The architecture supports managing and governing a multi-account environment suitable for highly-regulated workloads and complex compliance requirements.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747165336539/f30f7f8a-ca35-45e6-8702-a816ae6b837f.png" alt="" class="image--center mx-auto" /></p>
<h2 id="heading-lza-the-devsecops-supercharger">LZA: The DevSecOps Supercharger</h2>
<p>For us, LZA wasn't just about better infrastructure management; it was a direct catalyst for accelerating our DevSecOps adoption. DevSecOps, at its heart, is about integrating security into every phase of the development lifecycle, making it a shared responsibility across development, security, and operations teams. LZA provides the secure and automated playground for this to happen effectively. It addresses the foundational layer, ensuring that the environment where DevSecOps practices are applied is itself secure, consistent, and manageable. This allows teams to focus on application-level security and agile delivery, rather than constantly wrestling with the underlying infrastructure.  </p>
<h3 id="heading-a-shifting-security-left-effortlessly">A. Shifting Security Left, Effortlessly</h3>
<p>One of the core tenets of DevSecOps is "shifting security left" – addressing security concerns as early as possible in the development lifecycle, ideally from the moment developers start coding. LZA embodies this principle at the foundational level. Instead of bolting on security later or discovering misconfigurations in production, LZA provisions environments with pre-configured security services like AWS Security Hub, Amazon GuardDuty, AWS Config rules, AWS Network Firewall, and robust IAM policies from day one.  </p>
<p><strong>Our Experience:</strong> We found that LZA naturally pushed our security considerations earlier into the development lifecycle. Developers receive accounts that already have baseline security measures, detective controls, and preventative guardrails (like SCPs) in place. This significantly reduces the risk of insecure configurations slipping through due to oversight or lack of awareness. For example, network configurations are established with security in mind, and default IAM roles are designed with least privilege.  </p>
<p><strong>Impact:</strong> This proactive stance means fewer security vulnerabilities make it to later stages of development or, worse, into production. This saves us significant time and effort in remediation, reduces the cost of fixing bugs (which increases the later they are found), and ultimately lowers our risk profile. The platform itself becomes an enabler of secure development, rather than an obstacle.  </p>
<h3 id="heading-b-automation-as-a-force-multiplier">B. Automation as a Force Multiplier</h3>
<p>DevOps (and by extension, DevSecOps) thrives on automation. LZA brings extensive automation to the often-manual and error-prone process of setting up and managing a multi-account AWS foundation. Being built on AWS CDK, the entire LZA deployment and configuration update process can be managed through AWS CodePipeline. This means changes to your core infrastructure—like adding new security controls, modifying network routes, or updating SCPs—are deployed in a consistent, auditable, and repeatable manner.  </p>
<p><strong>Our Experience:</strong> The automation LZA brought to provisioning and managing our foundational environment freed up significant engineering time. Our platform team, which was previously bogged down in manual setup and troubleshooting, could now focus on higher-value tasks like developing new platform capabilities or supporting development teams more directly. Our development teams, in turn, received the resources they needed much faster, accelerating their own workflows.  </p>
<p><strong>Impact:</strong> This level of automation not only boosts speed but also drastically reduces the risk of human error – a common source of security misconfigurations. When the "right way" to configure something is the automated way, consistency and adherence to standards improve dramatically. This automated, stable base is critical for then building automated application security testing and deployment pipelines on top.  </p>
<h3 id="heading-c-security-as-code-in-practice">C. Security as Code in Practice</h3>
<p>The principle of "Security as Code" means treating your security configurations with the same rigor as your application code – versioning it, testing it, and automating its deployment. LZA makes this a reality for your cloud foundation. With LZA, security policies, IAM roles and permissions, network configurations (like VPCs and firewall rules), and compliance guardrails are defined in configuration files (YAML) and deployed via the AWS CDK. Even complex IAM setups, including federation with identity providers and granular permission sets, can be managed this way.  </p>
<p><strong>Our Experience:</strong> For us, being able to define and version our security posture as code was a huge win. It simplified audits immensely because the "as-is" state of our security configurations could be easily compared against the "to-be" state defined in code. It made rollbacks safer and more predictable if a change had unintended consequences. Most importantly, it fostered better collaboration between our security, operations, and even development teams because the "rules of the road" were clearly codified and accessible.</p>
<p><strong>Impact:</strong> This approach aligns perfectly with GitOps workflows, where the Git repository becomes the single source of truth for your infrastructure and security configuration. Changes go through pull requests, reviews, and automated pipeline deployments, bringing a new level of discipline and transparency to security management. This dramatically reduces configuration drift and enhances the overall auditability of the environment.  </p>
<h3 id="heading-d-centralized-governance-that-empowers-not-restricts">D. Centralized Governance that Empowers, Not Restricts</h3>
<p>Effective governance in a DevSecOps world isn't about locking everything down; it's about establishing clear guardrails that allow teams to innovate safely within well-defined boundaries. LZA provides the tools for this centralized governance. Through its deep integration with AWS Organizations, Service Control Policies (SCPs), and centralized logging and monitoring (via services like AWS CloudTrail, AWS Config, AWS Security Hub, and Amazon GuardDuty), LZA gives you a holistic view and control over your entire AWS environment.  </p>
<p><strong>Our Experience:</strong> We use LZA-managed SCPs to enforce critical security boundaries—for example, restricting the use of certain AWS Regions or denying access to specific services that don't align with our security policies. This is done at the organizational unit (OU) level, providing broad enforcement without micromanaging individual accounts. Centralized logging, with logs from all accounts aggregated into a dedicated Log Archive account, has also been invaluable for security monitoring, threat detection, and incident response.  </p>
<p><strong>Impact:</strong> This centralized approach ensures consistency and compliance across the organization, while still allowing development teams the autonomy they need within those well-defined boundaries. It’s about enabling speed with safety. Developers can experiment and deploy resources, confident that the foundational guardrails are in place to prevent egregious errors or policy violations. This "trust but verify" model, enabled by strong automated controls, is key to fostering agility in a DevSecOps context.</p>
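<p>For illustration, a region-restriction SCP of the kind described above generally follows AWS's documented deny pattern. The allowed Regions and exempted global services here are placeholders, not our actual policy:</p>

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideAllowedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-west-1", "eu-west-2"]
        }
      }
    }
  ]
}
```

<p>Attached at the OU level, a policy like this denies API calls made outside the listed Regions, while the <code>NotAction</code> block keeps global services such as IAM usable.</p>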
<p>The synergy of these elements—shifting security left for the platform, automating foundational controls, codifying security policies, and enabling intelligent governance—creates an environment where DevSecOps principles aren't just aspirations but are actively supported and reinforced by the underlying cloud infrastructure. This holistic impact is what truly accelerates DevSecOps maturity.</p>
<h2 id="heading-our-lza-journey-key-wins-and-real-world-impact">Our LZA Journey: Key Wins and Real-World Impact</h2>
<p>Beyond the general DevSecOps acceleration, I want to share some specific, tangible wins we experienced after implementing LZA. These are the results that really brought its value home for us, transforming how we operate and innovate on AWS.</p>
<p><strong>Drastic Reduction in Secure Environment Provisioning Time</strong> One of the most immediate and impactful wins was the dramatic reduction in time to provision new, secure development and test environments. What used to take our platform team days, sometimes even weeks, of manual configuration, cross-team approvals, and painstaking checks, now happens in a fraction of that time, fully automated. I'd estimate we cut down provisioning time by over 70% for a standard project environment. This wasn't just about speed; it was about consistency. Every new environment now adheres to our security baseline automatically, thanks to LZA's IaC approach. This agility has been a massive boost for our project teams, allowing them to get started on new initiatives much faster.  </p>
<p><strong>Enhanced Security Posture &amp; Compliance Readiness</strong> Our security team sleeps better at night, and I'm not exaggerating! The consistent application of security controls—like pre-configured Security Groups, Network ACLs, centralized AWS Network Firewalls, and integration with services like Amazon GuardDuty for threat detection and AWS Security Hub for a unified view of security alerts—has significantly improved our overall security posture. Furthermore, automated compliance checks via AWS Config rules, orchestrated by LZA, provide continuous monitoring against our defined standards. When audit season comes around, we're far more prepared because much of the evidence is automatically gathered, and our configurations are codified, versioned, and easily auditable. This has streamlined our interactions with auditors and reduced the stress associated with compliance reporting.  </p>
<p><strong>Developer Empowerment &amp; Increased Velocity</strong> Perhaps counterintuitively for a governance tool, LZA actually empowered our developers. By providing them with secure, pre-approved environments and clear, automated guardrails (through SCPs and detective controls), they could innovate faster without the constant fear of accidental misconfiguration or unintentional policy violation. The "safe sandbox" LZA creates has boosted their velocity and encouraged experimentation. They understand the boundaries, and within those boundaries, they have the freedom to operate. This has fostered a more positive relationship between development and security teams, as security is seen more as an enabler than a blocker.  </p>
<p><strong>A Lesson Learned or Pro-Tip:</strong> One lesson we learned is the importance of investing time upfront in understanding and customizing the LZA configuration files (the YAML files) to truly match your organization's specific needs. While the defaults provided by LZA are excellent and align with general best practices, tailoring aspects like your specific network design, OU structure, or fine-grained IAM permission sets early on pays huge dividends in the long run. Don't just deploy and forget; treat your LZA configuration as a living part of your infrastructure that you iterate on as your needs evolve. This iterative approach ensures the landing zone remains aligned with your business and technical requirements.  </p>
<p><strong>My Opinionated Stance:</strong> From my perspective, LZA isn't just a 'nice-to-have' for organizations serious about AWS; it's rapidly becoming a foundational necessity for anyone looking to scale securely and embrace DevSecOps. The initial learning curve is there, yes—understanding the configuration files and the CDK structure takes some effort. But the long-term benefits in terms of security, operational efficiency, governance, and developer enablement far outweigh that initial investment. The shift from manual, reactive management to an automated, proactive, code-driven approach to our cloud foundation has been a game-changer.</p>
<h2 id="heading-getting-your-hands-on-lza-its-more-accessible-than-you-think">Getting Your Hands on LZA: It's More Accessible Than You Think</h2>
<blockquote>
<p>GitHub repo - <a target="_blank" href="https://github.com/awslabs/landing-zone-accelerator-on-aws">https://github.com/awslabs/landing-zone-accelerator-on-aws</a></p>
</blockquote>
<p>If you're thinking this sounds powerful but perhaps overwhelmingly complex to implement, there's good news. You're likely not starting from absolute zero, especially if you're already using or considering AWS Control Tower.</p>
<p>As mentioned, LZA is designed to enhance an AWS Control Tower setup. Control Tower lays down the initial multi-account structure and baseline guardrails, providing a guided, user-friendly way to get started with a well-architected environment. LZA then comes in to add layers of advanced customization, more granular security controls, sophisticated networking configurations, and alignment with specific, often stringent, compliance frameworks. So, if you have Control Tower, you have a solid launching pad for LZA.  </p>
<p>A huge plus is that LZA is an open-source project, available on GitHub under the <code>awslabs</code> organization. You can find the code, explore how it works, see how it's structured, and understand the underlying automation. This transparency is invaluable. It means a community is building around it, sharing best practices, configurations, and solutions to common challenges. Being open source also means you're not locked into a proprietary black box; you have the ability to understand and, if necessary, adapt the solution.  </p>
<p>Because it's built on the AWS Cloud Development Kit (CDK), if your team has experience with common programming languages like TypeScript or Python (the primary languages supported by CDK), they can understand, manage, and even extend the LZA codebase. This is a significant advantage over purely template-based solutions (like raw CloudFormation) or GUI-driven configurations, as it allows you to apply software development best practices to your infrastructure management. This accessibility of the codebase can also help bridge the skill gap between infrastructure and development teams, fostering better collaboration.  </p>
<p>Ready to explore further? AWS provides extensive documentation, including an Implementation Guide, a Solution Overview, and sample configurations that can help you get started. The GitHub repository itself is a goldmine of information, containing not just the source code but also issue trackers where you can see ongoing development, community discussions, and known challenges. These resources can significantly flatten the learning curve.  </p>
<p>While LZA automates a tremendous amount, deploying and customizing it effectively does require an investment in learning and planning. It's not a magic button, but it <em>is</em> a powerful accelerator. You'll need to understand your organization's specific security, networking, and compliance requirements to tailor the LZA configuration files effectively. For us, the effort invested upfront in planning and understanding LZA's capabilities was well worth the outcome in terms of long-term stability, security, and operational efficiency. The move towards managing our foundational infrastructure as code with LZA has democratized access to what was previously a very complex and specialized domain, allowing more of our team to contribute to and understand our cloud platform.</p>
<h2 id="heading-conclusion-building-a-secure-agile-future-with-lza-powered-devsecops">Conclusion: Building a Secure, Agile Future with LZA-Powered DevSecOps</h2>
<p>Our journey with AWS Landing Zone Accelerator has been transformative. It provided the secure, automated, and governed foundation we desperately needed to scale our AWS environment effectively. More importantly, it has been a powerful catalyst for our DevSecOps maturity, enabling us to integrate security more deeply and efficiently into our cloud operations and development lifecycles.</p>
<p>By baking in security from the start, automating foundational configurations, enabling security as code, and providing robust centralized governance, LZA has allowed us to move faster, more securely, and with greater confidence. The shift from a reactive, often manual approach to a proactive, automated, and code-driven paradigm for our cloud foundation has unlocked new levels of agility and resilience. It has allowed us to focus more on innovation and less on the undifferentiated heavy lifting of managing a complex multi-account environment.</p>
<p>In today's cloud landscape, speed and security are not mutually exclusive – they are prerequisites for success. Tools like AWS LZA are vital in bridging that gap, turning complex challenges into manageable, automated processes. It exemplifies a broader industry trend towards codifying and automating all aspects of IT infrastructure, with security as an integral component from the outset.</p>
<p>If your organization is navigating the complexities of a multi-account AWS environment and striving to accelerate your DevSecOps adoption, I wholeheartedly recommend taking a serious look at the AWS Landing Zone Accelerator. It certainly changed the game for us, and I believe it can do the same for you. Start by exploring the AWS documentation and the GitHub repository – your future, more secure and agile cloud self will thank you. The initial investment in learning and configuration will pay dividends in the form of a more robust, compliant, and innovation-friendly cloud platform.</p>
]]></content:encoded></item><item><title><![CDATA[Automating AWS EC2 CloudWatch Agent Monitoring & Email Alerting with Lambda and CDK]]></title><description><![CDATA[Ensuring that the CloudWatch Agent is installed and running on all EC2 instances is crucial for complete observability. In this guide, we’ll explore why monitoring the CloudWatch Agent matters, why you should alert on missing/stopped agents, and how ...]]></description><link>https://thepawan.dev/automating-aws-ec2-cloudwatch-agent-monitoring-and-email-alerting-with-lambda-and-cdk</link><guid isPermaLink="true">https://thepawan.dev/automating-aws-ec2-cloudwatch-agent-monitoring-and-email-alerting-with-lambda-and-cdk</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws lambda]]></category><category><![CDATA[aws-cdk]]></category><category><![CDATA[#CloudWatch]]></category><category><![CDATA[DevSecOps]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Fri, 02 May 2025 13:59:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746194272137/15497aa6-2c78-4b37-b718-5ac76e94f60b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ensuring that the <strong>CloudWatch Agent</strong> is installed and running on all EC2 instances is crucial for complete observability. In this guide, we’ll explore <strong>why monitoring the CloudWatch Agent matters</strong>, <strong>why you should alert on missing/stopped agents</strong>, and <strong>how to implement an automated checker</strong> using an AWS Lambda function. We’ll then walk through deploying the solution with <strong>AWS CDK (TypeScript)</strong> for a repeatable setup. The language will be friendly and the approach hands-on, so you can follow along easily.</p>
<h2 id="heading-why-cloudwatch-agent-monitoring-is-important-for-ec2"><strong>Why CloudWatch Agent Monitoring is Important for EC2</strong></h2>
<p>AWS EC2 instances by default only send a limited set of metrics to CloudWatch (CPU, network, etc.). The <strong>CloudWatch Agent</strong> extends this by collecting additional system metrics and logs from inside the instance. For example, the CloudWatch Agent can report <strong>memory usage, disk utilization, detailed OS metrics</strong>, and even custom application metrics, which are not available through the default EC2 monitoring. It also streams system logs (and custom log files) to CloudWatch Logs for centralized logging.</p>
<p>By running the CloudWatch Agent on your servers, you get a <strong>much more comprehensive view of instance health</strong>. Memory consumption, disk space, swap usage, and application logs are critical for diagnosing issues, and the CloudWatch Agent gathers these with minimal effort. In short, <strong>CloudWatch Agent monitoring is vital</strong> because it ensures that you’re not flying blind on important metrics and logs that go beyond the basic EC2 data.</p>
<h2 id="heading-why-you-need-alerts-for-missing-or-stopped-agents"><strong>Why You Need Alerts for Missing or Stopped Agents</strong></h2>
<p>If the CloudWatch Agent is <strong>missing or stopped</strong> on an EC2 instance, you effectively lose visibility into that instance’s detailed metrics and logs. This is a serious blind spot: imagine a scenario where an application is consuming all memory on a server, but you have no CloudWatch metrics or logs to alert you because the agent that collects them isn’t running. By the time you realize there’s a problem, it might be too late to prevent an outage.</p>
<p>Setting up <strong>alerts for CloudWatch Agent status</strong> ensures that you are notified as soon as an agent isn’t running when it should be. This proactive alerting allows your team to quickly remediate the issue (install or restart the agent) before it impacts monitoring or operations. It’s essentially <strong>monitoring your monitoring</strong> – a safety net that catches misconfigurations or failures in the telemetry pipeline. In real-world use, this kind of alert can save hours of troubleshooting during incidents, because you’ll immediately know if lack of metrics is due to an agent issue. It also helps maintain <strong>compliance</strong> with any internal policies that <strong>all instances must have monitoring active</strong>. The bottom line: if the CloudWatch Agent stops, you want to know right away so you can fix it and restore full visibility.</p>
<h2 id="heading-building-an-automated-cloudwatch-agent-status-checker-lambda-ssm"><strong>Building an Automated CloudWatch Agent Status Checker (Lambda + SSM)</strong></h2>
<p>To automatically detect and alert on CloudWatch Agent issues, we’ll build a <strong>Python AWS Lambda function</strong> that runs on a schedule. This function will use <strong>AWS Systems Manager (SSM)</strong> to remotely check each EC2 instance and verify the CloudWatch Agent’s status. If the agent is not installed or not running on any instance, the Lambda will send an <strong>email alert (via Amazon SES)</strong> in Markdown format summarizing the problem. Here’s how it works:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746193311231/16a21ef3-b41e-4d0c-a141-3b7e4705560c.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-discovering-ec2-instances-via-ssm"><strong>Discovering EC2 Instances via SSM</strong></h3>
<p>First, the Lambda function needs to know <strong>which EC2 instances to check</strong>. We leverage AWS Systems Manager for this, since SSM can enumerate instances that have the SSM Agent running. Using the SSM DescribeInstanceInformation API (describe_instance_information in boto3), the Lambda can list all managed instances. We typically filter this to instances that are currently online with SSM (PingStatus = “Online”). This ensures we target only instances that are up and have the SSM agent available to run commands. You could also filter by tags (for example, only check instances with a specific tag like Monitoring=true if you don’t want to cover every instance), but the key is that SSM gives us a reliable inventory of instances to probe.</p>
<p>In Python (boto3), this might look like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> boto3

ssm = boto3.client(<span class="hljs-string">'ssm'</span>)
<span class="hljs-comment"># Get all online managed instances</span>
response = ssm.describe_instance_information(
    Filters=[{ <span class="hljs-string">'Key'</span>: <span class="hljs-string">'PingStatus'</span>, <span class="hljs-string">'Values'</span>: [<span class="hljs-string">'Online'</span>] }]
)
instances = [info[<span class="hljs-string">'InstanceId'</span>] <span class="hljs-keyword">for</span> info <span class="hljs-keyword">in</span> response[<span class="hljs-string">'InstanceInformationList'</span>]]
</code></pre>
<p>This collects the list of EC2 instance IDs that we will check. (We assume the SSM agent is installed on your instances – which is true for most modern AWS Linux/Windows AMIs – otherwise SSM can’t run commands on them.)</p>
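<p>One detail the snippet above glosses over: describe_instance_information returns results in pages (up to 50 instances per call), so on larger fleets a single call will silently miss instances. A sketch using a boto3 paginator, assuming an SSM client created as above (the helper name is ours):</p>

```python
def list_online_instance_ids(ssm):
    """Collect every Online SSM-managed instance ID across all result pages."""
    instance_ids = []
    paginator = ssm.get_paginator('describe_instance_information')
    pages = paginator.paginate(
        Filters=[{'Key': 'PingStatus', 'Values': ['Online']}]
    )
    for page in pages:
        instance_ids.extend(
            info['InstanceId'] for info in page['InstanceInformationList']
        )
    return instance_ids
```

<p>The rest of the Lambda can then iterate over this list exactly as before.</p>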
<h3 id="heading-checking-cloudwatch-agent-status-with-ssm-run-command"><strong>Checking CloudWatch Agent Status with SSM Run Command</strong></h3>
<p>For each instance, the Lambda uses <strong>SSM Run Command</strong> to execute a pre-built document called <strong>“AmazonCloudWatch-ManageAgent”</strong>. AWS provides this Systems Manager document to manage the CloudWatch Agent (install, configure, or query its status). We’ll invoke it with the <strong>“status” action</strong>, which tells the agent to report its current status. In effect, this SSM command asks the CloudWatch Agent (via the SSM agent on the instance) whether it’s running, and if so, returns details like the running status and version.</p>
<p>The Lambda uses ssm.send_command for this. For example:</p>
<pre><code class="lang-python">cmd_response = ssm.send_command(
    InstanceIds=[instance_id],
    DocumentName=<span class="hljs-string">'AmazonCloudWatch-ManageAgent'</span>,
    Parameters={
        <span class="hljs-string">'action'</span>: [<span class="hljs-string">'status'</span>],
        <span class="hljs-string">'mode'</span>: [<span class="hljs-string">'ec2'</span>]
    }
)
command_id = cmd_response[<span class="hljs-string">'Command'</span>][<span class="hljs-string">'CommandId'</span>]
</code></pre>
<p>A few notes on this command: We specify the action as “status” and mode “ec2” (since these are EC2 instances, not on-premises). We target one instance at a time here by ID (you could target multiple in one command, but handling results is simpler per instance). The response gives us a CommandId which we’ll use to retrieve the execution output.</p>
<p>Under the hood, the <strong>AmazonCloudWatch-ManageAgent</strong> document will run the amazon-cloudwatch-agent-ctl command on the instance to get the agent status. If the CloudWatch Agent is running, the output will be a small JSON snippet indicating "status": "running" along with the start time and version. If the agent is <strong>stopped (not running)</strong>, the JSON will say "status": "stopped". In cases where the agent is not installed at all, the SSM command might report an error or simply that the service isn’t running (which effectively is the same outcome — it’s not running). We’ll handle those cases as “not running” as well, since either way the instance isn’t being monitored by the agent.</p>
<h3 id="heading-retrieving-and-parsing-the-command-results"><strong>Retrieving and Parsing the Command Results</strong></h3>
<p>SSM Run Command is asynchronous, so after sending the command we need to retrieve the results. We use ssm.get_command_invocation with the Command ID and instance ID to get the output. One important detail here: the <strong>AmazonCloudWatch-ManageAgent</strong> document may consist of multiple steps/plugins internally, so we should specify the <strong>Plugin Name</strong> corresponding to the status action when fetching the results. Otherwise, the API might throw an “InvalidPluginName” error if it doesn’t know which step’s output to return. In our case, the plugin (step) name is “status” (since we invoked the status action).</p>
<p>So, the Lambda will do something like:</p>
<pre><code class="lang-python"><span class="hljs-comment"># (It’s a good practice to wait a few seconds or poll until the command is finished)</span>
result = ssm.get_command_invocation(
    CommandId=command_id,
    InstanceId=instance_id,
    PluginName=<span class="hljs-string">'status'</span>  <span class="hljs-comment"># specify the 'status' step output</span>
)
output_text = result.get(<span class="hljs-string">'StandardOutputContent'</span>, <span class="hljs-string">''</span>)
</code></pre>
<p>The StandardOutputContent will contain the JSON string output from the agent status command. For example, it might be:</p>
<pre><code class="lang-json">{ <span class="hljs-attr">"status"</span>: <span class="hljs-string">"running"</span>, <span class="hljs-attr">"starttime"</span>: <span class="hljs-string">"2025-04-01T12:00:00"</span>, <span class="hljs-attr">"version"</span>: <span class="hljs-string">"1.300257.0"</span> }
</code></pre>
<p>We parse this JSON in the Lambda (e.g., using Python’s json.loads) to easily inspect the fields:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">if</span> output_text:
    data = json.loads(output_text)
    agent_status = data.get(<span class="hljs-string">'status'</span>, <span class="hljs-string">'unknown'</span>)
<span class="hljs-keyword">else</span>:
    agent_status = <span class="hljs-string">'unknown'</span>
</code></pre>
<p>Now, for each instance we have agent_status which will be "running" if the CloudWatch Agent is OK. If the agent is stopped or not installed, we might get "stopped" or no output. We treat any status other than “running” as a problem that needs alerting. (If the SSM command itself failed to execute, we also consider that as the agent not running, since we couldn’t confirm it’s active.)</p>
<p>We can also grab the agent version from the output (the version field) if we want to include it in the report. This could be useful to see what version is running or if an outdated version might be an issue.</p>
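<p>Rather than sleeping a fixed few seconds, the wait-and-parse steps can lean on boto3’s built-in command_executed waiter, which polls get_command_invocation for us. A sketch that also folds all failure cases into a single “unknown” status (the helper name and timeout values are ours):</p>

```python
import json

def get_agent_status(ssm, command_id, instance_id):
    """Wait for the Run Command invocation to finish, then map its output
    to an agent status string: 'running', 'stopped', or 'unknown'."""
    waiter = ssm.get_waiter('command_executed')
    try:
        waiter.wait(
            CommandId=command_id,
            InstanceId=instance_id,
            PluginName='status',
            WaiterConfig={'Delay': 5, 'MaxAttempts': 12},  # poll up to ~60s
        )
    except Exception:
        # botocore raises a WaiterError if the command failed or timed out;
        # either way we could not confirm the agent, so treat it as unknown.
        return 'unknown'
    result = ssm.get_command_invocation(
        CommandId=command_id, InstanceId=instance_id, PluginName='status'
    )
    output_text = result.get('StandardOutputContent', '')
    if not output_text:
        return 'unknown'
    try:
        return json.loads(output_text).get('status', 'unknown')
    except json.JSONDecodeError:
        return 'unknown'
```

<p>Anything other than a returned "running" then goes on the problem list, matching the logic described above.</p>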
<h3 id="heading-formatting-the-markdown-alert-and-sending-email-via-ses"><strong>Formatting the Markdown Alert and Sending Email via SES</strong></h3>
<p>Once the Lambda has checked all instances, it will compile a list of any instances that <strong>need attention</strong> (i.e., where the CloudWatch Agent is missing or stopped). The alert email will be composed in <strong>Markdown format</strong> for clarity. For example, the message body might look like:</p>
<ul>
<li><p><strong>i-0123456789abcdef (WebServer1)</strong> – CloudWatch Agent is <strong>STOPPED</strong> (not running)</p>
</li>
<li><p><strong>i-0fedcba9876543210 (DatabaseServer)</strong> – CloudWatch Agent is <strong>NOT INSTALLED</strong></p>
</li>
</ul>
<p>Each bullet highlights the instance (by ID, plus its Name tag if we fetch it via the EC2 API for friendlier output) and the issue. We use bold text and other Markdown features to make it easy to read in the email. In our Python code, we might assemble this as a single string of newline-separated “- ” list items.</p>
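<p>As a concrete sketch of that assembly, here is one way to look up an instance’s Name tag and render each bullet (the helper names and exact wording are ours; the bullet format mirrors the example above):</p>

```python
def get_name_tag(ec2, instance_id):
    """Return the instance's Name tag via DescribeInstances, or '' if absent."""
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    for reservation in resp['Reservations']:
        for inst in reservation['Instances']:
            for tag in inst.get('Tags', []):
                if tag['Key'] == 'Name':
                    return tag['Value']
    return ''

def format_problem_line(instance_id, name_tag, agent_status):
    """Render one Markdown bullet for an instance that needs attention."""
    label = f"{instance_id} ({name_tag})" if name_tag else instance_id
    if agent_status == 'stopped':
        issue = 'CloudWatch Agent is **STOPPED** (not running)'
    else:
        issue = 'CloudWatch Agent is **NOT INSTALLED** or not reporting'
    return f"- **{label}** – {issue}"
```

<p>Joining these strings with newlines produces the problem list we email out.</p>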
<p>Finally, the Lambda uses <strong>Amazon SES</strong> to send out the email. We can use the ses.send_email API, specifying the <strong>From address</strong> (which must be a verified SES identity) and the <strong>To address</strong>(es) for the recipients. We put our markdown-formatted message in the email body. Typically, we send it as a simple text email (many email clients won’t render Markdown, but the formatting ensures it’s still human-readable). Optionally, we could convert the Markdown to HTML and send an HTML email for nicer formatting, but that adds complexity – sending it as plain text Markdown is straightforward and effective.</p>
<p>For example:</p>
<pre><code class="lang-python">ses = boto3.client(<span class="hljs-string">'ses'</span>)
email_body = <span class="hljs-string">"## CloudWatch Agent Alert\nThe following instances have issues:\n"</span> + <span class="hljs-string">"\n"</span>.join(problem_lines)
ses.send_email(
    Source=ALERT_FROM_ADDRESS,
    Destination={<span class="hljs-string">'ToAddresses'</span>: [ALERT_TO_ADDRESS]},
    Message={
        <span class="hljs-string">'Subject'</span>: {<span class="hljs-string">'Data'</span>: <span class="hljs-string">'⚠️ AWS CloudWatch Agent Alert'</span>},
        <span class="hljs-string">'Body'</span>: {<span class="hljs-string">'Text'</span>: {<span class="hljs-string">'Data'</span>: email_body}}
    }
)
</code></pre>
<p>In the above snippet, problem_lines is a list of strings like the bullet points shown earlier. We included a warning emoji in the subject for visibility, and used a Markdown header “## CloudWatch Agent Alert” in the body as a title. You can customize the content as you see fit (include timestamps, agent versions, suggestions to reinstall, etc.).</p>
<p><strong>Note:</strong> Before the Lambda can actually send emails, you’ll need to <strong>verify the sender email (or domain) in SES</strong> and possibly the recipient as well (if your SES is in sandbox mode). We’ll touch on that in the deployment steps, but it’s an important prerequisite to avoid email delivery issues.</p>
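<p>Verification itself can be scripted with boto3 as a one-off setup step; a sketch that triggers SES’s confirmation email only when the address is not yet verified (the helper name is ours, and the mailbox owner still has to click the link AWS sends):</p>

```python
def ensure_sender_verified(ses, address):
    """Kick off SES email-identity verification unless already verified."""
    attrs = ses.get_identity_verification_attributes(Identities=[address])
    status = (attrs['VerificationAttributes']
              .get(address, {})
              .get('VerificationStatus'))
    if status == 'Success':
        return 'already verified'
    ses.verify_email_identity(EmailAddress=address)
    return 'verification email sent'
```

<p>Running this once against your sender address avoids the most common cause of silently undelivered alerts.</p>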
<p>With the Lambda function logic explained, let’s move on to deploying this setup using AWS CDK for a clean, infrastructure-as-code deployment.</p>
<h2 id="heading-step-by-step-deployment-with-aws-cdk-typescript"><strong>Step-by-Step Deployment with AWS CDK (TypeScript)</strong></h2>
<p>We will use the AWS Cloud Development Kit (CDK) in TypeScript to deploy the Lambda function, its scheduling, and the necessary permissions. This allows us to define the entire stack in code and easily repeat it in different environments. Below are the main steps:</p>
<ol>
<li><p><strong>Define the Lambda Function and Code:</strong> In your CDK application, create a Lambda function resource. For example, use new lambda.Function(...) in your Stack, specifying the runtime (Python 3.x), handler (the entry point in your Python code), and code (pointing to the directory or file with your Lambda code). Include any necessary <strong>environment variables</strong> for the function. Common env vars might be the ALERT_EMAIL_TO (recipient address) and ALERT_EMAIL_FROM (sender address), and perhaps a filter tag for instances if you want that configurable. For instance:</p>
<pre><code class="lang-typescript"> <span class="hljs-keyword">const</span> monitorFn = <span class="hljs-keyword">new</span> lambda.Function(<span class="hljs-built_in">this</span>, <span class="hljs-string">'AgentMonitorFunction'</span>, {
   runtime: lambda.Runtime.PYTHON_3_9,
   handler: <span class="hljs-string">'index.handler'</span>,
   code: lambda.Code.fromAsset(path.join(__dirname, <span class="hljs-string">'../lambda'</span>)), <span class="hljs-comment">// your code directory</span>
   environment: {
     ALERT_EMAIL_TO: <span class="hljs-string">'ops-team@example.com'</span>,
     ALERT_EMAIL_FROM: <span class="hljs-string">'no-reply@mycompany.com'</span>
     <span class="hljs-comment">// ... any other configuration</span>
   }
 });
</code></pre>
<p> Make sure the ALERT_EMAIL_FROM is an address or domain verified in SES. (You can verify emails via the SES console or CLI; CDK won’t auto-verify it for you.)  </p>
</li>
<li><p><strong>Assign IAM Permissions to the Lambda:</strong> The function needs permissions to use SSM, EC2 (optional), and SES. You can attach these permissions by adding IAM policy statements or using managed policies:</p>
<ul>
<li><p><em>SSM Permissions:</em> Allow actions like ssm:DescribeInstanceInformation, ssm:SendCommand, and ssm:GetCommandInvocation. You can scope the SendCommand permission to the specific SSM document ARN for <strong>AmazonCloudWatch-ManageAgent</strong> if you like, or use a broader permission (for simplicity, many will just allow ssm:* on resources *, but least privilege is recommended). These let the Lambda list instances and execute the status commands.</p>
</li>
<li><p><em>EC2 Permissions:</em> If your code looks up EC2 instance tags (e.g., to get the Name tag for friendlier alerts), allow ec2:DescribeInstances (or ec2:DescribeTags). This is optional but useful for enriching alert info.</p>
</li>
<li><p><em>SES Permissions:</em> Allow ses:SendEmail (or ses:SendRawEmail) on your SES identity. You can scope it to the Resource of your SES identity ARN. This permission enables the Lambda to actually send the email.</p>
<p>  In CDK, you can attach a policy like:</p>
<pre><code class="lang-typescript">  monitorFn.addToRolePolicy(<span class="hljs-keyword">new</span> iam.PolicyStatement({
    actions: [
      <span class="hljs-string">"ssm:DescribeInstanceInformation"</span>,
      <span class="hljs-string">"ssm:SendCommand"</span>,
      <span class="hljs-string">"ssm:GetCommandInvocation"</span>,
      <span class="hljs-string">"ec2:DescribeInstances"</span>,
      <span class="hljs-string">"ses:SendEmail"</span>
    ],
    resources: [<span class="hljs-string">"*"</span>]
  }));
</code></pre>
<p>  Here we grant access to the necessary actions across all resources for brevity. In a production environment, tighten the resources scope if possible (for example, restrict SES to your specific identity ARN, and SSM to the target instance ARNs or the document name). The Lambda’s execution role now has the needed powers to do its job.  </p>
</li>
</ul>
</li>
<li><p><strong>Schedule the Lambda with EventBridge (CloudWatch Events):</strong> We want the Lambda to run periodically (for example, once a day or every hour, depending on how quickly you want to catch issues). In CDK, create an EventBridge Rule to trigger the Lambda on a schedule. For example:</p>
<pre><code class="lang-typescript"> <span class="hljs-keyword">const</span> rule = <span class="hljs-keyword">new</span> events.Rule(<span class="hljs-built_in">this</span>, <span class="hljs-string">'ScheduleRule'</span>, {
   schedule: events.Schedule.cron({ minute: <span class="hljs-string">'0'</span>, hour: <span class="hljs-string">'*/6'</span> })  <span class="hljs-comment">// every 6 hours, for instance</span>
 });
 rule.addTarget(<span class="hljs-keyword">new</span> targets.LambdaFunction(monitorFn));
</code></pre>
<p> This will invoke our monitorFn Lambda on the defined schedule (here it’s every 6 hours; you can adjust cron or use Schedule.rate(Duration.days(1)) for daily, etc.). CDK will handle the permissions so EventBridge can invoke the Lambda. By scheduling it, we ensure the CloudWatch Agent check runs regularly without human intervention.</p>
</li>
<li><p><strong>Deploy and Verify:</strong> Synthesize and deploy the CDK stack (cdk deploy). Once deployed, check the AWS Console:</p>
<ul>
<li><p>Verify that the Lambda function is created, and the environment variables are set correctly.</p>
</li>
<li><p>Verify that the EventBridge rule is in place and targeting the Lambda.</p>
</li>
<li><p>In the <strong>SES Console</strong>, make sure the ALERT_EMAIL_FROM address (or its domain) is verified (you should have done this before deployment or you can do it now). If you are in the SES <strong>sandbox</strong>, also verify the ALERT_EMAIL_TO recipient or move out of sandbox to send to arbitrary emails.</p>
</li>
<li><p>You can run a quick test by manually invoking the Lambda (e.g., via the Lambda console or CLI) to see if it sends an email. Check your inbox (including spam) for the alert message. It might say that all instances are fine (if none were stopped) or list any issues it found.</p>
</li>
</ul>
</li>
<li><p><strong>Operational Considerations:</strong> After deployment, your <strong>automated monitoring</strong> is in place. Going forward, whenever the CloudWatch Agent is not running on an instance, the Lambda will fire an email alert to your team. Ensure that your team knows how to respond (e.g., reinstall or start the agent on the affected instance). You might also consider integrating the alert with a ticketing system or an SNS topic (instead of direct SES emails) if that suits your operations better. The solution is highly customizable – for example, you could extend the Lambda to automatically attempt to restart the agent by running the SSM document with action: restart when it detects an issue, in addition to sending an alert.</p>
</li>
</ol>
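<p>The auto-remediation idea from step 5 boils down to one more Run Command call before (or instead of) alerting; a sketch using the restart action the step describes (the helper name is ours, and whether to auto-restart at all is a policy choice for your team):</p>

```python
def try_restart_agent(ssm, instance_id):
    """Ask the AmazonCloudWatch-ManageAgent document to restart the agent,
    returning the CommandId so a later status check can confirm recovery."""
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName='AmazonCloudWatch-ManageAgent',
        Parameters={'action': ['restart'], 'mode': ['ec2']},
    )
    return resp['Command']['CommandId']
```

<p>A sensible pattern is to attempt the restart, re-check status a minute later, and only email if the agent is still down.</p>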
<h2 id="heading-conclusion">Conclusion</h2>
<p>By implementing this automated monitoring, you gain peace of mind that your <strong>CloudWatch Agents are continuously monitored</strong> just like the rest of your infrastructure. The Lambda + SSM approach effectively asks each instance “Hey, is your CloudWatch Agent OK?” on a schedule, and immediately notifies you if the answer is no. This proactive alerting brings real-world benefits: you’ll catch missing or crashed agents early, <strong>before</strong> they lead to missing metrics or logs during a crucial moment. In practice, this means more reliable monitoring, faster troubleshooting, and a more robust AWS environment.</p>
<p>In summary, we covered why CloudWatch Agent is important for EC2 monitoring and why you should alert on any gaps. We then built a Lambda function that checks agent status using SSM (leveraging the same AWS-recommended commands you could run manually) and sends out markdown-styled email reports. Finally, we deployed the whole stack using AWS CDK, making it easy for DevOps and platform engineers to set up in their own accounts.</p>
<p><strong>Real-world motivation:</strong> Think of this as a watchdog for your watchdog. It’s a simple investment of time that can pay off big by ensuring your monitoring infrastructure remains healthy. No one wants to discover during an outage that the reason you have no metrics is because the monitoring agent was down – with this solution, such surprises are a thing of the past. By implementing CloudWatch Agent alerting, you’re moving your ops culture toward one of <strong>preventative monitoring</strong> and greater reliability.</p>
]]></content:encoded></item><item><title><![CDATA[Alerting for AWS EC2 Instances Not Managed by SSM Using AWS Config and CDK]]></title><description><![CDATA[Monitoring your EC2 instances to ensure they’re managed by AWS Systems Manager (SSM) is crucial for security, compliance, and smooth operations. In this post, we’ll explain why SSM-managed instances are so important, how instances can become “unmanag...]]></description><link>https://thepawan.dev/alerting-for-aws-ec2-instances-not-managed-by-ssm-using-aws-config-and-cdk</link><guid isPermaLink="true">https://thepawan.dev/alerting-for-aws-ec2-instances-not-managed-by-ssm-using-aws-config-and-cdk</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws lambda]]></category><category><![CDATA[CDK]]></category><category><![CDATA[AWS Config]]></category><category><![CDATA[AWS Systems Manager]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Sat, 19 Apr 2025 11:31:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745061920892/980268fc-6804-416c-80ea-bbae445515ce.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Monitoring your EC2 instances to ensure they’re managed by AWS Systems Manager (SSM) is crucial for security, compliance, and smooth operations. In this post, we’ll explain why SSM-managed instances are so important, how instances can become “unmanaged,” and walk through a step-by-step guide to <strong>automatically alert you about any EC2 instance that isn’t managed by SSM</strong>. We’ll use a combination of AWS Config, Amazon EventBridge, AWS Lambda, and Amazon SES – all provisioned with the AWS CDK in TypeScript – to build an automated compliance alert system. By the end, you’ll have a CDK stack that detects non-compliant EC2 instances and emails you their details, helping you uphold DevSecOps best practices and maintain AWS environment hygiene.</p>
<h1 id="heading-why-ensure-ec2-instances-are-managed-by-ssm"><strong>Why Ensure EC2 Instances Are Managed by SSM</strong></h1>
<p>AWS Systems Manager (SSM) is a service that allows you to <strong>automate operational tasks, manage instances at scale, and perform detailed system-level monitoring</strong> for EC2 instances. When your instances are SSM-managed (often called “managed instances”), you unlock a host of benefits:</p>
<p>• <strong>Security &amp; Patch Management:</strong> SSM lets you automate patching and apply updates to your instances, keeping them secure. With SSM’s Patch Manager and Automation documents, you can ensure instances receive timely security patches and configuration updates. Instances not in SSM may miss critical updates, leaving vulnerabilities unaddressed.</p>
<p>• <strong>Compliance &amp; Auditing:</strong> Many compliance frameworks (e.g., NIST CSF) and internal policies require centralized management of servers. By utilizing AWS Systems Manager, you ensure EC2 instances are continuously monitored for security compliance, patch management, software inventory, and regulatory adherence. This helps maintain a secure and compliant environment by centralizing and streamlining EC2 management.</p>
<p>• <strong>Operational Efficiency:</strong> SSM provides tools like Run Command, Session Manager, and Inventory. These enable you to remotely execute commands, manage configurations, and track software inventory across all your instances. If an instance isn’t SSM-managed, you lose these capabilities for that instance, making troubleshooting and administration harder.</p>
<p>• <strong>Consistency:</strong> SSM-managed instances can be part of Automation workflows and State Manager associations to enforce desired states. Unmanaged instances might drift from your standard configurations.</p>
<p>In short, an EC2 instance managed by SSM (i.e., a “managed instance”) is <strong>any instance that has been configured for Systems Manager</strong>. Ensuring all instances are managed means you have consistent control and visibility, which is key to good DevSecOps practices.</p>
<h1 id="heading-how-ec2-instances-become-unmanaged"><strong>How EC2 Instances Become “Unmanaged”</strong></h1>
<p>Even in well-run environments, it’s possible for EC2 instances to fall out of SSM management and become “unmanaged.” Here are common scenarios:</p>
<p>• <strong>SSM Agent Not Running:</strong> The AWS SSM Agent is the software on the instance that communicates with Systems Manager. If this agent is not installed, has been accidentally uninstalled, or is stopped/disabled, the instance will not report to SSM. According to AWS Config’s managed rule documentation, an instance is flagged NON_COMPLIANT if the instance is running but the SSM Agent is stopped or terminated. In other words, a running EC2 with no active SSM agent is considered unmanaged.</p>
<p>• <strong>Missing IAM Role/Permissions:</strong> For an EC2 instance to be managed by SSM, it must have an IAM instance profile with the necessary permissions (typically the <strong>AmazonSSMManagedInstanceCore</strong> policy). If an instance was launched without this IAM role or the role’s policies were removed, the SSM agent cannot connect to the Systems Manager service. The instance will then drop out of SSM’s managed instances list.</p>
<p>• <strong>Network Misconfiguration:</strong> SSM Agent needs network access to communicate with AWS endpoints. If the instance has no internet access or no VPC endpoints for SSM, it may fail to connect. For example, an EC2 in a private subnet without the required VPC endpoints (or NAT) for SSM will be unable to register, effectively unmanaged.</p>
<p>• <strong>Outdated or Misconfigured Agent:</strong> In some cases, an outdated SSM agent might crash or behave unexpectedly, or the instance might be misconfigured (e.g., firewall blocking SSM traffic). This can cause previously managed instances to stop reporting.</p>
<p>• <strong>Instances Launched Outside Standard Processes:</strong> You might have automation to auto-register instances with SSM, but if someone launches an EC2 instance outside of those processes (for instance, in a new account or region without SSM setup), that instance might come up unmanaged.</p>
<p>It’s important to catch these situations. Unmanaged instances won’t receive your automated patches or configurations, and they won’t show up in SSM Inventory or Compliance views. This could lead to security gaps or operational blind spots. Our goal is to set up an automated alert whenever an instance becomes unmanaged so that you can remediate it (for example, reinstall the agent, fix the IAM role, or shut down the instance if it’s not approved).</p>
<h1 id="heading-how-can-we-detect-unmanaged-instances-automatically"><strong>How can we detect unmanaged instances automatically?</strong></h1>
<p>AWS offers a managed AWS Config rule exactly for this: <strong>“EC2 instances managed by Systems Manager.”</strong> This Config rule (EC2_INSTANCE_MANAGED_BY_SSM) evaluates all EC2 instances and checks if each is managed by SSM. The rule is marked <strong>NON_COMPLIANT if an EC2 instance is running and the SSM Agent is stopped or not communicating</strong>. We will use this rule as the trigger for our alerting system.</p>
<h1 id="heading-solution-overview-and-architecture"><strong>Solution Overview and Architecture</strong></h1>
<p>To automate alerts for unmanaged EC2 instances, we’ll use the following AWS services working in tandem:</p>
<p>• <strong>AWS Config Managed Rule:</strong> We’ll enable the <em>ec2-instance-managed-by-systems-manager</em> rule, which flags any running EC2 that isn’t SSM-managed as non-compliant.</p>
<p>• <strong>Amazon EventBridge (CloudWatch Events):</strong> We’ll create an EventBridge rule to listen for compliance change events from AWS Config. Specifically, it will catch events when the above Config rule evaluates an instance as NON_COMPLIANT. By filtering the event, we ensure we react only to instances becoming non-compliant (unmanaged).</p>
<p>• <strong>AWS Lambda Function:</strong> The EventBridge rule will trigger a Lambda function. This function will gather details about the offending EC2 instance (like its instance ID and Name tag) and send out an alert email.</p>
<p>• <strong>Amazon Simple Email Service (SES):</strong> Our Lambda will use Amazon SES to send the actual email notification to specified recipients. We’ll use SES because it’s reliable and designed for such use cases. (SNS could be an alternative for simple text notifications, but SES allows a direct email with customizable content.)</p>
<p>• <strong>AWS CDK (TypeScript):</strong> We’ll implement all the infrastructure above using the AWS CDK, making it easy to deploy repeatably. This includes the Config rule, EventBridge rule, Lambda function code, and necessary permissions and SES setup.</p>
<p>Below is a conceptual diagram of the architecture:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745061087213/fb74388c-ec99-4b1e-ba65-04069688dde7.png" alt="" class="image--center mx-auto" /></p>
<p><em>Event-driven architecture: AWS Config (via EventBridge) triggers a Lambda Function when an instance becomes non-compliant, and the Lambda sends an email using Amazon SES.</em></p>
<p>In this flow, AWS Config continuously evaluates instances against the SSM management rule. If an instance fails (becomes unmanaged), EventBridge immediately triggers the Lambda. The Lambda logs the event (CloudWatch Logs) and sends an email via SES. Using CDK ensures this entire setup is defined as code, version-controlled, and can be deployed in multiple environments – a big win for DevSecOps practices.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>DevSecOps Tip:</strong> Automating compliance checks and alerts is a best practice to maintain cloud hygiene. As AWS itself notes, <em>having automation to manage server configuration and compliance helps companies save time, improve availability, and lower the risks associated with security</em>. Instead of manual audits, this solution provides continuous monitoring and instant notification, so you can address issues proactively.</div>
</div>

<p>Now, let’s dive into the step-by-step implementation.</p>
<h2 id="heading-prerequisites"><strong>Prerequisites</strong></h2>
<p>Before we start building, make sure you have the following:</p>
<p>• <strong>AWS Account &amp; CLI Access:</strong> You’ll need an AWS account with permissions to create the resources (Config rules, Lambda, SES, etc.). Configure AWS CLI or AWS CDK with appropriate credentials (e.g., Admin or a role with necessary rights).</p>
<p>• <strong>AWS CDK Toolkit:</strong> Install the AWS CDK if you haven’t already. For example, via npm: npm install -g aws-cdk. This post uses CDK v2 with TypeScript.</p>
<p>• <strong>AWS Config Enabled:</strong> Ensure AWS Config is enabled in the region you’re deploying to. You should have a Configuration Recorder and Delivery Channel set up (with an S3 bucket for config data). If not, you can enable AWS Config in the console by choosing a bucket/role. (CDK can create these too, but it’s outside our main scope.)</p>
<p>• <strong>SES Email Verification:</strong> Decide which email address (or domain) will be used as the sender for alerts, and which address will receive them. In SES, verify the sender email (and the recipient as well, if your SES account is still in sandbox mode): <em>in sandbox, SES requires you to verify both the sender and recipient email addresses to ensure a secure and controlled email-sending environment</em>. We’ll assume an email (e.g., <a target="_blank" href="mailto:alerts@example.com">alerts@example.com</a>) is verified to send, and you have access to an inbox to receive the notifications.</p>
<p>With that out of the way, let’s start building our solution with CDK!</p>
<h3 id="heading-step-1-initialize-a-new-aws-cdk-project"><strong>Step 1: Initialize a New AWS CDK Project</strong></h3>
<p>First, set up a new CDK project for our infrastructure:</p>
<p>1. <strong>Create a project directory</strong> (e.g., aws-ssm-alert) and initialize a CDK TypeScript app:</p>
<pre><code class="lang-bash">mkdir aws-ssm-alert &amp;&amp; <span class="hljs-built_in">cd</span> aws-ssm-alert
cdk init app --language typescript
</code></pre>
<p>This will generate a baseline CDK project with TypeScript. You’ll see files like package.json, cdk.json, a /bin folder with the app’s entry point, and a /lib folder with a stack class.</p>
<p>2. <strong>Explore the project structure:</strong> After init, your project should look like:</p>
<pre><code class="lang-plaintext">aws-ssm-alert/
├── bin/
│   └── aws-ssm-alert.ts        # CDK app entry point
├── lib/
│   └── aws-ssm-alert-stack.ts  # Your main stack definition
├── package.json
├── cdk.json
├── tsconfig.json
└── etc...
</code></pre>
<p>The CDK app in bin/aws-ssm-alert.ts instantiates the stack defined in lib/aws-ssm-alert-stack.ts. We will be writing our infrastructure code in the stack class.</p>
<p>3. <strong>Install dependencies:</strong> The template already includes aws-cdk-lib and constructs in package.json. Run npm install if it wasn’t run automatically. Also, ensure the CDK libraries are up to date (you can use npm install aws-cdk-lib@latest if needed).</p>
<p>4. <strong>Bootstrap (if necessary):</strong> If you haven’t used CDK in this AWS account/region before, run cdk bootstrap to set up the necessary CDK resources (this creates a CDK toolkit stack with an S3 bucket for assets, etc.). This only needs to be done once per account/region.</p>
<p>With the CDK project ready, we can start defining our AWS resources in code.</p>
<h3 id="heading-step-2-define-the-aws-config-rule-ssm-managed-instances"><strong>Step 2: Define the AWS Config Rule (SSM Managed Instances)</strong></h3>
<p>The core of our alerting system is the AWS Config rule that checks for SSM-managed instances. AWS provides a managed rule for this, so we don’t have to write custom logic – we just need to enable it in our account via CDK.</p>
<p>Open the stack file (lib/aws-ssm-alert-stack.ts) and add the following code to create the AWS Config rule:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> config <span class="hljs-keyword">from</span> <span class="hljs-string">'aws-cdk-lib/aws-config'</span>;
<span class="hljs-comment">// ... other imports ...</span>

<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> AwsSsmAlertStack <span class="hljs-keyword">extends</span> Stack {
  <span class="hljs-keyword">constructor</span>(<span class="hljs-params">scope: Construct, id: <span class="hljs-built_in">string</span>, props?: StackProps</span>) {
    <span class="hljs-built_in">super</span>(scope, id, props);

    <span class="hljs-comment">// AWS Config rule to check if EC2 instances are managed by SSM</span>
    <span class="hljs-keyword">const</span> ssmManagedRule = <span class="hljs-keyword">new</span> config.ManagedRule(<span class="hljs-built_in">this</span>, <span class="hljs-string">'SSMManagedInstancesRule'</span>, {
      identifier: config.ManagedRuleIdentifiers.EC2_INSTANCE_MANAGED_BY_SSM,
      configRuleName: <span class="hljs-string">'ec2-instance-managed-by-systems-manager'</span>,
      description: <span class="hljs-string">'Checks if EC2 instances are managed by AWS Systems Manager (SSM).'</span>
    });
    <span class="hljs-comment">// ... (we will add more resources below) ...</span>
  }
}
</code></pre>
<p>A few notes on this code:</p>
<p>• <code>ManagedRuleIdentifiers.EC2_INSTANCE_MANAGED_BY_SSM</code> is a constant for the managed rule identifier. This corresponds to the AWS Config rule that <em>“checks whether the Amazon EC2 instances in your account are managed by AWS Systems Manager”</em>.</p>
<p>• We set configRuleName to <code>ec2-instance-managed-by-systems-manager</code>. This is the display name of the rule (the one you’ll see in the AWS Config console). AWS’s documentation notes that the rule identifier and name differ for this rule; using the documented name avoids confusion.</p>
<p>• The rule has no additional parameters – it will evaluate all EC2 instances in the account/region by default. The rule triggers on configuration changes (such as an instance starting or an agent status change).</p>
<p>When this rule finds a non-compliant instance (SSM agent off), it will report NON_COMPLIANT to AWS Config. However, just creating the rule doesn’t send any alerts by itself – that’s where the next components come in.</p>
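<p>If you want to inspect those NON_COMPLIANT results yourself (for debugging, or to batch-report on them), you can query Config directly. The helper below extracts the offending instance IDs from a result list whose structure mirrors what <code>config.get_compliance_details_by_config_rule(ConfigRuleName='ec2-instance-managed-by-systems-manager', ComplianceTypes=['NON_COMPLIANT'])</code> returns under <code>EvaluationResults</code> — the sample data is invented:</p>

```python
# Sketch: pull NON_COMPLIANT instance IDs out of a Config rule's
# evaluation results. `results` follows the shape of the
# 'EvaluationResults' list returned by boto3's
# config.get_compliance_details_by_config_rule(); the sample
# entry below is invented for illustration.

def noncompliant_instance_ids(results):
    ids = []
    for r in results:
        if r.get("ComplianceType") == "NON_COMPLIANT":
            qualifier = r["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
            ids.append(qualifier["ResourceId"])
    return ids

sample = [
    {
        "ComplianceType": "NON_COMPLIANT",
        "EvaluationResultIdentifier": {
            "EvaluationResultQualifier": {
                "ConfigRuleName": "ec2-instance-managed-by-systems-manager",
                "ResourceType": "AWS::EC2::Instance",
                "ResourceId": "i-0123456789abcdef0",
            }
        },
    }
]
print(noncompliant_instance_ids(sample))  # ['i-0123456789abcdef0']
```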
<blockquote>
<p><strong>Important:</strong> Make sure AWS Config is active. If AWS Config is not yet enabled in this region, deploy a Configuration Recorder and Delivery Channel (or enable via console) <strong>before</strong> deploying this rule. Without an active recorder, the rule won’t run.</p>
</blockquote>
<h3 id="heading-step-3-create-the-lambda-function-to-send-alerts"><strong>Step 3: Create the Lambda Function to Send Alerts</strong></h3>
<p>Next, we’ll create a Lambda function that will be triggered on non-compliance events and send an email alert via SES. We’ll implement the function in Python for simplicity.</p>
<p><strong>Lambda Function Code (Python):</strong> Create a new directory (e.g., lambda) in the project, and inside it create a file ssm_alert_function.py with the following code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> boto3

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">lambda_handler</span>(<span class="hljs-params">event, context</span>):</span>
    <span class="hljs-comment"># Extract relevant details from the AWS Config event</span>
    detail = event.get(<span class="hljs-string">'detail'</span>, {})
    instance_id = detail.get(<span class="hljs-string">'resourceId'</span>)
    region = detail.get(<span class="hljs-string">'awsRegion'</span>)
    account = detail.get(<span class="hljs-string">'awsAccountId'</span>)
    rule_name = detail.get(<span class="hljs-string">'configRuleName'</span>)
    compliance = detail.get(<span class="hljs-string">'newEvaluationResult'</span>, {}).get(<span class="hljs-string">'complianceType'</span>)

    <span class="hljs-comment"># Default values if data is missing</span>
    instance_name = <span class="hljs-string">"Unknown"</span>

    <span class="hljs-comment"># Get the EC2 instance's Name tag (if available) for a human-friendly identifier</span>
    <span class="hljs-keyword">if</span> instance_id <span class="hljs-keyword">and</span> region:
        ec2 = boto3.client(<span class="hljs-string">'ec2'</span>, region_name=region)
        <span class="hljs-keyword">try</span>:
            resp = ec2.describe_instances(InstanceIds=[instance_id])
            reservations = resp.get(<span class="hljs-string">'Reservations'</span>, [])
            <span class="hljs-keyword">if</span> reservations:
                tags = reservations[<span class="hljs-number">0</span>][<span class="hljs-string">'Instances'</span>][<span class="hljs-number">0</span>].get(<span class="hljs-string">'Tags'</span>, [])
                <span class="hljs-keyword">for</span> tag <span class="hljs-keyword">in</span> tags:
                    <span class="hljs-keyword">if</span> tag[<span class="hljs-string">'Key'</span>] == <span class="hljs-string">'Name'</span>:
                        instance_name = tag[<span class="hljs-string">'Value'</span>]
                        <span class="hljs-keyword">break</span>
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error fetching instance name for <span class="hljs-subst">{instance_id}</span>: <span class="hljs-subst">{e}</span>"</span>)

    <span class="hljs-comment"># Compose the email subject and body</span>
    subject = <span class="hljs-string">f"Alert: EC2 instance <span class="hljs-subst">{instance_id}</span> is NOT managed by SSM"</span>
    body_text = (
        <span class="hljs-string">f"EC2 instance <span class="hljs-subst">{instance_id}</span> (Name: <span class="hljs-subst">{instance_name}</span>) in account <span class="hljs-subst">{account}</span>, region <span class="hljs-subst">{region}</span> "</span>
        <span class="hljs-string">f"is <span class="hljs-subst">{compliance}</span> with AWS Systems Manager compliance (rule: <span class="hljs-subst">{rule_name}</span>).\n"</span>
        <span class="hljs-string">"This means the instance is not managed by AWS Systems Manager (SSM). "</span>
        <span class="hljs-string">"Please check the instance's SSM agent and IAM role to restore management."</span>
    )

    <span class="hljs-comment"># Send the email via Amazon SES</span>
    ses = boto3.client(<span class="hljs-string">'ses'</span>, region_name=region)
    response = ses.send_email(
        Source=os.environ[<span class="hljs-string">'SENDER_EMAIL'</span>],
        Destination={<span class="hljs-string">'ToAddresses'</span>: [os.environ[<span class="hljs-string">'ALERT_EMAIL'</span>]]},
        Message={
            <span class="hljs-string">'Subject'</span>: {<span class="hljs-string">'Data'</span>: subject},
            <span class="hljs-string">'Body'</span>: {<span class="hljs-string">'Text'</span>: {<span class="hljs-string">'Data'</span>: body_text}}
        }
    )
    print(<span class="hljs-string">f"SES send_email response: <span class="hljs-subst">{response[<span class="hljs-string">'MessageId'</span>]}</span>"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"sent"</span>, <span class="hljs-string">"messageId"</span>: response[<span class="hljs-string">'MessageId'</span>]}
</code></pre>
<p>Let’s break down what this function does:</p>
<p>• It expects the event to be an AWS Config compliance change event. It parses out details like instance_id, AWS region, AWS account, the rule_name, and the compliance status. These are nested under event['detail'].</p>
<p>• It then uses the EC2 API (describe_instances) to get the instance’s tags, searching for the <strong>Name tag</strong>. This is optional but makes the alert more informative (e.g., “Instance i-0123456789 (Name: <strong>WebServer1</strong>) is not managed by SSM”). If anything goes wrong or the tag isn’t found, we default to “Unknown”.</p>
<p>• Next, it composes an email message. We use environment variables SENDER_EMAIL and ALERT_EMAIL (we will set these in the CDK) to avoid hard-coding addresses. The email includes the instance ID, name, account, region, and a brief note that it’s not managed by SSM.</p>
<p>• Finally, it uses <strong>boto3</strong> (AWS SDK for Python) to call ses.send_email() with the given subject and body. The result includes a MessageId which we log for reference.</p>
<p>This is a straightforward function – essentially, gather info and send an email. Using SES through boto3 in Lambda is a common pattern (just ensure your region supports SES and the emails are verified).</p>
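<p>To sanity-check the parsing logic without deploying anything, you can feed the handler’s extraction code a hand-built event. The field names below match what the Lambda reads from a “Config Rules Compliance Change” event; the concrete values (instance ID, account, region) are invented:</p>

```python
# Sample of the event shape the handler receives from EventBridge.
# Field names match what the Lambda reads; values are invented.
sample_event = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "resourceId": "i-0123456789abcdef0",
        "awsRegion": "us-east-1",
        "awsAccountId": "111122223333",
        "configRuleName": "ec2-instance-managed-by-systems-manager",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

# The same extraction logic the handler uses:
detail = sample_event.get("detail", {})
print(detail.get("resourceId"))                                     # i-0123456789abcdef0
print(detail.get("newEvaluationResult", {}).get("complianceType"))  # NON_COMPLIANT
```

<p>Passing an event like this to <code>lambda_handler</code> in a local test (with the EC2 and SES calls stubbed out) is a quick way to catch typos in the field paths before deploying.</p>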
<p>Now we need to integrate this Lambda function into our CDK stack:</p>
<p>In the CDK stack code (aws-ssm-alert-stack.ts), add the Lambda function resource:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> lambda <span class="hljs-keyword">from</span> <span class="hljs-string">'aws-cdk-lib/aws-lambda'</span>;
<span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> path <span class="hljs-keyword">from</span> <span class="hljs-string">'path'</span>;
<span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> iam <span class="hljs-keyword">from</span> <span class="hljs-string">'aws-cdk-lib/aws-iam'</span>;
<span class="hljs-comment">// ... (make sure to import these at top)</span>

  <span class="hljs-comment">// Within the AwsSsmAlertStack constructor, after defining ssmManagedRule:</span>

  <span class="hljs-comment">// Lambda function to handle non-compliance events and send email alerts</span>
  <span class="hljs-keyword">const</span> alertFunction = <span class="hljs-keyword">new</span> lambda.Function(<span class="hljs-built_in">this</span>, <span class="hljs-string">'SSMAlertFunction'</span>, {
    runtime: lambda.Runtime.PYTHON_3_9,
    handler: <span class="hljs-string">'ssm_alert_function.lambda_handler'</span>,
    code: lambda.Code.fromAsset(path.join(__dirname, <span class="hljs-string">'../lambda'</span>)),  <span class="hljs-comment">// assumes code is in lambda/ directory</span>
    environment: {
      <span class="hljs-string">'SENDER_EMAIL'</span>: <span class="hljs-string">'alerts@example.com'</span>,      <span class="hljs-comment">// replace with your verified sender</span>
      <span class="hljs-string">'ALERT_EMAIL'</span>: <span class="hljs-string">'admin@example.com'</span>         <span class="hljs-comment">// replace with your recipient email</span>
    }
  });

  <span class="hljs-comment">// Grant permissions to the Lambda to describe EC2 instances and send SES email</span>
  alertFunction.addToRolePolicy(<span class="hljs-keyword">new</span> iam.PolicyStatement({
    actions: [<span class="hljs-string">'ec2:DescribeInstances'</span>],
    resources: [<span class="hljs-string">'*'</span>]  <span class="hljs-comment">// describing is read-only; cannot easily restrict by resource ID</span>
  }));
  alertFunction.addToRolePolicy(<span class="hljs-keyword">new</span> iam.PolicyStatement({
    actions: [<span class="hljs-string">'ses:SendEmail'</span>, <span class="hljs-string">'ses:SendRawEmail'</span>],
    resources: [<span class="hljs-string">'*'</span>]  <span class="hljs-comment">// allow sending email via SES from any identity (we will restrict via verified identity in SES itself)</span>
  }));
</code></pre>
<p>Key points about this CDK code:</p>
<p>• We package the Lambda code using lambda.Code.fromAsset. This will zip the contents of the lambda directory (which contains our ssm_alert_function.py) and deploy it. Make sure the path is correct relative to your CDK project structure.</p>
<p>• We set the runtime to Python 3.9 (any currently supported Python runtime works) and the handler to ssm_alert_function.lambda_handler (file name and function name).</p>
<p>• In environment, plug in the verified SES sender and the desired recipient. <strong>Use your actual emails</strong> here: for instance, if you verified <a target="_blank" href="mailto:ops-team@mycompany.com">ops-team@mycompany.com</a> in SES, set SENDER_EMAIL to that. For ALERT_EMAIL, you can use a distribution list or your email (and ensure it’s verified if in sandbox).</p>
<p>• We then add IAM permissions for the Lambda:</p>
<p>• ec2:DescribeInstances so the function can look up instance tags. (We scope to all resources; you could theoretically scope to the specific instance ARN if the event provided it, but since any instance ID could come through, we allow all. This action is read-only.)</p>
<p>• ses:SendEmail and ses:SendRawEmail so the function can send emails via SES. We allow all resources (*) which means it can use any verified identity in this account. For tighter security, you could specify the ARN of the SES identity (your verified email/domain), but that can vary by region. Simplicity is fine here, given that only our function has this role.</p>
<p>By default, the Lambda gets a basic execution role that allows CloudWatch Logs, so it can write logs (the print statements) to CloudWatch.</p>
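<p>For reference, the two <code>addToRolePolicy</code> calls correspond roughly to the inline policy JSON below. This is a hand-written approximation to show the intent, not CDK’s exact synthesized output:</p>

```python
# Rough JSON equivalent of the two addToRolePolicy statements above
# (an approximation for illustration, not CDK's exact output).
lambda_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:DescribeInstances"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["ses:SendEmail", "ses:SendRawEmail"],
         "Resource": "*"},
    ],
}

# Flatten the allowed actions to double-check nothing extra slipped in:
allowed = {a for s in lambda_policy["Statement"] for a in s["Action"]}
print(sorted(allowed))
```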
<p>Now we have the Config rule and a Lambda function to handle alerts. Next, we connect them with EventBridge.</p>
<h3 id="heading-step-4-set-up-an-eventbridge-rule-to-trigger-the-lambda"><strong>Step 4: Set Up an EventBridge Rule to Trigger the Lambda</strong></h3>
<p>AWS Config emits events on state changes of rules. We’ll create an EventBridge rule (formerly CloudWatch Events rule) that listens for our Config rule becoming NON_COMPLIANT and invokes the Lambda function.</p>
<p>Add the following to the stack code:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> events <span class="hljs-keyword">from</span> <span class="hljs-string">'aws-cdk-lib/aws-events'</span>;
<span class="hljs-keyword">import</span> * <span class="hljs-keyword">as</span> targets <span class="hljs-keyword">from</span> <span class="hljs-string">'aws-cdk-lib/aws-events-targets'</span>;
<span class="hljs-comment">// ... (ensure these imports are present)</span>

  <span class="hljs-comment">// Within the constructor, after defining alertFunction:</span>

  <span class="hljs-comment">// EventBridge rule to catch Config rule non-compliance and trigger the Lambda</span>
  <span class="hljs-keyword">const</span> configNonComplianceRule = <span class="hljs-keyword">new</span> events.Rule(<span class="hljs-built_in">this</span>, <span class="hljs-string">'ConfigNonComplianceRule'</span>, {
    description: <span class="hljs-string">'Triggers on EC2 SSM Managed Instances rule NON_COMPLIANT evaluations'</span>,
    eventPattern: {
      source: [<span class="hljs-string">'aws.config'</span>],
      detailType: [<span class="hljs-string">'Config Rules Compliance Change'</span>],
      detail: {
        <span class="hljs-comment">// Only target our specific Config rule by name:</span>
        configRuleName: [ <span class="hljs-string">'ec2-instance-managed-by-systems-manager'</span> ],
        <span class="hljs-comment">// Only trigger when the new evaluation is NON_COMPLIANT</span>
        newEvaluationResult: {
          complianceType: [ <span class="hljs-string">'NON_COMPLIANT'</span> ]
        }
      }
    }
  });
  <span class="hljs-comment">// Set the Lambda as the target for the EventBridge rule</span>
  configNonComplianceRule.addTarget(<span class="hljs-keyword">new</span> targets.LambdaFunction(alertFunction));
</code></pre>
<p>Let’s explain the event pattern:</p>
<p>• <strong>source</strong>: We filter events from aws.config only.</p>
<p>• <strong>detailType</strong>: We specifically look at “Config Rules Compliance Change” events. AWS Config generates this event type when a resource’s compliance status changes for a Config rule.</p>
<p>• <strong>detail.configRuleName</strong>: We restrict the rule name to our rule (“ec2-instance-managed-by-systems-manager”). This ensures we don’t catch compliance changes from other, unrelated Config rules you might have. (If you want one Lambda to handle multiple rules, you could list multiple names here.)</p>
<p>• <strong>detail.newEvaluationResult.complianceType</strong>: We only care about transitions <em>to</em> NON_COMPLIANT. This way, when an instance becomes non-compliant (unmanaged), we trigger the Lambda. We ignore transitions to COMPLIANT or other states. By filtering on the complianceType, we handle only the “went non-compliant” direction.</p>
<p>With this pattern, whenever AWS Config flags an instance as not managed by SSM, an event will match and our alertFunction will be invoked. The event data (as we saw in the Lambda code) includes the instance ID, etc.</p>
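<p>You can reason about this filter locally with a simplified model. The function below mimics the pattern’s exact-match semantics (real EventBridge patterns support additional operators such as prefix and anything-but matching, which this sketch omits); the events are hand-built:</p>

```python
# Simplified local model of the EventBridge pattern above.
# Exact-match filtering only; real EventBridge supports more operators.
def matches(event):
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.config"
        and event.get("detail-type") == "Config Rules Compliance Change"
        and detail.get("configRuleName") == "ec2-instance-managed-by-systems-manager"
        and detail.get("newEvaluationResult", {}).get("complianceType") == "NON_COMPLIANT"
    )

went_bad = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "ec2-instance-managed-by-systems-manager",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}
# The same instance recovering (COMPLIANT) should NOT trigger an alert:
recovered = {**went_bad,
             "detail": {**went_bad["detail"],
                        "newEvaluationResult": {"complianceType": "COMPLIANT"}}}
print(matches(went_bad), matches(recovered))  # True False
```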
<p>We’ve now defined all the main components in our CDK stack: the Config rule, the EventBridge rule, and the Lambda with SES permissions. The last piece is making sure SES is configured to allow our emails to be sent.</p>
<h3 id="heading-step-5-configure-amazon-ses-for-email-sending"><strong>Step 5: Configure Amazon SES for Email Sending</strong></h3>
<p>Amazon SES (Simple Email Service) will actually deliver our alert emails. We need to ensure SES is set up properly:</p>
<p>• <strong>Verify the Sender Email:</strong> In the AWS SES console, verify the email address you intend to send from (the one we set as SENDER_EMAIL in the Lambda environment). SES will send a verification link to that email; once you confirm it, the email is registered as a verified identity in SES. If you have a domain verified in SES, you could use an address on that domain as well.</p>
<p>• <strong>Verify the Recipient Email (if needed):</strong> If your SES is in the <em>sandbox</em> (the default for new accounts), you must also verify any recipient email address. Verify the address that you set as ALERT_EMAIL. (In production SES – after you request a sending limit increase – you can send to any address. But initially, sandbox mode restrictions apply.)</p>
<p>• <strong>Region Consideration:</strong> Note which AWS region you verified the identities in. SES identities are regional. Make sure the Lambda uses the same region for the SES client. In our code, we initialize the boto3 SES client with the awsRegion taken from the Config event (which should be your deployment region). If your SES is in a different region, you might want to explicitly set the region or adjust accordingly. It’s simplest to deploy everything in one region.</p>
<p>If you haven’t used SES before, the AWS documentation provides guidance on verifying identities. The key point is that SES must know you own the sender (and for sandbox, the receiver). After verification, our Lambda’s ses.send_email call will be able to send the message.</p>
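<p>If you want to confirm verification status programmatically (say, as a pre-deploy check), the helper below interprets a response shaped like what <code>ses.get_identity_verification_attributes(Identities=[...])</code> returns — the sample response here is invented:</p>

```python
# Sketch: check whether an address shows as verified, given a response
# shaped like ses.get_identity_verification_attributes(Identities=[...]).
# The sample response below is invented for illustration.
def is_verified(response, email):
    attrs = response.get("VerificationAttributes", {}).get(email, {})
    return attrs.get("VerificationStatus") == "Success"

sample_response = {
    "VerificationAttributes": {
        "alerts@example.com": {"VerificationStatus": "Success"},
        "admin@example.com": {"VerificationStatus": "Pending"},
    }
}
print(is_verified(sample_response, "alerts@example.com"))  # True
print(is_verified(sample_response, "admin@example.com"))   # False
```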
<blockquote>
<p>📝 <strong>Tip:</strong> You can test SES sending manually by using the “Send Test Email” feature in the SES console with your verified identities, just to ensure email delivery works, before relying on the Lambda.</p>
</blockquote>
<p>Now our infrastructure and configurations are all set. Let’s deploy and see it in action.</p>
<h3 id="heading-step-6-deploy-the-cdk-stack-and-test-the-setup"><strong>Step 6: Deploy the CDK Stack and Test the Setup</strong></h3>
<p>With our CDK code ready, we can deploy it to AWS:</p>
<p>1. <strong>Synthesize the CloudFormation template:</strong> Run cdk synth in your project directory. This will compile your TypeScript and output the CloudFormation template. Check that there are no errors and the resources (Config rule, Lambda, EventBridge rule, IAM policies) are present in the synthesized output.</p>
<p>2. <strong>Deploy to AWS:</strong> Run cdk deploy. You’ll be asked to confirm IAM changes (since we’re creating roles and a Config rule that involves AWS Config service). Type “y” to approve. CDK will then create the stack in your AWS account.</p>
<p>• The deployment will output progress. It should create the Config rule, Lambda function (and an S3 asset upload for the code), the EventBridge rule, etc.</p>
<p>• Once complete, note any output or confirmations. There might not be custom outputs, but you can verify in the AWS console that the resources are now live:</p>
<p>• In AWS Config console, under <strong>Rules</strong>, you should see <strong>ec2-instance-managed-by-systems-manager</strong> with a green check if all instances are compliant (or a red exclamation if not).</p>
<p>• In the Lambda console, the function <strong>SSMAlertFunction</strong> should exist with the environment variables set.</p>
<p>• In EventBridge (or CloudWatch Events) console, a rule <strong>ConfigNonComplianceRule</strong> should be present targeting the Lambda.</p>
<p>• In IAM, the Lambda’s role should have the SES and EC2 read permissions we defined.</p>
<p>3. <strong>Test the functionality:</strong> Now the fun part – validate that the alert works. There are a couple of ways to test:</p>
<p>• <strong>Immediate Compliance Check:</strong> As soon as the Config rule is active, it will evaluate your instances. If you <em>already have an unmanaged EC2 instance</em>, the rule should flag it. AWS Config might periodically re-evaluate or evaluate on state change. To force a quick evaluation, you can go to the AWS Config console, select the rule, and trigger a re-evaluation (or stop/start an instance to generate a config change).</p>
<p>• <strong>Simulate a Non-Compliant Instance:</strong> If all your instances are compliant, you can simulate a violation. For example, take a test EC2 instance and try to break its SSM management:</p>
<p>• If it’s Windows or Linux with SSM agent running, you could stop the SSM agent service on that instance (for a quick test). On Amazon Linux, sudo systemctl stop amazon-ssm-agent will do it. Within a few minutes, AWS Config should detect the agent is not responding.</p>
<p>• Alternatively, remove the IAM instance profile from the instance (or create a new instance without an IAM role for SSM). The agent will lose access and should drop off.</p>
<p>• <strong>Wait for AWS Config Evaluation:</strong> AWS Config will mark the instance as NON_COMPLIANT on the next evaluation cycle or config change trigger. When that happens, EventBridge should catch it and invoke the Lambda. This is usually quite fast (near real-time on config change).</p>
<p>• <strong>Check for Email:</strong> Monitor the recipient inbox (and possibly spam folder, just in case) for the alert email. The email subject should start with “Alert: EC2 instance i-XXXX is NOT managed by SSM”. The body will list the instance ID and name, account, region, and advise to check the agent/role.</p>
<p>• <strong>Verify CloudWatch Logs:</strong> If you don’t see an email, go to CloudWatch Logs for the Lambda (/aws/lambda/SSMAlertFunction) to troubleshoot. You should see logs of the event and any errors. Common issues could be SES not sending (check that the email is verified and in correct region) or permissions.</p>
<p>If everything is set up correctly, you should receive the notification email shortly after an instance becomes unmanaged. Success! 🎉 You’ve got an automated watcher on your EC2 instances’ SSM status.</p>
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>By following this guide, you have implemented an automated alert system for unmanaged EC2 instances using AWS Systems Manager and AWS CDK. This solution ensures that if any EC2 in your environment loses SSM management (whether due to agent issues or misconfiguration), you’ll promptly get an email alert with the details. The combination of AWS Config and EventBridge provides a powerful event-driven way to monitor compliance in near real-time, and using Lambda+SES gives a flexible notification mechanism.</p>
<p>This setup isn’t just about notifications – it’s about <strong>embracing DevSecOps principles</strong>. We’ve automated a compliance check and integrated it into your operations workflow, reducing the need for manual monitoring. Such automation helps maintain infrastructure hygiene and frees engineers to focus on proactive improvements rather than reactively putting out fires. <em>As noted earlier, automation of compliance checks saves time and lowers risk</em>, which is exactly what we achieved here.</p>
<p>You can extend this approach further by adding automatic remediation (for example, triggering a Systems Manager Automation document to reinstall the agent, posting to Slack, or creating a ServiceNow ticket). AWS Config Rules even support automatic remediation actions via SSM documents, which could be a next step.</p>
<p>In production environments, it’s highly recommended to adopt this kind of guardrail. It provides quick visibility into potential misconfigurations (like someone launching an instance without proper IAM role or an agent crash going unnoticed). With AWS CDK, you also have this infrastructure as code, making it easy to version control and deploy to multiple accounts or regions as needed.</p>
<p><strong>In summary</strong>, by setting up AWS alerts for unmanaged EC2 instances, you bolster your cloud environment’s security and compliance posture. We encourage you to integrate this solution (and others like it for different compliance rules) into your environments. It’s a small investment of time that pays off with continuous, hands-off monitoring and peace of mind knowing no EC2 instance will silently drift out of management without you knowing. Happy automating and stay secure!</p>
]]></content:encoded></item><item><title><![CDATA[Goodbye Bastion, Hello Zero-Trust: Our Journey to Simplified RDS Access]]></title><description><![CDATA[Connecting to a private AWS database shouldn’t feel like hacking through a jungle of jump boxes and VPNs. In our team’s early days, though, that was our reality. This post is a candid look at how we improved the developer experience and security of a...]]></description><link>https://thepawan.dev/goodbye-bastion-hello-zero-trust-our-journey-to-simplified-rds-access</link><guid isPermaLink="true">https://thepawan.dev/goodbye-bastion-hello-zero-trust-our-journey-to-simplified-rds-access</guid><category><![CDATA[AWS]]></category><category><![CDATA[AWS RDS]]></category><category><![CDATA[AWS SSM]]></category><category><![CDATA[Databases]]></category><category><![CDATA[DevSecOps]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Wed, 09 Apr 2025 11:04:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744196503145/0f71ad25-2aa0-4252-8b9f-edcfa09fb7d1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Connecting to a private AWS database shouldn’t feel like hacking through a jungle of jump boxes and VPNs. In our team’s early days, though, that was our reality. This post is a candid look at how we improved the developer experience and security of accessing Amazon RDS databases – moving from an old-school Windows bastion (jump box) to AWS’s shiny new Verified Access, and finally landing on a surprisingly simple solution with AWS Systems Manager Session Manager. We’ll cover what worked, what didn’t, and how you can set up a smooth, secure database access workflow no matter your experience level.</p>
<h2 id="heading-background-the-old-bastion-setup-rdp-into-rds"><strong>Background: The Old Bastion Setup (RDP into RDS)</strong></h2>
<p>Not long ago, our developers accessed private RDS databases by <strong>RDP-ing into a Windows “bastion” host</strong> in AWS. This bastion was an EC2 instance in a public subnet acting as a jump box. Team members would Remote Desktop into it, then use database GUI tools (like SQL Server Management Studio or pgAdmin) <strong>installed on that bastion</strong> to connect to the actual RDS instances in private subnets. It was the traditional solution to avoid exposing databases directly, but it came with plenty of headaches:</p>
<ul>
<li><p><strong>Clunky User Experience:</strong> Engineers couldn’t use their own machines or preferred tools directly. They had to operate via a remote Windows desktop, often suffering lag and limited clipboard sharing. It felt like working through a periscope rather than directly on your workstation.</p>
</li>
<li><p><strong>Security Risks:</strong> The bastion needed RDP port 3389 open to inbound traffic (albeit restricted by IP). This inherently increased risk: if the security group was misconfigured or an RDP exploit surfaced, our private DB network could be exposed. With more remote work, the chances of someone poking a hole in the firewall for convenience only grew.</p>
</li>
<li><p><strong>Maintenance Burden:</strong> A Windows server requires constant care – OS patching, user account management, and even handling RDP license limits if multiple people use it. We had to keep the DB client software up-to-date on the bastion too. All this ops overhead for a box that didn’t do any “real” work, except letting us in.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744194817214/ca73fa4c-a6ba-4c02-b7fa-e8bcbf246748.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Traditional approach using a bastion host (an EC2 jump box in a public subnet) to reach a private Amazon RDS database. Developers’ traffic goes from the corporate network (or internet) to the bastion, then onward to the database. This requires opening RDP/SSH access to the bastion, which introduces management overhead and potential security exposure.</em></p>
<p>It was clear this setup didn’t scale well for our growing team. We wanted a way to <strong>connect to RDS directly from our laptops</strong>, without that clumsy remote hop – but still keeping the databases locked down from the internet. VPN was one option, but managing a full-blown VPN client and infrastructure felt heavy. In late 2024, AWS announced something that caught our attention as a possible answer.</p>
<h2 id="heading-trying-aws-verified-access-for-direct-database-connectivity"><strong>Trying AWS Verified Access for Direct Database Connectivity</strong></h2>
<p>When AWS released <strong>Verified Access (AVA)</strong>, it sounded like a game-changer. AWS Verified Access is a service built on zero-trust principles that lets users connect securely to internal applications without a VPN. Initially it was only for web (HTTP) apps, but as of <a target="_blank" href="https://aws.amazon.com/blogs/networking-and-content-delivery/aws-verified-access-support-for-non-http-resources-is-now-generally-available/">re:Invent 2024</a>, it expanded to support non-HTTP endpoints – including RDS databases. The promise was <strong>VPN-less, policy-controlled access</strong> to private resources, with fine-grained checks on each connection (user identity, device security posture, etc.). For our use case, the appeal was huge:</p>
<ul>
<li><p>Engineers could run their favorite database GUI <em>directly on their laptop</em> and connect to the RDS endpoint as if they were in the office network. No more RDP hop – <strong>better user experience</strong> and productivity.</p>
</li>
<li><p>Security would actually <em>improve</em>: Verified Access would evaluate every login attempt against security policies (who you are, whether your device is trusted, etc.), and only then broker a connection. It’s based on “never trust, always verify” principles, meaning even if someone somehow got credentials, if they weren’t on an approved device or didn’t meet policy, access would be denied.</p>
</li>
<li><p>We could eliminate the exposed bastion entirely. Verified Access acts as a managed gatekeeper in AWS’s cloud, so <strong>no need for an open port</strong> in our VPC for RDP or SSH.</p>
</li>
</ul>
<p>Setting up AWS Verified Access for our databases involved a few pieces. First, we needed to integrate it with our SSO identity provider (AWS IAM Identity Center in our case) as a “trust provider”. This let Verified Access confirm our engineers’ identities via SSO login. Next, we created a Verified Access instance and defined an <strong>endpoint for our RDS</strong>. AWS now allows an RDS instance (or cluster or proxy) to be a target for Verified Access. We then set up an access policy – in our test, we kept it simple: allow members of our engineering SSO group who passed MFA. Verified Access can get very granular (checking device OS, patch level, etc.), but we started basic just to get it working.</p>
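The moving pieces above can be outlined with the AWS CLI. We actually clicked through the console, so treat this as a rough sketch rather than a recipe: every ID is a placeholder, and the attach-trust-provider and Cedar-policy steps are omitted. By default the script only echoes the commands instead of calling AWS:

```shell
# Outline of the Verified Access setup; all IDs and names are placeholders.
# AWS=echo (the default) prints the commands instead of calling AWS.
AWS="${AWS:-echo}"

# 1. Trust provider: IAM Identity Center verifies who the user is.
$AWS aws ec2 create-verified-access-trust-provider \
  --trust-provider-type user \
  --user-trust-provider-type iam-identity-center \
  --policy-reference-name idc

# 2. A Verified Access instance ties trust providers and endpoints together.
$AWS aws ec2 create-verified-access-instance --description "rds-access"

# 3. A group carries the Cedar access policy shared by its endpoints.
$AWS aws ec2 create-verified-access-group \
  --verified-access-instance-id vai-0123456789abcdef0

# 4. An endpoint of type "rds" exposes the database to verified users.
$AWS aws ec2 create-verified-access-endpoint \
  --verified-access-group-id vagr-0123456789abcdef0 \
  --endpoint-type rds \
  --attachment-type vpc
```

Run with `AWS=` (empty) and real IDs to execute against your account; the real flow also needs the trust provider attached to the instance and a Cedar policy on the group.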
<p>One critical component was deploying the <strong>AWS Verified Access client</strong> (also called the <em>Connectivity Client</em>) on our laptops. This is a small app that runs on the user’s machine to facilitate the connection. It <strong>encrypts and funnels traffic from the laptop to AWS Verified Access</strong>, including attaching the user’s identity and device info, so that AWS can decide if the traffic is allowed. In essence, it’s like a smart VPN client but application-specific and ephemeral. We installed the client, and it prompted us to log in via our SSO in a browser. Once authenticated, the client established a secure tunnel to AWS.</p>
<p>From a user standpoint, after launching the Verified Access client and logging in, they could open their database tool (say, <strong>DBeaver or DataGrip</strong>), and connect to the database’s endpoint (we used the regular RDS hostname) on the default port. The Verified Access client transparently routed that connection through AWS to our VPC. It really felt like magic the first time – my pgAdmin on my MacBook connected to a Postgres in a private subnet without any SSH tunnels or VPN, and with AWS handling the security behind the scenes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744194906600/0a8770c7-536f-42ca-8d9d-366774085c95.jpeg" alt class="image--center mx-auto" /></p>
<p><em>Figure: AWS Verified Access.</em></p>
<p><strong>Initial benefits we observed:</strong></p>
<ul>
<li><p><strong>Night-and-day UX improvement:</strong> Everyone could use their own IDE/GUI, with native performance. Running queries or browsing tables was as snappy as if on a local network.</p>
</li>
<li><p><strong>No more shared jump box:</strong> Each engineer authenticated individually via SSO. There was no single chokepoint server to maintain or that could be compromised to gain broader access – Verified Access only let <strong>that one user’s session</strong> through, and only to the specific database endpoint we configured.</p>
</li>
<li><p><strong>Auditing and control:</strong> Verified Access logs every access request. We could enforce multi-factor auth and even device compliance (e.g., only allow up-to-date company laptops). It’s true zero-trust: every new connection is verified against policies rather than implicitly trusted once on a VPN.</p>
</li>
</ul>
<h3 id="heading-the-downsides-of-verified-access-in-practice"><strong>The Downsides of Verified Access in Practice</strong></h3>
<p>This pilot with AWS Verified Access was promising, but as we dug deeper and scaled it out, we hit some challenges that made us reconsider relying on it long-term:</p>
<ul>
<li><p><strong>Client Software Limitations:</strong> Since it was a new service, the <strong>Verified Access connectivity client</strong> had a few rough edges. It was only available for Windows and Mac at first – our one engineer on Linux was out of luck. (AWS hinted Linux support was coming, but it wasn’t there yet.) Additionally, the client lacked a friendly GUI; we had to configure it by dropping a JSON config file onto the machine (no simple one-click setup). This was manageable for our tech-savvy team, but not exactly polished.</p>
</li>
<li><p><strong>Complexity of Policies:</strong> Verified Access policies are written in Cedar, AWS’s policy language. It’s powerful but introduced a learning curve. Simple policies were fine, but anything custom required understanding a new syntax and debugging in a new console. For a small team, this felt like overkill just to allow database access for devs.</p>
</li>
<li><p><strong>Cost Concerns:</strong> Perhaps the biggest factor – <strong>cost</strong>. AWS Verified Access is a managed service you pay for per application endpoint and per hour. In our case, each private RDS we wanted to enable access to counted as an application endpoint. The pricing in our region came out to about <strong>$0.27 per hour per app</strong> plus a small per-GB data charge. That means roughly $200 per month <em>for each database</em>. In a dev/test/prod scenario with multiple databases, we were looking at several hundred dollars monthly just for this convenience. Compared to a simple EC2 bastion (which might be ~$50 or less per month), it was an order of magnitude more expensive. As a startup, that was hard to justify beyond initial testing.</p>
</li>
<li><p><strong>Operational Maturity:</strong> Being a very new service, we encountered a few hiccups – occasional client disconnects and once an identity sync issue that blocked a login until we reset the client. AWS support was helpful, but it reminded us that we were early adopters on the bleeding edge. We had to ask: did we want to be pioneers here, or use something more battle-tested?</p>
</li>
</ul>
<p>Weighing these downsides, we decided to explore alternatives. We loved the idea of ditching the bastion and having direct access, but maybe there was a simpler way to get there without the cost and complexity of Verified Access. It turned out, the solution was something we already had at our fingertips in AWS.</p>
<h2 id="heading-switching-gears-to-aws-ssm-session-manager"><strong>Switching Gears to AWS SSM Session Manager</strong></h2>
<p>After our trial with Verified Access, we took a step back and reexamined the problem. We wanted <strong>secure, easy access to private RDS from our laptops</strong>, and we wanted to minimize infrastructure and maintenance. AWS actually provides a feature for secure remote access that we had used before for shell access: <strong>AWS Systems Manager Session Manager</strong> (SSM Session Manager). Could we use it for database access? The answer was yes – and it was surprisingly straightforward.</p>
<p>AWS Session Manager lets you open a shell or tunnel to an EC2 instance <strong>without any SSH keys or open ports</strong>, by using an SSM Agent installed on the instance. What many don’t realize is that Session Manager can also handle <strong>port forwarding</strong>. <a target="_blank" href="https://aws.amazon.com/blogs/database/securely-connect-to-an-amazon-rds-or-amazon-ec2-database-instance-remotely-with-your-preferred-gui/">In late 2022</a>, AWS added the ability to forward traffic not just to the instance itself, but <em>through</em> the instance to another host – essentially an SSH tunnel-like capability, but over the SSM channel. This is perfect for our use case: we can use a lightweight EC2 instance as a private relay to the database, and Session Manager will securely connect our laptop to that instance and pipe the traffic to the RDS.</p>
<p>Here’s how we built our Session Manager solution, step by step, and how it addressed our needs:</p>
<h3 id="heading-1-setting-up-a-small-ec2-tunnel-instance"><strong>1. Setting Up a Small EC2 “Tunnel” Instance</strong></h3>
<p>First, we launched a tiny EC2 instance in the same VPC and private subnet as our RDS. (We jokingly call this our “bastion”, but it’s not accessible like a traditional one – no inbound access at all.) Important details for this instance:</p>
<ul>
<li><p><strong>Instance Type &amp; OS:</strong> We chose an Amazon Linux 2 t4g.nano (very cheap, ~$4/month). Amazon Linux comes with the SSM Agent pre-installed, which saved setup time.</p>
</li>
<li><p><strong>SSM IAM Role:</strong> We attached the <strong>AmazonSSMManagedInstanceCore</strong> IAM policy via an instance role. This grants the instance permission to communicate with the SSM service. With this, the SSM Agent on the instance can register itself and receive Session Manager connection requests. (No SSH keys needed at all – authentication will be handled by IAM and SSM.)</p>
</li>
<li><p><strong>Security Groups:</strong> The instance’s security group was locked down. We did not allow any inbound ports from anywhere (not even SSH from our IP). We only allowed outbound traffic. Specifically, outbound rules allowed HTTPS (port 443) so the agent could reach SSM’s endpoints, and allowed outbound to the RDS’s port. The RDS’s security group in turn allowed inbound from this instance’s security group on the database port. This way, the EC2 can talk to the database internally, but nothing external can talk to the EC2.</p>
</li>
<li><p><strong>Networking:</strong> Instead of a NAT gateway, we let the instance reach AWS Systems Manager through <strong>VPC interface endpoints</strong> (for the ssm, ssmmessages, and ec2messages services). This is an optional step, but it keeps the SSM Agent traffic on a private path to AWS, which is more secure and avoids NAT data charges. (If you skip the VPC endpoints, the agent will reach the Systems Manager API through a NAT gateway instead, which works fine but costs a bit more and traverses the internet.)</p>
</li>
</ul>
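The four bullets above can be sketched as a short CLI sequence. Every name and ID below is a placeholder (we built ours through infrastructure as code), and the trust-policy file is assumed to exist; by default the script only echoes the commands:

```shell
# Sketch of the tunnel-instance setup; names, IDs, and the trust-policy
# file are placeholders. AWS=echo (the default) prints instead of calling AWS.
AWS="${AWS:-echo}"

# IAM role the SSM Agent uses to register the instance with Systems Manager.
$AWS aws iam create-role --role-name ssm-tunnel-role \
  --assume-role-policy-document file://ec2-trust-policy.json
$AWS aws iam attach-role-policy --role-name ssm-tunnel-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
$AWS aws iam create-instance-profile --instance-profile-name ssm-tunnel-profile
$AWS aws iam add-role-to-instance-profile \
  --instance-profile-name ssm-tunnel-profile --role-name ssm-tunnel-role

# Tiny Amazon Linux instance in the private subnet. The security group has
# no inbound rules at all; outbound only needs 443 (SSM) and the DB port.
$AWS aws ec2 run-instances \
  --image-id ami-placeholder \
  --instance-type t4g.nano \
  --subnet-id subnet-placeholder \
  --security-group-ids sg-placeholder \
  --iam-instance-profile Name=ssm-tunnel-profile
```

Once the instance boots, it should appear under Fleet Manager in the Systems Manager console, which confirms the agent registered successfully.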
<p>At this point, we had an <strong>SSM-managed instance</strong> in the private subnet. Think of it as a potential one-to-one replacement of the old bastion – except it’s not exposed to the world at all. Now we needed to actually <em>use</em> it to reach the database from our laptops.</p>
<h3 id="heading-2-starting-a-session-manager-port-forward"><strong>2. Starting a Session Manager Port Forward</strong></h3>
<p>AWS provides a CLI command to open a Session Manager session. Instead of a normal shell session, we will start a <strong>port forwarding session</strong>. Here’s an example command we use (in a Bash script on our laptops) to connect to one of our PostgreSQL databases:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Variables for clarity</span>
INSTANCE_ID=<span class="hljs-string">"i-0123456789abcdef0"</span>   <span class="hljs-comment"># The EC2 instance acting as our SSM tunnel</span>
RDS_ENDPOINT=<span class="hljs-string">"mydatabase.cluster-abcdefghijkl.us-east-1.rds.amazonaws.com"</span>
DB_PORT=5432

aws ssm start-session \
  --target <span class="hljs-string">"<span class="hljs-variable">$INSTANCE_ID</span>"</span> \
  --document-name <span class="hljs-string">"AWS-StartPortForwardingSessionToRemoteHost"</span> \
  --parameters <span class="hljs-string">"host=<span class="hljs-variable">$RDS_ENDPOINT</span>,portNumber=<span class="hljs-variable">$DB_PORT</span>,localPortNumber=<span class="hljs-variable">$DB_PORT</span>"</span>
</code></pre>
<p>Let’s break down what this does:</p>
<ul>
<li><p>aws ssm start-session: This initiates an SSM Session Manager session from our machine. (Make sure you’ve configured your AWS CLI with credentials/SSO that have permission to use Session Manager on that instance.)</p>
</li>
<li><p>--target: The ID of the EC2 instance we launched. This tells AWS which instance’s SSM Agent should handle the session.</p>
</li>
<li><p>--document-name "AWS-StartPortForwardingSessionToRemoteHost": This is an AWS-provided session document that knows how to set up port forwarding to a specified remote host. It’s essentially a pre-built SSM action for tunneling.</p>
</li>
<li><p>--parameters "host=...,portNumber=...,localPortNumber=...": Here we provide the RDS host and port we want to reach, and which local port to use on our laptop. In our example, we set host to the RDS endpoint DNS name, portNumber to 5432 (the DB’s port), and localPortNumber also to 5432. This means <strong>the SSM Agent on the EC2 will open a connection to mydatabase...:5432 (our RDS)</strong>, and forward that back through the session to <a target="_blank" href="http://localhost:5432"><strong>localhost:5432</strong></a> <strong>on our laptop</strong>.</p>
</li>
</ul>
<p>When we run this command, a few things happen behind the scenes:</p>
<ul>
<li><p>The AWS CLI calls the SSM service, which in turn signals the SSM Agent on our instance to start a port forwarding session. Because our instance can reach the RDS internally, it successfully connects to the database’s host and port.</p>
</li>
<li><p>The CLI also starts a local proxy listening on the specified localPortNumber (5432). You’ll see output like “Starting session with SessionId …” and “Port 5432 opened for session … Waiting for connections…”. This means everything is set – the tunnel is up and idle, waiting for you to connect.</p>
</li>
<li><p>We keep that terminal running (the session stays active). Now on our <strong>local machine</strong>, we can connect to <a target="_blank" href="http://localhost:5432">localhost:5432</a> and it will actually reach the RDS through the tunnel.</p>
</li>
</ul>
<p>At this point, the experience is exactly like using Verified Access (or a VPN). I can fire up my database client on my laptop, but now I point it to 127.0.0.1:5432 (or a <a target="_blank" href="http://localhost">localhost</a> alias), with the usual database credentials. <strong>Boom – I’m connected to the private RDS</strong>. The Session Manager tunnel carries all the traffic. From the database’s perspective, it sees a connection coming from the EC2 instance’s IP (since that instance is acting as the client on its behalf). From my perspective, it feels local.</p>
<p>One great aspect of Session Manager is that all of this is done using my AWS IAM credentials. If I’m authenticated with AWS (for example via AWS SSO login or access keys), I don’t need to juggle any SSH keys or bastion passwords. Permissions to use Session Manager can be controlled via IAM policies (for instance, only allowing certain IAM roles to start sessions to that instance). And every session is logged in AWS CloudTrail (Session Manager can even be configured to log full console output to S3/CloudWatch if needed). So we <strong>gained auditability</strong> without much effort – an improvement over the old bastion, where RDP logins were somewhat opaque.</p>
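As a sketch of how such an IAM policy might scope sessions down, the snippet below allows port-forwarding sessions only to our tunnel instance, and only via the AWS-provided port-forwarding document (region, account ID, and instance ID are placeholders; the session ARN condition is the standard Session Manager pattern):

```shell
# Hypothetical IAM policy limiting ssm:StartSession to one instance and one
# document. All ARNs use placeholder region/account/instance values.
cat > /tmp/ssm-tunnel-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ssm:StartSession",
      "Resource": [
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0",
        "arn:aws:ssm:us-east-1::document/AWS-StartPortForwardingSessionToRemoteHost"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["ssm:TerminateSession", "ssm:ResumeSession"],
      "Resource": "arn:aws:ssm:*:*:session/${aws:username}-*"
    }
  ]
}
EOF
```

Attached to a developer role, this lets a user open and close their own tunnel sessions but nothing else – no shell sessions to other instances, no other documents.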
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744195290039/c92498f5-cf13-4dea-b892-c0659c4c0a3b.png" alt class="image--center mx-auto" /></p>
<p><em>Figure: Using AWS Systems Manager Session Manager to create a secure tunnel from a client to an RDS database via a private EC2 instance. The EC2 “bastion” lives in a private subnet with</em> <strong><em>no inbound ports</em></strong> <em>open. The Session Manager agent on it connects out to AWS, allowing authorized users to start an encrypted session. This lets us forward a local port on our laptop to the remote database securely.</em></p>
<p><strong>Cost impact:</strong> Remember the cost comparison that motivated us? Here’s how it played out:</p>
<ul>
<li><p>The Session Manager approach requires a small EC2 instance running 24/7. Our t4g.nano plus storage costs about <strong>$5 per month</strong>. We could even stop it out of hours, but at that price it’s not worth the hassle.</p>
</li>
<li><p>Session Manager itself doesn’t cost extra; it’s a feature of AWS Systems Manager. There is no hourly charge for sessions, and data transfer is minimal (just the database traffic which we’d have anyway; it might incur tiny charges if it goes through a NAT or VPC endpoint, but those are pennies).</p>
</li>
<li><p>Versus Verified Access, which would have been around <strong>$0.27/hour</strong> each for our databases (≈$200/month per DB), the savings are enormous. Even factoring in the old Windows bastion cost (say ~$50/month), Session Manager is an order of magnitude cheaper. Essentially, we got nearly the same functionality for <strong>almost no cost</strong> in our AWS bill.</p>
</li>
</ul>
<h3 id="heading-3-smoothing-the-workflow-making-it-easy-for-engineers"><strong>3. Smoothing the Workflow (Making it Easy for Engineers)</strong></h3>
<p>Running a long CLI command to start the tunnel was fine for us, but we wanted to make this as seamless as possible – especially for new engineers who might not be AWS CLI wizards. We took a couple of steps to streamline usage on our laptops:</p>
<ul>
<li><p><strong>Bash Script &amp; Alias:</strong> We wrapped the aws ssm start-session command in a simple shell script (<a target="_blank" href="http://connect-db.sh">connect-db.sh</a>) and put it in our team’s internal toolkit repository. It accepts the environment or database name as an argument, so it knows which instance and host to target. For example: <a target="_blank" href="http://connect-db.sh">connect-db.sh</a> prod reporting-db would fetch the appropriate instance ID and DB host from a config and run the above command. Developers can alias this in their shell, so bringing up the tunnel is one short command away. Each script execution opens a new terminal window with the session (so we remember to close it when done).</p>
</li>
<li><p><strong>Auto-Connect on macOS (Launch Agent):</strong> For those frequently connecting to a dev database, we created a Launch Agent on macOS to <strong>automatically start the tunnel at login</strong>. This uses a .plist file in ~/Library/LaunchAgents. Here’s a snippet of what that looks like:</p>
</li>
</ul>
<pre><code class="lang-xml"><span class="hljs-comment">&lt;!-- ~/Library/LaunchAgents/com.mycompany.ssm-tunnel.plist --&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">plist</span> <span class="hljs-attr">version</span>=<span class="hljs-string">"1.0"</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>Label<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>com.mycompany.ssm-tunnel<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>ProgramArguments<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">array</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>/usr/local/bin/aws<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>ssm<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>start-session<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>--target<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>i-0123456789abcdef0<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>--document-name<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>AWS-StartPortForwardingSessionToRemoteHost<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>--parameters<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>host=mydatabase.cluster-abcdefghijkl.us-east-1.rds.amazonaws.com,portNumber=5432,localPortNumber=5432<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">array</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>RunAtLoad<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">true</span>/&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>KeepAlive<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">true</span>/&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>StandardOutPath<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>/tmp/ssm-tunnel.log<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>StandardErrorPath<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>/tmp/ssm-tunnel.err<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">plist</span>&gt;</span>
</code></pre>
<ul>
<li><p>In plain English, this Launch Agent definition says: when I log in, run the AWS CLI Session Manager command with the given parameters. RunAtLoad starts it automatically at login, and KeepAlive tells launchd to restart it if it crashes or the session drops. We log output to /tmp for debugging. After loading this (launchctl load -w ~/Library/LaunchAgents/com.mycompany.ssm-tunnel.plist), the developer gets a persistent tunnel in the background. They can now connect to the DB anytime without even thinking about the tunnel – it’s just there. (One caveat: Session Manager sessions have a maximum duration of a few hours, so the agent will quietly reconnect a few times a day in the background.)</p>
</li>
<li><p><strong>Using SSH Config (alternate method, which we used eventually):</strong> Another neat trick is to use the SSH client as a wrapper for Session Manager. This might sound odd since we said “no SSH”, but it’s just leveraging the SSH command as a convenient way to manage tunnels. By adding an entry in ~/.ssh/config that calls the Session Manager proxy, one can bring up a tunnel with a simple ssh invocation. For example:</p>
</li>
</ul>
<pre><code class="lang-apache">Host rds-tunnel
  HostName i-0123456789abcdef0
  User ec2-user
  ProxyCommand aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters "portNumber=%p"
  LocalForward 5432 mydatabase.cluster-abcdefghijkl.us-east-1.rds.amazonaws.com:5432
</code></pre>
<ul>
<li>With such an entry, running ssh -N rds-tunnel tunnels an SSH connection to the instance over Session Manager (the %h is replaced with the HostName – our instance ID – and %p with the SSH port), and the LocalForward line then carries localhost:5432 through that connection to the database. The -N flag tells SSH not to execute a remote command (since we aren’t actually going to log in; we just want the tunnel). Unlike the pure port-forwarding document, this variant does need an SSH key for ec2-user on the instance, and it still requires the AWS CLI – but some GUI tools can invoke SSH tunnels this way as well.</li>
</ul>
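For reference, the connect-db wrapper mentioned at the start of this section can be sketched roughly as below. The lookup table is a toy stand-in (our real script reads instance IDs and hostnames from a shared team config file, and all values here are placeholders):

```shell
# Illustrative sketch of the connect-db wrapper; instance IDs and hostnames
# are placeholders (the real script reads them from a team config file).
connect_db() {
  local env="$1" db="$2" instance_id host port
  case "$env/$db" in
    prod/reporting-db)
      instance_id="i-0123456789abcdef0"
      host="mydatabase.cluster-abcdefghijkl.us-east-1.rds.amazonaws.com"
      port=5432 ;;
    *)
      echo "unknown target: $env/$db" >&2
      return 1 ;;
  esac
  # Opens the Session Manager tunnel; stays in the foreground until closed.
  aws ssm start-session \
    --target "$instance_id" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "host=$host,portNumber=$port,localPortNumber=$port"
}

# Example invocation: connect_db prod reporting-db
```

Engineers alias this in their shell, so bringing up a tunnel is one short command; unknown environment/database pairs fail fast with an error instead of opening a session to the wrong place.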
<h3 id="heading-4-results-a-happy-team-with-secure-access"><strong>4. Results: A Happy Team with Secure Access</strong></h3>
<p>Once we rolled out the Session Manager solution, feedback from the team was very positive. It achieved what we wanted:</p>
<ul>
<li><p><strong>Greatly improved UX:</strong> Just like with Verified Access, engineers use their local tools and don’t have to maintain a remote VM workspace. Whether it’s a newbie using a point-and-click SQL client or a veteran automating a psql script, they run it from their machine as if the database were local. Onboarding a new engineer to access the DB is as simple as: “Install AWS CLI (or our helper script), run this command, and you’re in.”</p>
</li>
<li><p><strong>Tight Security (no more open holes):</strong> We completely shut down the old bastion. No RDP, no SSH – nothing is exposed. The EC2 instance is invisible from the internet. Session Manager uses an encrypted TLS connection initiated from the inside, and requires the user to auth with AWS. This <strong>removed a major attack surface</strong>. As AWS’s own best practices note, Session Manager eliminates the need for bastion hosts or open inbound ports. We also benefit from audit logs; we can see which user opened a session at what time in AWS CloudTrail, and even log the I/O if we wanted to inspect what commands are run (for shell sessions).</p>
</li>
<li><p><strong>Low Maintenance:</strong> The EC2 tunnel instance is about as low-touch as it gets. Security updates on Amazon Linux 2 are easy to apply, and we can periodically bake a fresh AMI with updates if we ever need to replace the instance. The SSM Agent updates itself automatically via AWS Systems Manager. There are no user accounts or keys on this instance to manage. In fact, the instance runs with no human login at all. If we want to administer it, we’d use Session Manager to get a shell. This dramatically reduces the admin overhead compared to the old Windows bastion that needed active user management and patching. And unlike Verified Access, there’s no separate client software for us to deploy to everyone – just the ubiquitous AWS CLI.</p>
</li>
<li><p><strong>Cost Savings:</strong> We already calculated the stark difference – on the order of $10/month vs $200-$600/month for our scenario. Over a year, that’s thousands saved, which matters for our budget. We’re effectively paying only for a tiny instance, since Session Manager itself carries no additional charge beyond our normal AWS usage. For larger orgs, the cost argument might be different, but for us this was a huge win.</p>
</li>
<li><p><strong>Room for Expansion:</strong> With this setup, if we add more databases or even other internal services (e.g., an ElastiCache Redis, or an internal HTTP service), we have options. We can either use the same EC2 as a multi-purpose tunnel (starting separate sessions for different targets as needed), or create more instances if we want isolation per environment. Since it’s so cheap, spinning up one per environment or per service is not an issue. Session Manager even allows tunneling RDP or SSH if we ever needed GUI or console access to an instance – it’s versatile.</p>
</li>
</ul>
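<p>To make the flow above concrete, here is roughly what opening a database tunnel looks like from an engineer’s machine. This is a sketch, not our exact setup: the instance ID and RDS endpoint below are placeholders, and it assumes the AWS CLI plus the Session Manager plugin are installed and you’re authenticated to AWS.</p>

```shell
# Forward localhost:5432 to the private RDS endpoint via the
# SSM-managed EC2 instance (no SSH keys, no open inbound ports).
# i-0123456789abcdef0 and the hostname are placeholders.
aws ssm start-session \
  --target i-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters '{"host":["mydb.cluster-example.eu-west-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["5432"]}'

# In another terminal, connect as if the database were local:
psql -h 127.0.0.1 -p 5432 -U app_user mydb
```

<p>Session Manager handles the authentication and encryption; the database never sees anything but a connection from the tunnel instance inside the VPC.</p>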
<h2 id="heading-conclusion-lessons-learned-and-tips"><strong>Conclusion: Lessons Learned and Tips</strong></h2>
<p>Our journey from a clunky bastion to a modern access solution taught us a few valuable lessons:</p>
<ul>
<li><p><strong>“New and shiny” isn’t always “better for us.”</strong> AWS Verified Access is a powerful service and no doubt the future for many zero-trust network scenarios. If we had strict device compliance requirements or a larger enterprise setup, its policy-based access and deep integration with corporate identity might be worth the cost. But in our case, the simpler Session Manager approach covered 90% of our needs at a fraction of the complexity and cost. It was a reminder that tried-and-true tools can sometimes beat bleeding-edge solutions, depending on the context.</p>
</li>
<li><p><strong>User experience matters, but balance it with security and cost.</strong> We were determined to improve UX for our engineers, and we did – moving away from the old jump box improved quality of life significantly. However, we had to also consider security (ensuring the new solution wasn’t trading one risk for another) and cost. We found a sweet spot where UX, security, and cost were all satisfied. Whenever you introduce a new access method, evaluate it holistically: how will users feel about it, is it actually secure, and does it justify the expense?</p>
</li>
<li><p><strong>AWS Session Manager is underrated.</strong> Many engineers know Session Manager as “that thing you can use instead of SSH to get a console”. But its port forwarding capability is a game-changer for scenarios like database access. It enabled us to implement a lightweight <strong>bastion-as-a-service</strong> without maintaining complex infrastructure. If you’re still using old bastion hosts or SSH tunnels, give Session Manager a serious look – it can simplify your life. As AWS’s security blog notes, Session Manager can eliminate the need for bastions and open ports while still giving you necessary access.</p>
</li>
<li><p><strong>Automation makes perfect.</strong> Once you set up a solution like this, invest a bit of time to script it and integrate it into your team’s workflows. Our use of Launch Agents and simple CLI wrappers means nobody is fumbling with long commands or forgetting to start their tunnel. New hires get a smooth experience from day one (“Just run this script and you’re connected”). Little quality-of-life improvements go a long way in adoption of a new tool.</p>
</li>
</ul>
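<p>On the automation point: a thin wrapper script is all it takes to turn the long port-forwarding invocation into one memorable command. The script below is a hypothetical sketch (the instance ID, endpoint, and script name are illustrative placeholders, and it assumes the AWS CLI and Session Manager plugin are installed):</p>

```shell
#!/usr/bin/env bash
# db-tunnel: hypothetical helper so engineers never type the full
# start-session command by hand. Placeholder IDs throughout.
set -euo pipefail

TARGET="i-0123456789abcdef0"                                # tunnel instance (placeholder)
DB_HOST="mydb.cluster-example.eu-west-1.rds.amazonaws.com"  # private endpoint (placeholder)
LOCAL_PORT="${1:-5432}"                                     # allow overriding the local port

exec aws ssm start-session \
  --target "$TARGET" \
  --document-name AWS-StartPortForwardingSessionToRemoteHost \
  --parameters "{\"host\":[\"$DB_HOST\"],\"portNumber\":[\"5432\"],\"localPortNumber\":[\"$LOCAL_PORT\"]}"
```

<p>Drop something like this in your team’s path (or behind a Launch Agent, as we did) and “start your tunnel” becomes a one-word habit rather than a copy-pasted incantation.</p>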
<p>In the end, our team now connects to our databases securely, quickly, and with minimal fuss. We retired the fragile old Windows jump box and significantly cut down our attack surface. We also saved money, which is always a nice bonus. And when AWS improves Verified Access (maybe a fully managed client, Linux support, lower costs?), we’ll be ready to re-evaluate it. But for now, <strong>Session Manager has become our go-to solution</strong> for remote access to cloud resources.</p>
<p>If you’re in a similar boat – juggling bastions, VPNs, or pondering AWS Verified Access – I hope our story helps you find the approach that works best for you. Sometimes the solution is hiding in plain sight (in our case, in the AWS CLI we were already using). Happy connecting, and may all your database queries be speedy and secure!</p>
]]></content:encoded></item><item><title><![CDATA[Why Security-Conscious Startups Need DevSecOps from Day One]]></title><description><![CDATA[I vividly remember a night early in my career when I jolted awake at 3 AM to a barrage of Slack notifications. Our company’s app was under attack – a database had been left exposed, and an attacker was starting to poke around our customer data. We we...]]></description><link>https://thepawan.dev/why-security-conscious-startups-need-devsecops-from-day-one</link><guid isPermaLink="true">https://thepawan.dev/why-security-conscious-startups-need-devsecops-from-day-one</guid><category><![CDATA[Landing Zone Accelerator]]></category><category><![CDATA[Devops]]></category><category><![CDATA[DevSecOps]]></category><category><![CDATA[AWS]]></category><category><![CDATA[#Amazon GuardDuty]]></category><category><![CDATA[aws-inspector]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Pawan Sawalani]]></dc:creator><pubDate>Wed, 26 Mar 2025 22:41:51 GMT</pubDate><content:encoded><![CDATA[<p>I vividly remember a night early in my career when I jolted awake at 3 AM to a barrage of Slack notifications. Our company’s app was under attack – a database had been left exposed, and an attacker was starting to poke around our customer data. We were lucky to catch it in time, but that sleepless night was a wake-up call for me. It drove home a lesson I carry to this day: if you’re a security-conscious startup (and you should be), you need to embed security practices from day one. In other words, you need DevSecOps at the core of your startup’s DNA from the very beginning.</p>
<p>My name is Pawan. As a Lead DevSecOps Engineer at Prommt with 8+ years in AWS, platform engineering, and cloud security, I’ve seen firsthand how proactive security can make or break a young company. In this post, I want to share why integrating DevSecOps early isn’t a “nice-to-have” but a must-have for startups — and how it’s helped me build safer, faster-moving teams. I’ll also sprinkle in some personal war stories (including how we’re shaping the DevSecOps workflow at Prommt) to illustrate the impact. So grab a coffee, and let’s dive in.</p>
<p><strong>The Startup Security Blind Spot</strong></p>
<p>Startups thrive on moving fast and innovating. “Move fast and break things,” right? The problem is, <em>what</em> breaks isn’t always just your code – it could be your security. In the rush to ship features and impress investors, it’s easy for a small team to push off security tasks. I’ve heard founders say, “We’ll worry about security after we get our first 100k users,” only to realize (sometimes painfully) that a single security mishap can <strong>stop</strong> them from ever getting to that milestone.</p>
<p>The truth is, even early-stage startups are juicy targets for cyber attackers. You might think, “We’re too small, who would target us?” But attackers often see startups as low-hanging fruit – less likely to have strong defenses, but still holding valuable data. A single breach can wreck your reputation and destroy user trust overnight. It might even scare off potential investors or partners. (As someone who’s had to field panicked investor calls after a security incident, I can tell you that nothing derails a promising pitch faster than news of a breach.)</p>
<p>So what’s the solution? It’s not to slow down or become risk-averse – it’s to build security into your fast-paced workflow. That’s exactly what DevSecOps is about. By weaving security into development and operations from day one, you mitigate that blind spot. Instead of a mad scramble to patch holes after an incident, you’re preventing those holes from appearing in the first place. Think of it as vaccinating your product against common exploits and misconfigurations. You still move fast, but you’re not flying blind.</p>
<p><strong>DevSecOps on a Startup Budget: Working Smarter with Automation</strong></p>
<p>One pushback I often hear is, “We’re just a tiny startup – we can’t afford a full security team or expensive tools yet.” The good news is, you don’t need a dedicated army of security engineers to start practicing DevSecOps. In fact, adopting DevSecOps early can save your startup time and money in the long run by catching issues early (when they’re cheapest to fix) and by reducing the likelihood of costly breaches.</p>
<p>DevSecOps isn’t about buying fancy appliances; it’s a mindset and a set of practices. For a startup, that means <strong>working smarter with the resources you have</strong>. Here are a few practical ways we’ve implemented DevSecOps on a lean budget:</p>
<p>• <strong>Automate Repetitive Checks:</strong> Use open-source or built-in tools to automate security tests in your CI/CD (Continuous Integration/Continuous Deployment) pipeline. For example, you can run static code analysis to catch vulnerable code, dependency scans to find risky libraries, and secret scanners to ensure no API keys slip into your repos. These can run on every code commit or pull request. In one of my previous roles, after we set up automated dependency scanning, we caught and fixed dozens of vulnerabilities <em>before</em> they ever made it to production – saving us from potential exploits down the road.</p>
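<p>As a sketch of what that CI step can look like in practice – the tool choices here (gitleaks for secret scanning, Trivy for dependency scanning) are illustrative examples, not the only options, and the script assumes both are installed on the CI runner:</p>

```shell
#!/usr/bin/env bash
# Hypothetical CI security gate: fail the pipeline if committed secrets
# or known-vulnerable dependencies are found. Runs on every commit/MR.
set -euo pipefail

# Scan the repo for committed secrets (API keys, tokens, passwords).
gitleaks detect --source . --redact

# Scan dependency lockfiles and the filesystem for known CVEs;
# a non-zero exit on HIGH/CRITICAL findings blocks the deploy.
trivy fs --exit-code 1 --severity HIGH,CRITICAL .

echo "Security checks passed"
```

<p>Because the script exits non-zero on a finding, the pipeline halts before the vulnerable code ever reaches production – the “tireless security assistant” in a dozen lines.</p>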
<p>• <strong>Leverage Cloud Security Features:</strong> If you’re on AWS (as many startups are), take advantage of the free or low-cost security features from day one. AWS has services like GuardDuty (for threat detection), Config (for policy compliance checks), and CloudTrail (for auditing), to name a few. Turning these on early creates a safety net. I like to set up simple alerts for things like “unexpected public S3 bucket” or unusual login locations. It’s amazing how a few well-placed guardrails can prevent the classic rookie mistakes (like accidentally leaving storage buckets open to the world).</p>
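<p>Switching these guardrails on takes minutes. The commands below are a minimal sketch – the account ID is a placeholder, and they assume the AWS CLI is configured with sufficient permissions in the target account:</p>

```shell
# Enable GuardDuty threat detection in the current region.
aws guardduty create-detector --enable

# Block public S3 access account-wide – the classic "open bucket" mistake.
# 111122223333 is a placeholder account ID; substitute your own.
aws s3control put-public-access-block \
  --account-id 111122223333 \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```

<p>From there, wiring GuardDuty findings to a Slack or email alert gives you the “unexpected public bucket” and “unusual login” signals mentioned above with essentially no ongoing effort.</p>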
<p>• <strong>Infrastructure as Code &amp; Safe Defaults:</strong> Define your infrastructure in code and put it in version control. By doing this, you can enforce best practices (like secure configurations) by default. For instance, our team uses Terraform to spin up cloud resources with security built-in – only necessary ports open, proper encryption enabled, least-privilege access roles, etc. Developers don’t have to think about those details because the infrastructure code and templates handle it. This way, security isn’t a one-off effort; it’s consistently applied every time we deploy.</p>
<p>The key is <strong>automation and consistency</strong>. As a startup, you want your small team focused on building the product, not doing tedious manual security reviews for each release. DevSecOps lets you automate those reviews. It’s like having a tireless security assistant who checks everything in the background while your team concentrates on core work.</p>
<p>By investing a bit of time to set up these automated checks and processes early, you avoid much bigger headaches (and costs) later. Trust me, spending an afternoon to script a security test is way better than spending a week doing damage control after a hack.</p>
<p><strong>Security Is a Team Sport: Building a Security-First Culture</strong></p>
<p>Tools and automation are fantastic, but DevSecOps is more than just tools – it’s about culture. One of the biggest benefits of adopting DevSecOps early is the ability to instill a <strong>security-first culture</strong> from the get-go. In a startup, culture is formed quickly by the founding team’s values and habits. If security is part of that DNA, it becomes a shared responsibility rather than “someone else’s problem.”</p>
<p>In practical terms, a security-first culture means developers, ops, and even product folks all recognize that security is part of their job. It doesn’t mean everyone becomes a security expert, but it does mean everyone cares about doing the right thing. Here are a few ways to nurture this culture:</p>
<p>• <strong>Lead by Example:</strong> As a tech leader or founder, your attitude towards security sets the tone. If you treat security as an important, enabling part of building software (e.g. asking “How can we make this feature secure by design?”) rather than as a hindrance, the team will follow. In my teams, I make a point to celebrate when someone proactively finds and fixes a security issue. It’s as worthy of praise as shipping a new feature.</p>
<p>• <strong>Empower and Educate:</strong> Give your team the knowledge and tools to act on security. At Prommt, we started “DevSync”, where developers, infra, and ops get together to discuss and demo new security enhancements and how they can be applied in our web app to improve security even further. That knowledge paid off when we later caught a potential DDoS attack <em>before</em> it went live, simply because we had embedded a security enhancement that was first discussed and demoed in <em>DevSync</em>.</p>
<p>• <strong>Integrate, Don’t Silo:</strong> Avoid the old-school model of a separate security team that only swoops in at the end. In a startup, you probably don’t even have a separate security team – and that’s okay. Make security part of your development process. If you do have a security specialist (like me at Prommt), embed them in the design and sprint cycles rather than keeping them on the sidelines. When devs and ops see security folks as partners rather than gatekeepers, they’re more likely to engage us early to get things right.</p>
<p>By fostering this collaborative mindset early, you reduce friction later on. I’ve seen the contrast: in one organization that embraced DevSecOps from day one, developers would literally call me over to double-check their approach to storing passwords or to brainstorm the safest way to implement a feature. In another company that bolted on security much later, every security fix felt like pulling teeth – developers were defensive and saw it as “extra work.” The difference was night and day, and it all came down to culture.</p>
<p>Plus, a security-first culture is something you can proudly share with investors and customers. It shows that your startup is mature and trustworthy beyond its years. In the long run, that reputation can be as valuable as any feature on your roadmap.</p>
<p><strong>Our DevSecOps Journey at Prommt: Integrating AWS Security Services into Our CI/CD Pipeline</strong></p>
<p>Let me share a concrete example from my current role. When I joined Prommt as the Lead DevSecOps Engineer, we were a rapidly scaling startup in the payments industry. Given our responsibility for handling sensitive financial transactions, we understood immediately that security had to be integral, not an afterthought. Early on, we made deliberate choices to embed robust security measures like AWS WAF, Inspector, AWS Landing Zone, and GuardDuty directly into our development lifecycle through our CI/CD processes.</p>
<p>What does this integration look like in practice? Here are a few highlights:</p>
<p>• <strong>AWS Landing Zone as a Secure Foundation</strong>: We adopted AWS Landing Zone to establish a secure, multi-account AWS environment right from the start. This provided us a standardized baseline for managing account governance, access controls, and compliance across the organization, reducing operational overhead and security risks significantly.</p>
<p>• <strong>Security Embedded CI/CD Pipelines</strong>: Every code commit automatically triggers our CI/CD pipeline, which includes security checks leveraging SonarQube, Automated Security Helper (ASH) and AWS Inspector. If critical issues are detected, deployments are halted immediately, providing developers clear and actionable remediation guidance. Catching vulnerabilities early means fewer risks make it into production.</p>
<p>• <strong>Real-time Threat Detection with AWS GuardDuty</strong>: We integrated AWS GuardDuty into our platform to provide continuous threat detection and monitoring. GuardDuty analyzes logs and network activities across our AWS accounts, automatically identifying suspicious activities such as unauthorized access attempts or unusual API calls. This proactive approach helps us quickly pinpoint and address security incidents before they escalate.</p>
<p>• <strong>Web Application Firewall (AWS WAF)</strong>: To protect our web-facing services, we implemented AWS WAF at the platform level. WAF provides robust protection against common web exploits such as SQL injection and cross-site scripting (XSS). By integrating WAF rules directly into our deployment process, each new web application or microservice is immediately shielded from known threats, significantly reducing our attack surface.</p>
<p>Integrating these AWS security tools delivered substantial benefits:</p>
<p>• <strong>Streamlined Security Operations</strong>: With AWS Landing Zone and GuardDuty in place, we've significantly cut down on manual security management tasks. Our security team spends less time firefighting and more time improving proactive security measures.</p>
<p>• <strong>Enhanced Developer Confidence</strong>: Embedding security directly within CI/CD pipelines reduces the stress associated with deployments. Developers appreciate the rapid feedback provided by SonarQube, Automated Security Helper (ASH), AWS Inspector and GuardDuty, knowing that potential issues are flagged immediately, allowing them to address security as part of their normal workflow.</p>
<p>• <strong>Simplified Compliance Audits</strong>: As a payments-focused startup regularly audited for compliance (PCI-DSS, GDPR), the standardization offered by AWS Landing Zone and the continuous protection of GuardDuty and WAF makes audit processes smoother and less disruptive. Our security practices are transparent and consistently enforced, satisfying auditors and stakeholders alike.</p>
<p>Perhaps most importantly, integrating AWS security services has significantly boosted team morale. Developers at Prommt now spend less time worrying about hidden vulnerabilities or security misconfigurations and more time delivering secure, quality features to our customers. Security has genuinely become an empowering aspect of their everyday workflow.</p>
<p>The Prommt story is just one way to implement DevSecOps early, but it shows that even a small team can build powerful safeguards into their workflow. The key takeaway is to <strong>embed security and automation into the fabric of your development process</strong>. Whether it’s a full-blown IDP or simply a well-tuned CI/CD pipeline with security gates, that integration will pay dividends as you scale.</p>
<p><strong>The Payoff: Trust, Speed, and Peace of Mind</strong></p>
<p>Embracing DevSecOps from day one isn’t just about avoiding disasters – it can actually <strong>fuel your startup’s growth</strong>. When security is baked in early, you’re not constantly putting out fires. You can ship features faster and more confidently because the team isn’t bogged down by last-minute security scrambles or emergency patching.</p>
<p>There’s also a clear business upside. Customers and investors might not see all the under-the-hood work, but they <em>do</em> notice the outcomes: your product is reliable, there’s no news of data leaks, and you can speak confidently about your security practices. This builds trust. I’ve been in due diligence meetings where a startup’s security posture was a deciding factor for an investment. Being able to say “Yes, we have automated security testing, infrastructure guardrails, and a security-aware culture from day one” can literally <strong>secure</strong> the deal (pun intended). In a crowded market, a strong security story sets you apart – it signals that you’re not just moving fast, you’re moving fast <em>and</em> safe.</p>
<p>Finally, consider the human factor: peace of mind. Launching a startup is stressful enough without wondering if today is the day you’ll get hacked or accidentally expose user data. Knowing that you’ve put a DevSecOps foundation in place – even if it’s not perfect – helps everyone sleep a little better at night. I can attest that it’s a great feeling when weeks go by without any 3 AM security emergencies. Your on-call engineers will thank you when they’re not waking up to critical pager alerts every other week.</p>
<p><strong>Conclusion</strong></p>
<p>Security doesn’t have to be the enemy of innovation. In fact, when done right, it’s a catalyst for sustainable innovation. By adopting DevSecOps from day one, you’re investing in your startup’s long-term resilience. The payoff comes in many forms: fewer security incidents, faster delivery, happier developers, and greater trust from users and investors.</p>
<p>If you’re a startup founder or engineer, here are a few actionable takeaways to get started:</p>
<p>1. <strong>Start Small, But Start Now:</strong> Pick one or two security practices and integrate them into your development process <em>this week</em>. For example, enable an automated dependency scan in your build, or add a step in code review to check for basic security issues. You don’t have to do everything at once – the important part is to begin.</p>
<p>2. <strong>Automate the Basics:</strong> Identify common security “gotchas” (like leaked credentials, open admin ports, or outdated libraries) and use scripts or tools to check for them continuously. Set up alerts for the critical ones. Early automation yields big benefits with minimal effort.</p>
<p>3. <strong>Educate &amp; Involve the Team:</strong> Share at least one security tip or lesson in your next team meeting. Encourage questions about security when designing features. Maybe even host a casual threat modeling session over pizza. Make security a normal part of the conversation, not a taboo topic.</p>
<p>4. <strong>Leverage the Community and Tools:</strong> You’re not alone in this. There are plenty of free resources, open-source tools, and communities (DevSecOps forums, blogs, etc.) where you can learn tips tailored for startups. Don’t reinvent the wheel – borrow proven ideas and adapt them to your needs.</p>
<p>In the end, DevSecOps is about building a company that can move fast <strong>and</strong> confidently. I’ve been on both sides – the frantic firefighting mode and the smooth, secure delivery mode – and I can’t recommend the proactive approach enough. Security-conscious startups set themselves up for success by treating DevSecOps as a day-one priority.</p>
<p>Thanks for reading! If you have questions or want to share your own startup security stories, drop a comment – I’d love to hear your experiences.</p>
]]></content:encoded></item></channel></rss>