
Production‑Ready GPU Inference Autoscaling on EKS with Karpenter, KEDA, and Dragonfly

TL;DR This architecture uses Karpenter + KEDA + Dragonfly on EKS to scale GPU inference pods from zero, pull model images faster, and cut GPU spend with spot-first provisioning. Cold starts are 84s; warm starts are 7s (with a small stub image). Everything is GitOps-driven via ArgoCD and fully reproducible with Terraform.

If you’re tired of paying for GPU nodes that sit idle half the day, or waiting minutes for cold starts when traffic suddenly spikes, this guide is for you.

Most teams running GPU inference on Kubernetes eventually hit the same wall:

  • GPUs are expensive
  • Traffic is spiky
  • Cold starts are painful
  • Large model images make everything worse
  • Scaling is either too slow or too costly
  • GitOps workflows often break when autoscaling enters the picture

This architecture solves all of that.

It gives you:

  • Scale‑to‑zero when idle
  • Fast burst capacity when demand arrives
  • Predictable cost with spot‑first provisioning
  • Minimal cold‑start pain, even with 8–40 GB model images
  • A clean GitOps + IaC workflow
  • Compatibility with both small and large GPU workloads

It works whether you run a single g4dn.xlarge or a fleet of A100s.

All code lives in the companion repo: 👉 https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling

Why This Architecture Exists

GPU nodes are expensive. A single p4d.24xlarge costs ~$32/hr on‑demand.

If your traffic is:

  • quiet at night
  • busy during the day
  • unpredictable during peak hours

…then paying for always‑on GPU capacity is pure waste.

But scaling to zero introduces its own problems:

  • GPU node provisioning
  • Large image pulls
  • Model weight loading
  • CUDA initialisation
  • Slow ECR bandwidth
  • Cold-start latency that can stretch into minutes

This architecture eliminates the false choice between:

  • Always‑on GPU waste
  • Cold‑start disasters

and gives you:

  • Elastic GPU capacity
  • Predictable cost
  • Fast warm starts
  • Production‑grade reliability

It’s built from components that each solve a specific pain point:

  • Karpenter → fast, flexible GPU node provisioning
  • KEDA / Knative → pod‑level autoscaling for async + HTTP workloads
  • Dragonfly → P2P image distribution that eliminates ECR bottlenecks
  • NVIDIA Device Plugin + GFD → predictable GPU scheduling
  • ArgoCD → GitOps-driven reproducibility
  • Terraform → infrastructure as code

Together, they give you GPU elasticity without the cost burn.

Who This Architecture Is For

Small workloads (1 GPU)

  • You want scale‑to‑zero
  • You want predictable cold‑start behaviour
  • You want to avoid paying for idle GPUs
  • You want a clean GitOps workflow

Large workloads (10–100+ GPUs)

  • You need fast burst capacity
  • You need to avoid ECR bottlenecks
  • You need spot‑first provisioning with fallback
  • You need observability‑driven autoscaling

Common real‑world fits

  • LLM inference APIs
  • Async video/image processing
  • Model evaluation pipelines
  • Embedding services
  • Chatbots and RAG systems

Benchmark Results + Load Test Summary

Before diving into the architecture, let’s answer the only question that really matters:

Does this actually work?

Yes, and the numbers are the payoff for everything that follows.

These results come from two instrumented runs of validate-scaling.sh on a real EKS 1.35 cluster in eu‑west‑2, using T4 GPUs (g4dn.xlarge spot).

Cold vs Warm Start (Dragonfly P2P Cache + existing node)

Note: a 130 MB stub image was used for the tests.

Event | Cold Start | Warm Start (Dragonfly P2P)
Queue push → node Ready | 37s | 3s
Queue push → pods serving | 84s | 7s
Scale‑to‑zero | 307s | 285s

The 84s cold start breaks into two phases: EC2 launch (~37s) and ECR image pull (~47s), covered in detail in the Scaling Lifecycle section below.

Warm start 84s → 7s (12× faster)

Scale‑to‑Zero Behaviour

Scale‑in is dominated by KEDA’s cooldownPeriod:

  • Default: 300s
  • Observed: 285–307s

After cooldown:

  • Karpenter’s WhenEmpty consolidation policy + consolidateAfter drains the node
  • EC2 instance terminates cleanly

This matches the configured behaviour almost exactly.

Load Test Summary

A 90‑second load test with 5 concurrent workers at fixed 150ms processing time produced:

  • Throughput: 602–607 requests
  • p50 latency: ~175ms (150ms processing + ~25ms HTTP overhead)
  • GPU: NVIDIA T4, 15,360 MiB VRAM
  • Dragonfly mirror: PASS (warm run served entirely from P2P cache)
  • Validation score: 24/25 checks passed
    • The single FAIL was a consolidation timing edge case in the test script, not an architecture issue

Trade-offs Accepted

  • 84s cold start is the cost of full scale-to-zero
  • Dragonfly adds operational complexity
  • pollingInterval: 30 means up to 30s of lag before scale-out
  • Spot-first means occasional interruptions
  • ECR image-per-model means large images, no hot-swapping weights

Architecture Overview

This architecture is intentionally simple to understand at a glance, yet flexible enough to support everything from a single‑GPU dev cluster to a 100‑GPU production fleet.

It’s built around three principles:

  1. System pods should never depend on burst GPU capacity
  2. Baseline GPU capacity (if any) should be predictable and stable
  3. Burst GPU capacity should scale to zero and back without friction

Here’s how that looks in practice.

A clean, production‑ready AWS layout:

VPC (10.0.0.0/16)
├── Public subnets
│   └── NAT Gateway, Internet Gateway
└── Private subnets
    └── EKS nodes, VPC endpoints (ECR, S3, STS)

Node Groups & NodePools

The cluster uses a three‑tier node strategy:

1. Always‑On System Node Group (Managed Node Group)

Runs:

  • CoreDNS
  • Karpenter
  • KEDA
  • Knative control plane
  • Prometheus
  • Dragonfly seed nodes
  • ArgoCD
  • Kyverno
  • External Secrets

Default: t3.medium (cheap, predictable)

2. Baseline GPU Node Group (Optional)

Used in production when you want:

  • predictable minimum GPU capacity
  • warm image caches
  • stable throughput during business hours

Default: 0 GPUs in dev, 4 GPUs in prod.

3. Karpenter NodePools (Burst Capacity)

Two NodePools:

  • general → CPU-only spot nodes
  • gpu → GPU spot nodes with on‑demand fallback

GPU NodePool supports:

  • g4dn.xlarge, g4dn.2xlarge
  • g5.xlarge, g5.2xlarge
  • Spot-first, on-demand fallback
  • GPU taints to prevent CPU workloads from landing on GPU nodes

This separation is critical:

Without separate NodePools, Karpenter may place a CPU-only workload on a GPU node, wasting expensive capacity.

Addons (Platform Layer)

All installed via ArgoCD:

  • Karpenter v1.9.0 – node autoprovisioner
  • NVIDIA Device Plugin v0.19.1 – exposes nvidia.com/gpu
  • GPU Feature Discovery – labels GPU nodes with VRAM, product, CUDA version
  • Knative Serving v1.21.1 – HTTP autoscaling
  • Kourier v1.21.0 – Knative networking
  • KEDA v2.19.0 – async autoscaling
  • kube-prometheus-stack v82.10.4 – metrics + Grafana
  • Dragonfly v1.6.15 – P2P image distribution
  • DCGM Exporter – GPU metrics for Prometheus
  • Kyverno – policies
  • External Secrets Operator – secret management

Everything is declarative, GitOps-driven, and reproducible.

High-Level Architecture Diagram

Why This Architecture Works

  • System pods are isolated from GPU churn
  • GPU workloads only land on GPU nodes
  • GPU nodes scale to zero cleanly
  • Dragonfly eliminates ECR bottlenecks
  • Karpenter provisions nodes fast
  • KEDA/Knative scale pods predictably
  • GitOps keeps everything reproducible

Scaling Lifecycle (End‑to‑End)

To tune this architecture effectively, you need to understand how the components interact from the moment a job arrives to the moment the GPU node disappears.

This is the mental model that makes everything else click:

KEDA is the signal (“we need more pods”). Karpenter is the muscle (“here are the nodes to run them”).

Below is the complete end‑to‑end flow, including timings from real benchmark runs.

Spot Interruption Handling

Spot interruptions are handled cleanly:

  • AWS sends a 2‑minute warning to an SQS queue
  • Karpenter listens to that queue
  • When a termination notice arrives:
    • Karpenter cordons the node
    • Drains all pods
    • Schedules replacements on other nodes
    • Terminates the instance gracefully

This ensures:

  • GPU jobs in flight when the notice arrives have up to 2 minutes to complete*
  • Pods are drained gracefully before the instance terminates
  • Replacements are scheduled on other nodes before termination

Spot-first is safe because the architecture is interruption-aware.

*The deployment sets terminationGracePeriodSeconds: 120, matching AWS’s 2-minute spot interruption notice; this gives the pod the full window to finish in-flight work before Kubernetes hard-kills it. Whether jobs are actually preserved depends on the inference server handling SIGTERM: it needs to stop accepting new requests and wait for in-flight work to complete. The stub doesn’t implement this (it’s not a real inference server). For production workloads, add a SIGTERM handler or a preStop hook to drain the queue.
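A minimal sketch of that pattern on the pod template is shown below; the container name and the /tmp/draining drain signal are illustrative assumptions, not part of the repo’s stub:

spec:
  terminationGracePeriodSeconds: 120      # matches the 2-minute spot interruption notice
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            # Signal the worker to stop pulling new jobs, then give in-flight work time to finish
            command: ["/bin/sh", "-c", "touch /tmp/draining && sleep 90"]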

Timings (Cold vs Warm)

From real benchmark runs on g4dn.xlarge spot, eu-west-2:

  • Node Ready: 37s cold → 3s warm
  • Pods Serving: 84s cold → 7s warm
  • Scale-to-zero: 285–307s (matches KEDA cooldown)

Cold starts break down into two phases:

  1. EC2 launch (~37s) – the unavoidable floor: AWS booting a GPU instance, AL2023 initialising, and the NVIDIA device plugin advertising nvidia.com/gpu
  2. Image pull (~47s) – even a 130 MB stub image takes ~47s pulled directly from ECR on a cold node; for real inference images (8–40 GB) this phase dominates

The warm start eliminates both. The GPU node is already running, no provisioning needed. Dragonfly has already seeded the P2P cache from the cold pull, so the ECR registry isn’t contacted again; layers are served peer-to-peer in under a second.

Dragonfly is the difference between:

  • 47s ECR pull – cold node, image fetched directly from ECR
  • <5s P2P delivery – image served from Dragonfly peer cache; ECR is never contacted

In production, “warm” is a deliberate choice: setting consolidateAfter: 2h on the GPU NodePool keeps nodes alive between traffic peaks, preserving both the warm node and the Dragonfly cache. In dev, consolidation is 3 minutes, so the warm state is transient.
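Concretely, “warm in prod” is just the disruption block on the GPU NodePool; a minimal sketch matching the dev/prod split described above:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 2h   # prod: keep warm nodes and the Dragonfly cache between peaks (dev uses 3m)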

Why This Lifecycle Matters

This lifecycle is the backbone of the entire architecture. It explains why:

  • KEDA and Karpenter complement each other
  • Dragonfly is mandatory for large images
  • GPU nodes can scale to zero safely
  • Cold starts are predictable
  • Warm starts are extremely fast
  • Spot-first provisioning is viable
  • GitOps remains stable even under churn

Once you understand this flow, the rest of the architecture becomes intuitive.

Key Design Decisions

These are the decisions that determine whether your GPU autoscaling architecture will be fast, predictable, and cost‑efficient or painful, flaky, and expensive.

Each choice below is deliberate, battle‑tested, and grounded in real‑world constraints.

1. Karpenter vs Cluster Autoscaler vs EKS Auto Mode

Feature | Karpenter | Cluster Autoscaler | EKS Auto Mode
Speed | Fastest | Slow | Medium
GPU flexibility | High | Medium | Low
Cost control | Excellent | Good | Poor
Custom AMIs | Yes | Yes | No
Spot-first | Yes | Yes | Yes
Scale-to-zero | Yes | Yes | Partial
Supports mixed GPU fleets | Yes | Yes | No

Verdict

If you care about GPU cost, flexibility, and scale‑to‑zero, Karpenter wins.

Why

Karpenter provisions nodes directly via EC2 RunInstances / CreateFleet, bypassing ASGs entirely. This gives you:

  • faster provisioning
  • more instance types
  • spot-first with fallback
  • custom AMIs
  • GPU-aware scheduling
  • consolidation (automatic scale-in)

EKS Auto Mode is simpler but:

  • costs more
  • hides configuration
  • doesn’t support custom AMIs
  • doesn’t support DRA
  • doesn’t support mixed GPU fleets

Cluster Autoscaler is stable but slow and ASG-bound.

2. GPU Resource Allocation: Device Plugin vs DRA

This single choice determines what autoscaling strategies are even possible.

Device Plugin (what this architecture uses)

The NVIDIA device plugin exposes nvidia.com/gpu as a schedulable resource:

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

It’s:

  • simple
  • predictable
  • stable
  • widely supported
  • compatible with Karpenter

Dynamic Resource Allocation (DRA)

DRA is the future of GPU scheduling in Kubernetes.

It introduces ResourceClaims with CEL expressions for:

  • fractional GPUs
  • MIG profiles
  • time‑slicing
  • MPS
  • Blackwell hardware (mandatory)
  • structured parameters (added in Kubernetes 1.31)

But today:

Karpenter does not support DRA. (Tracked in kubernetes-sigs/karpenter#1231, no published ETA.)

If a pod uses resourceClaims, Karpenter:

  • ignores it entirely
  • cannot provision nodes for it
  • leaves pods stuck in Pending

EKS Auto Mode has the same limitation.

If you need DRA today

You must use:

  • Cluster Autoscaler + managed node groups
  • static GPU capacity
  • no scale-to-zero

Quick comparison

Feature | Device Plugin | DRA
Allocation granularity | Whole GPU | Fractional, MIG, time-slice
Karpenter compatible | Yes | No
EKS Auto Mode compatible | Yes | No
Cluster Autoscaler compatible | Yes | Yes
GPU sharing | Uniform | Per-pod flexible
Blackwell / P6e‑GB200 | Not supported | Required
Best for | Bursty inference, Karpenter scaling | Multi-tenant sharing, MIG, Blackwell

Future-proofing

Once DRA reaches GA and Karpenter supports it, you can:

  • swap the device plugin for a DRA driver
  • keep your pod specs the same
  • gain fractional GPUs + MIG + time-slicing

This architecture is designed to evolve cleanly when that day comes.

3. Always Set GPU Limits on the NodePool

This is a critical safety guardrail.

Karpenter’s default limits cover CPU and memory, not GPUs.

Without:

limits:
  nvidia.com/gpu: "8"

…a misconfigured job can trigger unbounded GPU node provisioning.

The repo includes this limit by default.

4. GPU Node Labeling with GPU Feature Discovery (GFD)

When a GPU node joins the cluster, you need more than just nvidia.com/gpu: "1".

You need to know:

  • VRAM
  • GPU model
  • CUDA version
  • GPU count
  • Architecture

GFD annotates nodes with labels like:

nvidia.com/gpu.memory=40Gi
nvidia.com/gpu.product=A100-SXM4-40GB
nvidia.com/gpu.count=8
nvidia.com/gpu.cuda.version=12.2

5. Dragonfly vs. ECR-only pulls

Dragonfly turns image distribution from:

O(N) pulls from ECR

into

O(1) pull + P2P fan-out

For large images (8–40 GB), this is the difference between:

  • 15–30 minutes cold start (with a real 8–40 GB image, not the 130 MB stub)
  • seconds warm start

Dragonfly is mandatory for:

  • large model images
  • burst scaling
  • multi-node inference fleets
  • cost-efficient ECR usage

6. Why Separate CPU and GPU NodePools

If you don’t separate them:

  • Karpenter may place CPU-only workloads on GPU nodes
  • GPU nodes become expensive general-purpose nodes
  • GPU capacity becomes unpredictable
  • Consolidation becomes less effective

The GPU NodePool carries a taint:

gpu=true:NoSchedule

Only pods that explicitly tolerate it can land on GPU nodes.

This guarantees:

  • no accidental GPU waste
  • predictable GPU scheduling
  • clean scale-to-zero behaviour
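
For illustration, a GPU workload’s pod template opts in to the GPU NodePool roughly like this (the NodePool name gpu matches the earlier section; everything else is a sketch):

spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  nodeSelector:
    karpenter.sh/nodepool: gpu     # assumes the GPU NodePool is named "gpu"
  containers:
    - name: inference
      resources:
        limits:
          nvidia.com/gpu: 1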

7. Why ECR + Dragonfly Instead of EFS for Model Storage

Some teams store model weights on EFS and mount them into pods.

This architecture intentionally avoids that.

Why?

  • EFS is slower than local NVMe
  • EFS adds per-GB and per-IOPS cost
  • EFS introduces cold-start latency
  • EFS is a single point of contention
  • ECR + Dragonfly is faster, cheaper, and scales better

Packaging model weights inside container images gives you:

  • versioned, immutable artifacts
  • reproducible deployments
  • compatibility with GitOps
  • P2P distribution via Dragonfly
  • no shared filesystem bottlenecks

8. MIG, Time-Slicing, and Multi-Tenant GPU Sharing

This architecture does not use MIG or time-slicing today because:

  • They require DRA when using Kubernetes-native MIG scheduling
  • Karpenter does not support DRA
  • GPU Operator introduces complexity not needed for burst inference

But the architecture is designed to evolve:

  • GFD already exposes MIG-capable metadata
  • NodePools can be extended with MIG profiles
  • DRA migration will be seamless once supported

Autoscaling Components (KEDA + Knative)

GPU autoscaling isn’t just about provisioning nodes; it’s about scaling pods in a way that matches your workload pattern.

This architecture supports two autoscaling modes, each optimized for a different inference pattern:

  • KEDA → async, queue-driven, batch, event-driven
  • Knative → HTTP, interactive, low-latency

They can coexist in the same cluster without interfering with each other.

Let’s break them down.

KEDA – Autoscaling for Async Workloads

KEDA is ideal when inference is queue-driven:

  • SQS
  • Kafka
  • Redis
  • RabbitMQ
  • Prometheus metrics
  • Cron triggers
  • Custom metrics

Why KEDA?

Async workloads have a natural buffer (the queue), which makes them:

  • bursty
  • unpredictable
  • latency-tolerant
  • throughput-oriented

KEDA shines here because it:

  • scales from 0 → N based on queue depth
  • supports Prometheus as a trigger
  • integrates cleanly with Karpenter
  • works with any container runtime
  • is simple, predictable, and transparent

Why Prometheus Instead of SQS?

You can scale directly from SQS, but Prometheus is better for GPU inference.

Prometheus gives you:

  • source-agnostic metrics (you can switch queues without changing autoscaling logic)
  • richer semantics (queue depth, processing time, GPU utilization, backlog age)
  • unified observability (everything in one place: Grafana, alerts, dashboards)
  • better control (you can scale on derived metrics, not just raw queue length)

Example: Scale on gpu_job_queue_depth instead of raw SQS message count.

Example KEDA ScaledObject (Simplified)

minReplicaCount: 0
pollingInterval: 30
cooldownPeriod: 300
triggers:
  - type: prometheus
    metadata:
      metricName: gpu_job_queue_depth
      threshold: "5"

Key behaviours

  • minReplicaCount: 0 → scale‑to‑zero
  • pollingInterval: 30 → polls Prometheus every 30 seconds
  • cooldownPeriod: 300 → prevents flapping
  • Prometheus trigger → flexible, observable, debuggable

This is the async backbone of the architecture.

Note: Set pollingInterval: 15 if you need tighter scaling response; the tradeoff is double the Prometheus query rate. With pollingInterval: 30, a job arriving just after a poll can wait 30s before KEDA acts. 37s node + 47s image + 30s KEDA lag = ~114s worst case.
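
For reference, a complete version of that ScaledObject might look like the following; the Deployment name, Prometheus address, and query are illustrative assumptions rather than the repo’s exact manifest:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-worker
  namespace: inference
spec:
  scaleTargetRef:
    name: gpu-inference-worker        # Deployment to scale
  minReplicaCount: 0
  maxReplicaCount: 10
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
        metricName: gpu_job_queue_depth
        query: sum(gpu_job_queue_depth)
        threshold: "5"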

Knative – Autoscaling for HTTP Workloads

Knative is ideal when inference is HTTP-driven:

  • chatbots
  • LLM APIs
  • embedding endpoints
  • image/video processing APIs
  • interactive services

Why Knative?

Knative gives you:

  • HTTP autoscaling
  • concurrency-based scaling
  • scale-to-zero
  • built-in activator
  • request buffering
  • automatic cold-start mitigation

Knative is effectively “serverless for Kubernetes,” but with GPU support.

Concurrency Model (Important)

Knative scales based on concurrency, not CPU or memory.

Example:

  • containerConcurrency: 1 → 1 request per pod
  • containerConcurrency: 10 → 10 requests per pod

For GPU inference, you typically want:

  • 1–2 for large models
  • 4–8 for small models
  • 10+ for embedding workloads

This gives you predictable latency and throughput.
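
Putting that together, a GPU-backed Knative Service might look roughly like this; the service name, image, and scale bounds are assumptions, and scheduling onto tainted GPU nodes also requires Knative’s kubernetes.podspec-tolerations feature flag:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-api
  namespace: inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"     # scale-to-zero
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containerConcurrency: 2                      # 1–2 for large models
      tolerations:
        - key: gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - image: <account>.dkr.ecr.<region>.amazonaws.com/llm-api:latest
          resources:
            limits:
              nvidia.com/gpu: 1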

The Activator (Cold-Start Mitigation)

When a Knative service is scaled to zero:

  • the activator receives the first request
  • buffers it
  • triggers scale-up
  • forwards the request once the pod is ready

This prevents:

  • 404s
  • connection resets
  • client timeouts

It’s one of the reasons Knative is so good for GPU inference APIs.

KEDA vs Knative – When to Use Which

Pattern | Use
Queue-driven | KEDA
Batch jobs | KEDA
Async pipelines | KEDA
HTTP APIs | Knative
Interactive inference | Knative
Chatbots / RAG | Knative
Mixed workloads | Both

Can they coexist?

Yes – and they do in this architecture.

KEDA handles async pipelines. Knative handles HTTP inference. Both scale pods. Karpenter provisions nodes for both.

Why Not Use Only One?

Because:

  • KEDA is not optimized for HTTP
  • Knative is not optimized for queues
  • Both have different scaling semantics
  • Both solve different cold-start problems
  • Both integrate cleanly with Karpenter

This architecture uses the right tool for each job.

Dragonfly + containerd Wiring (AL2023, certs.d, hosts.toml)

Dragonfly is the component that turns GPU autoscaling from “theoretically possible” into “practically fast.” But Dragonfly only works if containerd is wired correctly, and on AL2023 the defaults changed in a way that breaks most online guides.

This section explains:

  • why AL2023 ignores inline registry config
  • why you must use certs.d
  • how the DaemonSet writes hosts.toml
  • how Dragonfly intercepts all ECR pulls
  • what happens when it doesn’t

This is one of the most important operational details in the entire system.

The AL2023 containerd Change (Critical)

EKS 1.35 uses containerd 1.7.x on AL2023.

This version ignores inline registry configuration inside /etc/containerd/config.toml.

Instead, it uses:

/etc/containerd/certs.d/<registry>/hosts.toml

This means:

  • inline registry mirrors do nothing
  • Dragonfly will not intercept pulls
  • containerd silently falls back to ECR
  • you lose all P2P benefits

This is the #1 cause of “Dragonfly isn’t working” reports.

Correct Wiring: certs.d + hosts.toml

To ensure Dragonfly intercepts all ECR pulls, you must create:

/etc/containerd/certs.d/<account>.dkr.ecr.<region>.amazonaws.com/hosts.toml

With contents like:

server = "https://<registry>"

[host."http://127.0.0.1:4001"]
  capabilities = ["pull", "resolve"]

This tells containerd:

  • “When pulling from ECR, talk to Dragonfly first.”
  • “Dragonfly will fetch from ECR if needed.”
  • “Otherwise, use the P2P cache.”

DaemonSet: Writing hosts.toml Automatically

The repo includes a DaemonSet that:

  • runs on every node (including GPU nodes)
  • detects the correct ECR registry hostname
  • writes the appropriate hosts.toml file
  • restarts containerd if needed
  • validates the configuration

This ensures:

  • every node routes pulls through Dragonfly
  • no manual configuration
  • no drift
  • no surprises during scale-out
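
A minimal sketch of the core idea is below. The repo’s DaemonSet also detects the registry per account/region and validates the result; the account ID, region, namespace, and image here are placeholders, and hostPath volumes require a namespace where restricted/baseline PSS is not enforced:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dragonfly-containerd-config
  namespace: dragonfly-system
spec:
  selector:
    matchLabels:
      app: dragonfly-containerd-config
  template:
    metadata:
      labels:
        app: dragonfly-containerd-config
    spec:
      tolerations:
        - operator: Exists              # run on tainted GPU nodes too
      containers:
        - name: write-hosts-toml
          image: public.ecr.aws/docker/library/busybox:1.36
          command: ["/bin/sh", "-c"]
          args:
            - |
              REGISTRY="123456789012.dkr.ecr.eu-west-2.amazonaws.com"
              DIR="/host/etc/containerd/certs.d/${REGISTRY}"
              mkdir -p "${DIR}"
              cat > "${DIR}/hosts.toml" <<EOF
              server = "https://${REGISTRY}"

              [host."http://127.0.0.1:4001"]
                capabilities = ["pull", "resolve"]
              EOF
              while true; do sleep 3600; done   # keep the pod Running so the DaemonSet stays healthy
          volumeMounts:
            - name: containerd-certs
              mountPath: /host/etc/containerd
      volumes:
        - name: containerd-certs
          hostPath:
            path: /etc/containerd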

Validating Dragonfly Wiring

The repo includes validation checks in validate-scaling.sh:

  • confirms Dragonfly mirror is active
  • confirms containerd is using hosts.toml
  • confirms warm pulls come from P2P cache
  • confirms no ECR bandwidth is used

A successful warm run shows:

Dragonfly mirror: PASS warm run served entirely from P2P cache

If you see this, you’re good.

Why Not Use ECR Pull-Through Cache Instead?

Because:

  • it still requires O(N) pulls
  • it does not provide P2P fan-out
  • it does not reduce cold-start time
  • it does not reduce ECR throttling
  • it does not help with large images
  • it does not reduce NAT Gateway costs

Dragonfly solves all of these.

GitOps with ArgoCD (App‑of‑Apps Pattern)

Autoscaling GPU infrastructure is only half the story. The other half is making sure the entire platform (Karpenter, KEDA, Knative, Dragonfly, NVIDIA plugin, workloads, policies) is deployed consistently, declaratively, and without manual kubectl.

That’s where ArgoCD comes in.

This architecture uses a clean, production‑grade App‑of‑Apps pattern that:

  • deploys the entire cluster from a single root manifest
  • separates platform from workloads
  • enforces ordering via sync waves
  • self‑heals drift
  • integrates cleanly with Terraform
  • keeps everything GitOps‑driven

Let’s break it down.

The App‑of‑Apps Model

At the top level, there is a single ArgoCD Application:

argocd/app-of-apps.yaml

This root application points ArgoCD at the argocd/ directory, which contains four ApplicationSets:

  1. platform-helm
  2. platform-kustomize
  3. security-infra
  4. apps

Each ApplicationSet deploys a logical slice of the platform.

This gives you:

  • clean separation of concerns
  • predictable ordering
  • reproducible environments
  • zero manual installation steps

The Four ApplicationSets

1. platform-helm

Deploys all Helm‑based platform components:

  • Karpenter
  • KEDA
  • Prometheus + Grafana
  • Kyverno
  • OpenCost
  • Dragonfly
  • NVIDIA device plugin
  • DCGM Exporter
  • External Secrets Operator
  • Pushgateway

These are the operators and services that form the backbone of the cluster.

2. platform-kustomize

Deploys all Kustomize‑based platform components:

  • Knative Serving
  • Kourier
  • Karpenter NodePool + EC2NodeClass
  • Kyverno policies
  • External Secrets ClusterSecretStore
  • Grafana admin ExternalSecret
  • Prometheus alerting rules

This layer configures the platform operators installed by platform-helm.

3. security-infra

Deploys foundational security primitives:

  • Namespaces
  • Pod Security Standards (PSS)
  • NetworkPolicies (default deny)
  • ResourceQuotas
  • PriorityClasses
  • PodDisruptionBudgets

This ensures the cluster is secure before any operators or workloads are deployed.

4. apps

Deploys the actual inference workloads:

  • KEDA Deployment + ScaledObject
  • Knative Service
  • Any additional inference pipelines

This layer depends on all CRDs and operators being ready.

Sync Waves (Critical for First Install)

ArgoCD sync waves ensure everything is applied in the correct order.

Wave 0 – security-infra

Creates namespaces + PSS labels. Operators cannot start without these.

Wave 1 – platform-helm + platform-kustomize

Installs operators and their CRDs. Internal sub-waves ensure:

  • CRDs are installed before CRD-dependent manifests
  • NodePools apply only after Karpenter is ready
  • Kyverno policies apply only after Kyverno is running
  • External Secrets apply only after ESO is ready

Wave 2 – apps

Deploys inference workloads only after:

  • KEDA is ready
  • Knative is ready
  • Karpenter is ready
  • Dragonfly is ready
  • NVIDIA plugin is ready
  • GPU nodes can be provisioned

This prevents race conditions and broken first installs.

GitOps Workflow

Once the root Application is applied:

kubectl apply -f argocd/app-of-apps.yaml

ArgoCD takes over:

  • installs all platform components
  • installs all workloads
  • self-heals drift
  • re-syncs on every push to main
  • enforces declarative state

No more:

  • manual helm installs
  • kubectl apply
  • one-off scripts
  • snowflake clusters

Everything is GitOps-driven.

Terraform + Infrastructure (AWS Layer)

Everything in this architecture sits on top of a clean, predictable AWS foundation. Terraform owns the infrastructure layer (VPC, EKS, IAM, node groups, Pod Identity, and VPC endpoints), while ArgoCD owns the Kubernetes layer.

This separation gives you:

  • reproducible clusters
  • safe upgrades
  • clean GitOps workflows
  • zero manual configuration
  • predictable GPU autoscaling behaviour

Let’s break down the key infrastructure components and why they matter.

VPC Endpoints (Critical for GPU Autoscaling)

GPU autoscaling stresses ECR and STS more than almost any other workload. Without VPC endpoints, every image pull and every Pod Identity credential request goes through the NAT Gateway.

That means:

  • slow image pulls
  • NAT bottlenecks
  • surprise NAT bills
  • unpredictable cold-start behaviour

This architecture provisions four endpoints:

1. ecr.api (Interface Endpoint)

Authenticates ECR pulls.

2. ecr.dkr (Interface Endpoint)

Transfers image layer manifests.

3. s3 (Gateway Endpoint)

ECR stores image layers in S3; this gateway endpoint routes that traffic over the AWS backbone.

4. sts (Interface Endpoint)

Required for EKS Pod Identity credential resolution.

Why these matter

With these endpoints:

  • image pulls stay inside the VPC
  • Dragonfly can seed the P2P cache faster
  • Pod Identity works without NAT
  • cold starts become predictable
  • NAT data transfer costs for ECR/S3/STS drop to near-zero

This is one of the most important infrastructure optimizations in the entire architecture.

EKS Pod Identity (Not IRSA)

This architecture uses EKS Pod Identity, not IRSA, for:

  • Karpenter
  • EBS CSI
  • any AWS-integrated components

Why Pod Identity?

  • no OIDC thumbprint management
  • no service account annotations
  • no IAM role trust policy complexity
  • no race conditions during bootstrap
  • simpler, safer, cleaner

Terraform provisions:

  • the IAM roles
  • the Pod Identity associations
  • the eks-pod-identity-agent addon
  • the STS VPC endpoint

This ensures:

  • Karpenter can call EC2 APIs
  • EBS CSI can provision volumes
  • workloads can access AWS services
  • no NAT Gateway dependency

IAM Requirements for Karpenter

Karpenter needs permissions for:

  • ec2:RunInstances / CreateFleet
  • instance type discovery
  • spot price discovery
  • AMI resolution via SSM
  • IAM instance profile management
  • pricing API access (pricing:GetProducts)

Missing any one of these causes a different failure mode:

  • nodes stuck in NotReady
  • nodes failing to join
  • pods stuck in Pending
  • Karpenter provisioning loops
  • silent failures

The repo includes the full, correct IAM policy.

Node Join Authorization

Karpenter-provisioned nodes must be authorized to join the cluster.

With terraform-aws-modules/eks v21.x:

  • this is handled automatically via EKS access entries

If you’re on an older version:

  • you must manually update aws-auth
  • otherwise nodes will register but remain NotReady

This is a common pitfall in DIY setups, but fully automated here.

GPU AMIs (AL2023)

This architecture uses the AL2023 GPU AMI, which includes:

  • NVIDIA drivers pre-installed
  • containerd 1.7.x
  • correct kernel modules
  • correct GPU runtime configuration

But it does not include:

  • the NVIDIA device plugin
  • GPU Feature Discovery
  • DCGM Exporter

Those are installed via ArgoCD.

Why AL2023?

  • faster boot times
  • better GPU driver stability
  • better containerd performance
  • better security posture
  • long-term support

DCGM Exporter (GPU Metrics)

DCGM Exporter provides:

  • GPU utilization
  • GPU memory usage
  • temperature
  • power draw
  • ECC errors

These metrics feed:

  • Prometheus
  • Grafana dashboards
  • autoscaling decisions (if desired)
  • alerting rules

It schedules only on GPU nodes via:

  • node affinity
  • GPU taints
  • tolerations

This ensures clean separation between CPU and GPU observability.
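
For reference, those scheduling constraints amount to something like the following Helm values (a sketch; the chart’s exact keys and the taint name may differ in your setup):

tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present      # label applied by GPU Feature Discovery
              operator: In
              values: ["true"]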

Terraform Owns the Infrastructure, ArgoCD Owns the Platform

This separation is intentional:

Terraform owns:

  • VPC
  • Subnets
  • NAT Gateway
  • VPC endpoints
  • EKS cluster
  • Node groups
  • IAM roles
  • Pod Identity
  • EBS CSI
  • Karpenter IAM
  • Security groups

ArgoCD owns:

  • Karpenter
  • KEDA
  • Knative
  • Dragonfly
  • NVIDIA plugin
  • DCGM Exporter
  • Prometheus
  • Kyverno
  • External Secrets
  • Workloads

This gives you:

  • clean separation of concerns
  • safe upgrades
  • reproducible clusters
  • GitOps-driven platform
  • no manual configuration drift

With it, the architecture becomes fast, predictable, and production-ready.

GPU VRAM Sizing (Quick Reference)

Choosing the right GPU instance type is mostly about VRAM. Model size (in parameters) determines how much VRAM you need for inference, especially in fp16.

This table gives you a fast, practical reference for common AWS GPU instances.

GPU VRAM Sizing Table

Instance Type | GPU | VRAM | Max Model Size (fp16)
g4dn.xlarge | 1× T4 | 16 GB | ~7B params
g5.xlarge | 1× A10G | 24 GB | ~13B params
g5.12xlarge | 4× A10G | 96 GB | ~70B params
p4d.24xlarge | 8× A100 | 320 GB | ~180B params (tensor parallel)

Notes

  • fp16 models require roughly 2 bytes per parameter, plus overhead (see the worked example below).
  • Larger models (30B–180B) require tensor parallelism across multiple GPUs.
  • For vLLM, throughput scales with both VRAM and memory bandwidth.
  • For Triton, VRAM determines batch size and concurrency.
  • For embedding models, VRAM requirements are much lower, often 4–8 GB is enough.
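
Worked example (rough numbers): a 7B‑parameter model in fp16 needs ~7B × 2 bytes ≈ 14 GB for weights alone, which is why the 16 GB T4 is listed as the practical ceiling for ~7B models; the remaining ~2 GB is all that’s left for KV cache and activations.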

Node Autoscaling (Karpenter Good Practices)

Karpenter is the engine that makes GPU autoscaling fast, flexible, and cost‑efficient, but only if you configure it correctly. GPU workloads amplify every mistake: a misconfigured NodePool can cost thousands per month or cause multi‑minute cold starts.

This section distills the best practices that make GPU autoscaling predictable, safe, and cheap.

1. Spot‑First Provisioning with On‑Demand Fallback

Spot GPUs are dramatically cheaper:

  • g4dn.xlarge: ~70% cheaper
  • g5.xlarge: ~60% cheaper
  • p4d: often 50% cheaper

But spot capacity is not guaranteed.

The correct pattern is:

  1. Try spot first
  2. Fallback to on‑demand if spot unavailable
  3. Never block workloads waiting for spot

This is implemented via:

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]

Karpenter automatically:

  • attempts spot
  • falls back to on‑demand
  • retries spot later during consolidation

This gives you the best of both worlds:

  • low cost
  • high reliability

2. Separate CPU and GPU NodePools

This is mandatory for predictable GPU autoscaling.

Why?

If CPU workloads can land on GPU nodes:

  • GPU nodes become expensive general-purpose nodes
  • consolidation becomes ineffective
  • GPU capacity becomes unpredictable
  • scale-to-zero breaks
  • Karpenter may refuse to terminate GPU nodes

The fix:

  • GPU NodePool has a taint: gpu=true:NoSchedule
  • GPU workloads have a matching toleration
  • CPU workloads do not tolerate it

This guarantees:

  • GPU nodes only run GPU workloads
  • CPU workloads never waste GPU capacity
  • GPU nodes can scale to zero cleanly

3. Set GPU Limits on the NodePool (Critical)

Without GPU limits, Karpenter may over‑provision GPU nodes.

Example:

limits:
  nvidia.com/gpu: "8"

This prevents:

  • runaway provisioning
  • unexpected multi‑GPU nodes
  • cost explosions
  • scheduling mismatches

This is one of the most important safety guardrails in the entire architecture.

4. Use Multiple GPU Instance Types (Flexibility = Speed)

Allowing multiple GPU types dramatically improves provisioning speed.

Example:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
      - g4dn.xlarge
      - g4dn.2xlarge
      - g5.xlarge
      - g5.2xlarge

5. Consolidation Windows (Scale‑In Behaviour)

Karpenter’s consolidation logic determines when GPU nodes are terminated.

Recommended settings:

  • dev: consolidateAfter: 3m
  • prod: consolidateAfter: 2h

Why?

  • dev: fast feedback, low cost
  • prod: avoid flapping during traffic spikes

Consolidation only triggers when:

  • no pods are running
  • no pods are pending
  • no pods are terminating
  • no pods are using local storage

This ensures safe, predictable scale‑in.

6. Use consolidationPolicy: WhenEmpty for GPU NodePools

This ensures GPU nodes terminate once they’ve been empty for the consolidateAfter window.

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 3m

7. Use GFD Labels for VRAM‑Aware Scheduling

GPU Feature Discovery labels nodes with VRAM:

nvidia.com/gpu.memory=16Gi

This allows you to schedule workloads based on VRAM requirements:

nodeSelector:
  nvidia.com/gpu.memory: "24Gi"

This is essential when mixing:

  • T4 (16 GB)
  • A10G (24 GB)
  • A100 (40–80 GB)

It prevents:

  • OOM errors
  • wasted capacity
  • unpredictable scheduling

8. Prefer Smaller GPU Nodes for Bursty Workloads

For bursty inference workloads:

  • g4dn.xlarge (1× T4)
  • g5.xlarge (1× A10G)

are ideal.

Why?

  • faster provisioning
  • faster warm starts
  • better bin-packing
  • less fragmentation
  • easier scale-to-zero

Large multi‑GPU nodes (A100, H100) are best for:

  • large LLMs
  • tensor parallelism
  • high-throughput batch inference

But they are slower to provision and harder to scale elastically.

9. Don’t Use DaemonSets on GPU Nodes (Unless Safe)

Most DaemonSets block scale-to-zero.

If you must run a DaemonSet on GPU nodes:

  • use tolerations
  • use node affinity
  • use nodeSelector
  • ensure it’s lightweight

Examples that are safe:

  • DCGM Exporter
  • Node Problem Detector

Examples that break scale-to-zero:

  • logging agents
  • service meshes
  • sidecar-heavy stacks

This architecture avoids all of those on GPU nodes.

10. Use Pod Anti‑Affinity for Multi‑GPU Workloads

If you run multiple GPU pods per node, use anti‑affinity to avoid:

  • GPU contention
  • VRAM fragmentation
  • unpredictable latency

Example:

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: inference
      topologyKey: kubernetes.io/hostname

This ensures one GPU pod per node unless explicitly desired.

Security Hardening (Practical, Production‑Ready Defaults)

GPU inference clusters often start as “just get it running,” but once you move toward production, you need guardrails. This architecture includes a minimal, sane, production‑ready security baseline that doesn’t get in your way but protects you from the most common failure modes.

Everything here is deployed via ArgoCD in the security-infra ApplicationSet.

Let’s walk through the components.

1. Pod Security Standards (PSS) – Namespaces Enforced by Default

Namespaces are labeled individually based on what runs in them; there is no one-size-fits-all PSS level for a cluster that includes system DaemonSets, GPU drivers, and application workloads:

Namespace | enforce | audit | warn | Reason
platform | restricted | baseline | baseline | Internal services; no privilege needed
inference | baseline | baseline | baseline | No privilege needed; baseline avoids over-restricting debugging
monitoring | privileged | baseline | baseline | DCGM Exporter and node-exporter require host device access
external-secrets | baseline | baseline | baseline | No privilege needed
kube-system | (unlabelled) | – | – | NVIDIA device plugin requires privileged: true; EKS leaves unenforced by default

Inference pods themselves run without elevated privileges; the NVIDIA device plugin, which does require host-level access, runs in kube-system where PSS is not enforced.

2. NetworkPolicies – Default Deny Everywhere

Every namespace gets a default deny NetworkPolicy:

  • no pod-to-pod traffic unless explicitly allowed
  • no pod-to-node traffic unless allowed
  • no cross-namespace traffic unless allowed
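
The policy itself is small; per namespace it looks roughly like this (namespace name assumed):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: inference
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress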

3. ResourceQuotas – Prevent Runaway Workloads

Each namespace gets a ResourceQuota to prevent:

  • runaway pod creation
  • runaway PVC creation
  • runaway GPU requests
  • accidental cluster exhaustion

Example:

limits.cpu: 20
limits.memory: 64Gi
requests.nvidia.com/gpu: 4

This ensures:

  • workloads cannot consume all GPUs
  • workloads cannot starve system components
  • misconfigurations cannot take down the cluster

4. PriorityClasses – System Pods Always Win

Two PriorityClasses are defined:

system-critical

Used by:

  • Karpenter
  • KEDA
  • Knative control plane
  • Prometheus
  • Dragonfly seed nodes
  • External Secrets
  • Kyverno

workload-default

Used by inference workloads.
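
As a sketch, the two classes might be defined like this; the numeric values are assumptions and only their relative ordering matters:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-critical
value: 1000000
globalDefault: false
description: Platform components that must not be evicted by inference workloads.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: workload-default
value: 1000
globalDefault: false
description: Default priority for inference workloads, assigned via priorityClassName.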

5. PodDisruptionBudgets – Protect Critical Components

PDBs are applied to:

  • Karpenter
  • KEDA
  • Knative activator
  • Prometheus
  • Dragonfly seed nodes

6. Kyverno Policies – Guardrails, Not Handcuffs

Kyverno audits:

  • require-gpu-request – pods must declare nvidia.com/gpu in resource limits
  • require-gpu-toleration – pods must tolerate the gpu=true:NoSchedule taint
  • require-pdb-for-gpu-pods – pods must carry the app label so a PodDisruptionBudget can select them
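
A simplified sketch of the first policy is below (the repo’s actual policies are more complete; the namespace and pattern here are assumptions, and in this form every container in a matched pod must declare the GPU limit):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-request
spec:
  validationFailureAction: Audit
  rules:
    - name: gpu-limit-required
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - inference
      validate:
        message: "GPU pods must declare nvidia.com/gpu in resources.limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: "?*"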

Why Kyverno?

  • simpler than OPA Gatekeeper
  • easier to reason about
  • integrates cleanly with GitOps
  • policies are YAML, not Rego
  • great for platform teams

Kyverno is deployed via ArgoCD and configured in the platform-kustomize layer.

7. External Secrets Operator – No Secrets in Git

Secrets come from:

  • AWS Secrets Manager
  • AWS SSM Parameter Store

Not from Git.

8. No DaemonSets on GPU Nodes (Except DCGM Exporter)

DaemonSets are often blamed for blocking scale-to-zero, but Karpenter’s WhenEmpty policy defines “empty” as no non-DaemonSet pods remaining. DaemonSet pods like DCGM Exporter are excluded from that check, so they don’t prevent GPU node consolidation.

This means the rule is: keep DaemonSets off GPU nodes unless they serve a GPU-specific purpose.

This architecture ensures:

  • logging agents run only on system nodes
  • service meshes run only on system nodes
  • monitoring agents run only on system nodes
  • only DCGM Exporter runs on GPU nodes, and the benchmark confirms GPU nodes still consolidate cleanly (285–307s scale-in) with it running.

This keeps GPU nodes ephemeral and cheap.

9. Ingress Hardening (Knative + Kourier)

Knative + Kourier is configured with:

  • HTTPS termination
  • mTLS between components
  • strict routing
  • no wildcard hosts
  • no public access unless explicitly enabled

This protects inference APIs from:

  • accidental exposure
  • cross-namespace routing
  • misconfigured hostnames

Gotchas (Common Pitfalls & How to Avoid Them)

Even with a clean architecture, GPU autoscaling has sharp edges. These are the issues that most commonly break scale‑to‑zero, slow down cold starts, or cause unpredictable behaviour.

This section is a checklist: if something isn’t working, start here.

A. Dragonfly Not Intercepting Pulls

This is the #1 cause of slow warm starts.

Fixes:

  • Ensure registry hostname matches exactly
  • Ensure hosts.toml exists under certs.d
  • Ensure Dragonfly proxy service is reachable
  • Ensure containerd restarted after config change

If warm starts take >10 seconds, this is the culprit.

B. Karpenter Provisioning Issues

Common causes:

  • Missing EC2 IAM permissions
  • Pod Identity not bound
  • NodePool missing GPU instance types
  • Using DRA instead of device plugin

If pods are Pending with no nodes created, start here.

C. Pods Stuck in Pending

Usually caused by scheduling constraints:

  • Missing GPU toleration
  • Wrong VRAM selector
  • Wrong instance type selector
  • DaemonSet occupying GPU nodes

If pods are Pending after nodes are created, this is the section to check.

D. Slow Cold Starts

Cold node starts should be ~30–40 seconds.

If they’re >60 seconds:

  • VPC endpoints missing
  • Dragonfly not intercepting pulls
  • Large images pulled directly from ECR
  • Knative activator buffering

This is almost always a networking or image distribution issue.

E. Scale-to-Zero Not Working

GPU nodes should terminate cleanly after cooldown.

If they don’t:

  • KEDA cooldown too long
  • DaemonSets blocking drains
  • PDBs too strict
  • Sidecars preventing termination

This is the most common cause of “GPU nodes never scale in.”

F. Unpredictable GPU Costs

If GPU costs spike unexpectedly:

  • CPU workloads landed on GPU nodes
  • GPU NodePool missing limits
  • Spot-only with no fallback
  • Consolidation disabled

This is why the architecture enforces taints, tolerations, and limits.

Getting Started (Quickstart)

This architecture is fully reproducible. You can deploy the entire stack, VPC, EKS, Karpenter, KEDA, Knative, Dragonfly, NVIDIA plugin, workloads, using Terraform + ArgoCD with no manual steps.

You can deploy the full architecture in a few steps.

For complete instructions, see the repo’s README.

1. Clone the repo

git clone https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling
cd eks-gpu-inference-autoscaling

2. Run the setup script

This injects your GitHub org, environment, and repo paths into the ArgoCD manifests.

./scripts/setup.sh --env dev

It updates:

  • argocd/app-of-apps.yaml
  • argocd/applicationset-platform-kustomize.yaml
  • argocd/applicationset-security.yaml
  • argocd/applicationset-apps.yaml

…and validates the structure before applying.

3. Install ArgoCD

kubectl create namespace argocd
kubectl apply -n argocd \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

Note: For production, install ArgoCD via its Helm chart with pinned chart version and custom values (admin password, RBAC, SSO). The raw manifest install above is sufficient for this demo but should not be used as-is in a shared cluster.

4. Apply the root Application

ArgoCD deploys the entire platform automatically:

kubectl apply -f argocd/app-of-apps.yaml

5. Validate scaling

Once the cluster is healthy:

./validate-scaling.sh

For full installation details, configuration options, and environment-specific guidance, see the repo.

Conclusion

GPU autoscaling isn’t magic; it’s engineering. And when you put the right pieces together, the results speak for themselves:

  • Cold starts become predictable
  • Warm starts become fast
  • GPU nodes scale to zero cleanly
  • Costs stay under control
  • The platform stays stable under load
  • Everything is reproducible through GitOps

This architecture works because every layer reinforces the others:

  • Terraform gives you a clean, secure AWS foundation
  • ArgoCD turns the cluster into a self‑healing system
  • Karpenter provisions GPU nodes quickly and cheaply
  • KEDA and Knative scale workloads intelligently
  • Dragonfly eliminates ECR bottlenecks
  • NVIDIA’s device plugin + GFD make GPU scheduling predictable
  • Security guardrails keep everything safe without slowing you down

The end result is a platform that can run anything from a single‑GPU dev workload to a bursty, multi‑GPU production fleet, without manual intervention, without snowflake clusters, and without surprise bills.

If you want to explore the code, deploy the architecture, or adapt it to your own workloads, the full repository is here: 👉 https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling

This is a foundation you can build on, for LLMs, embeddings, vision models, batch inference, streaming pipelines, or whatever comes next.

References and Further Reading


Bonus: Using Dapr with .NET Aspire for a Modern Local Development Experience

Dapr keeps infrastructure concerns out of your application code. Aspire keeps local orchestration out of your head. Put them together and you get one of the smoothest .NET development loops available today.

This bonus post is optional and .NET‑specific, but if you’re a .NET engineer, it’s worth your time. It shows how Aspire and Dapr complement each other, how they fit into a clean local development workflow, and how to wire them together in a real project.

This bonus post uses .NET Aspire 13.1 and Dapr 1.16, the current stable versions at the time of writing. Aspire 13.1 provides the new distributed application model, automatic dashboard integration, and first‑class Dapr sidecar orchestration. Dapr 1.16 provides the building blocks (state, pub/sub, bindings, secrets, observability) used throughout the examples in this repository, including the code to demonstrate running with Aspire.

How Aspire and Dapr Work Together

What Each Provides

Dapr provides infrastructure building blocks:

  • State management
  • Pub/sub
  • Bindings
  • Secrets
  • Observability
  • Service invocation

Aspire provides local application orchestration:

  • Service discovery
  • Configuration
  • Containerized dependencies
  • A unified dashboard
  • A single command to run your entire app

Once you understand what each tool provides individually, the next step is seeing how they work together in a real application.

How They Fit Together

Aspire defines your application topology.
Dapr injects infrastructure capabilities into each service.

The workflow looks like this:

  1. Aspire AppHost defines your services.
  2. Dapr sidecars run next to each service.
  3. Dapr components are loaded automatically by each sidecar when ResourcesPaths is configured.
  4. Your services use the Dapr client without caring about ports or infrastructure.

This results in a clean, local environment without the usual orchestration overhead.

Together, Aspire and Dapr give you a unified way to run multiple services, sidecars, and infrastructure components with minimal configuration.

Now that we’ve covered how Aspire and Dapr complement each other conceptually, let’s look at a minimal example that wires them together in a real project.

A Minimal Aspire + Dapr Example

Folder structure

src/
  orderservice-dotnet/
  inventoryservice-dotnet/
  aspirehost/
components/

The components/ folder contains standard Dapr component YAML files (state store, pub/sub, etc.). Aspire mounts this folder automatically so the sidecars can load them. These YAML files match the Dapr building block examples used throughout the repository, so the same components work whether you run via Aspire or the Dapr CLI.
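
For example, a typical state store component in components/ looks like this; the component name and local Redis address follow Dapr’s defaults, so adjust them to your setup:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore
spec:
  type: state.redis
  version: v1
  metadata:
    - name: redisHost
      value: localhost:6379
    - name: redisPassword
      value: ""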

Step 1: Add the Dapr hosting package

In your AppHost project:

dotnet add package CommunityToolkit.Aspire.Hosting.Dapr

Step 2: Define your services with Dapr sidecars

Aspire generates strongly typed project references under the Projects namespace (e.g., Projects.orderservice_dotnet). These types are created automatically when the AppHost and services are part of the same solution.

With the project structure in place, the AppHost simply wires each service to a Dapr sidecar and mounts the shared components folder.

AppHost.cs:

using CommunityToolkit.Aspire.Hosting.Dapr;

var builder = DistributedApplication.CreateBuilder(args);

builder.AddProject<Projects.orderservice_dotnet>("ordersservice")
    .WithDaprSidecar(new DaprSidecarOptions
    {
        AppId = "orderservice",
        ResourcesPaths = [Path.Combine("../..", "components")]
    });

builder.AddProject<Projects.inventoryservice_dotnet>("inventoryservice")
    .WithDaprSidecar(new DaprSidecarOptions
    {
        AppId = "inventoryservice",
        ResourcesPaths = [Path.Combine("../..", "components")]
    });


builder.Build().Run();

What this gives you:

  • Aspire launches all services defined in the AppHost
  • Aspire launches the Dapr sidecars that are configured using .WithDaprSidecar()
  • Dapr components are loaded automatically by each sidecar based on the configured ResourcesPaths.
  • Everything appears in the Aspire dashboard

No manual dapr run commands.
No launch profiles.
No need to manually assign ports.

Observability Across Both Worlds

Aspire 13.1 automatically configures OTLP endpoints and dashboard settings, so no manual environment variables are required.

The Aspire dashboard surfaces both application‑level and infrastructure‑level signals, giving you a unified view of the entire local environment.

You get two layers of visibility:

Aspire dashboard

  • Running services
  • Sidecars
  • Logs
  • Environment variables
  • Health checks

Dapr observability

  • Traces
  • Metrics
  • Component logs

Together, they give you a complete picture of both application and infrastructure behaviour.

When Aspire + Dapr Is a Great Fit

  • .NET microservices
  • Local-first development workflows
  • Teams who want a clean, discoverable environment
  • Developers who want to avoid docker-compose sprawl
  • Systems where infrastructure concerns must stay out of app code

When it’s not ideal

  • Polyglot systems
  • Environments where Aspire isn’t available
  • Teams needing full control over orchestration

Aspire can orchestrate non‑.NET services, but they won’t appear in the dashboard with the same depth of metadata, and Aspire cannot generate strongly typed project references for them.

Common Pitfalls (and How to Avoid Them)

  • Don’t mix Aspire service discovery with Dapr service invocation.
    They solve different problems.
  • Keep ports consistent when needed.
    Both Aspire and Dapr assign dynamic ports by default. If your services rely on fixed ports, you’ll need to specify them explicitly.
  • Most Dapr components don’t hot-reload.
    Restarting the AppHost when changing YAML ensures changes are picked up.

Final Thoughts

Aspire 13.1 and Dapr 1.16 work together cleanly: Aspire handles orchestration and developer experience, while Dapr provides infrastructure building blocks. Used together, they give you a fast, modern, production‑aligned development loop with almost no ceremony.