
Part 1 – What is Dapr, and Why Would You Use It?

Distributed systems are powerful, but they come with a familiar cost: every service ends up carrying a surprising amount of infrastructure code. Database clients, message broker SDKs, storage SDKs, retry logic, connection handling, secrets, configuration, observability; the list grows quickly.

Most teams don’t notice this at first. It feels normal.
But over time, the weight becomes obvious:

  • Every service looks different
  • Every SDK behaves differently
  • Every language has its own patterns
  • Every infrastructure change requires code changes
  • Local development becomes fragile
  • Testing becomes harder
  • Onboarding slows down

This is the problem space where Dapr lives.

What Dapr Is

Dapr (the Distributed Application Runtime) is a runtime that provides a set of building blocks for common distributed system capabilities:

  • State management
  • Pub/Sub
  • Bindings to external systems
  • Secrets
  • Service invocation
  • Observability

Each building block exposes a consistent API, regardless of the underlying infrastructure.

Your service talks to Dapr.
Dapr talks to Redis, Kafka, S3, Postgres, Service Bus, and more.

Dapr runs as a sidecar next to your service, exposing HTTP and gRPC endpoints your application can call.

This separation keeps your application code clean, portable, and focused on business logic.
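To make that concrete: the sidecar learns about infrastructure from small component definitions, not from your code. Your application only calls the sidecar's API, for example POST http://localhost:3500/v1.0/state/statestore on the sidecar's default HTTP port, while a component file tells Dapr what actually sits behind that call. A minimal sketch of a Redis-backed state store component (the component name, file name, and Redis address are illustrative):

# statestore.yaml - Dapr component definition (illustrative names and values)
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore            # the name your service uses in the state API URL
spec:
  type: state.redis           # the backing store the sidecar talks to
  version: v1
  metadata:
    - name: redisHost
      value: localhost:6379   # never referenced by application code
    - name: redisPassword
      value: ""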

What Dapr Is Not

Dapr is not:

  • a database
  • a message broker
  • a workflow engine
  • a service mesh
  • a replacement for Kubernetes
  • a silver bullet

It doesn’t remove the need to understand your infrastructure.
It removes the need to couple your application code to it.

Dapr can run alongside service meshes, orchestrators, and cloud‑native tooling; they solve different problems.

Why Dapr Exists (The Real Problem It Solves)

Most developers think the pain is:

“I have to write boilerplate code for Redis, Kafka, S3…”

But the real pain is deeper:

1. SDK sprawl

Every SDK has its own:

  • retry semantics
  • connection lifecycle
  • error model
  • configuration
  • authentication
  • testing story

Multiply that across languages and teams, and the system becomes inconsistent and hard to evolve.

2. Infrastructure leaking into application code

Connection strings, broker details, storage paths: all embedded in services.

3. Local development drift

Running Redis, Kafka, storage, secrets, and multiple services locally is painful and rarely matches production.

4. Polyglot inconsistency

Go, .NET, Python, Java: each has its own libraries, patterns, and failure modes.

5. Infrastructure churn

Switching from Redis to Postgres, or Kafka to Service Bus or RabbitMQ, becomes a multi‑service refactor.

6. Service-to-service communication complexity

Retries, timeouts, discovery, identity, and mTLS all behave differently across languages and frameworks.

Dapr solves these problems by providing consistent, portable building blocks that sit beside your service, not inside it.
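To see what that means for the infrastructure churn problem above: with Dapr, moving the state store from Redis to Postgres is a change to the component definition, not to any service. A sketch, assuming the statestore component shown earlier and an illustrative connection string:

# statestore.yaml - same component name, different backing store (illustrative values)
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore              # unchanged, so application code is untouched
spec:
  type: state.postgresql        # was state.redis
  version: v1
  metadata:
    - name: connectionString
      value: "host=localhost user=postgres password=example port=5432 database=dapr"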

Why AI Doesn’t Replace Dapr

AI can generate code. Dapr removes the need to write certain kinds of code.
These are not the same thing.

AI tools (Copilot, ChatGPT, etc.) can help you write code faster, but they do not:

  • provide service discovery
  • implement retries, backoff, or circuit breakers
  • enforce mTLS
  • manage secrets
  • abstract cloud infrastructure
  • provide a consistent API surface across languages
  • run as a sidecar
  • handle distributed tracing
  • guarantee idempotency
  • manage state consistency
  • orchestrate pub/sub delivery
  • provide actor placement
  • run workflows
  • integrate with brokers or databases
  • provide runtime‑level resiliency

AI can describe these patterns.
AI can generate code for these patterns.
AI cannot execute these patterns at runtime.

Dapr is a runtime, not a code generator.

AI ≠ Runtime

AI can help you write:

  • a retry loop
  • a Kafka consumer
  • a Redis client
  • a workflow engine wrapper
  • a secret retrieval helper

But AI cannot:

  • run a sidecar
  • enforce mTLS between services
  • manage distributed locks
  • guarantee delivery semantics
  • provide cross‑language consistency
  • abstract infrastructure behind a stable API
  • hot‑reload components
  • manage actor placement across nodes
  • provide a unified telemetry pipeline

These require execution, not generation.

AI-generated code still needs a runtime

Even if AI writes perfect code:

  • you still need service discovery
  • you still need retries and backoff
  • you still need state consistency
  • you still need pub/sub semantics
  • you still need secrets management
  • you still need observability
  • you still need portability
  • you still need mTLS
  • you still need infrastructure abstraction

Dapr provides these at runtime, consistently, across languages and environments.

AI cannot replace that.

AI + Dapr is actually the ideal pairing

AI helps you write business logic.
Dapr handles the distributed systems plumbing.

Together, they give you:

  • less boilerplate
  • fewer SDKs
  • fewer infrastructure decisions
  • more consistent architecture
  • faster iteration
  • safer defaults

AI accelerates development.
Dapr stabilizes execution.

They solve different problems.

Why Architects Care About Dapr

Architects think in terms of:

  • consistency
  • portability
  • governance
  • cross‑cutting concerns
  • security boundaries
  • observability
  • multi‑language teams
  • future‑proofing

Dapr provides:

  • consistent APIs across languages
  • consistent cross‑cutting behavior
  • consistent local and production environments
  • consistent observability
  • consistent security (mTLS, identity)
  • consistent patterns for state, events, and external systems

It gives architects a way to standardise distributed system capabilities without forcing a specific language, framework, or service mesh.

Why This Series Exists

Dapr’s documentation explains each building block clearly.
What it doesn’t try to do is:

  • show how those pieces fit together in real systems
  • explain how Dapr helps in day‑to‑day engineering
  • address developer and architect objections
  • show how to run Dapr locally in a practical way
  • provide a polyglot example that feels real
  • explain what Dapr solves, and what it doesn’t

This series fills that gap.

It’s designed to answer three questions:

1. What is Dapr?

A runtime that provides consistent building blocks for distributed systems.

2. Why would I use it?

To reduce complexity, improve consistency, and keep infrastructure out of application code.

3. How do I get up and running?

By running Dapr locally, building a real service, and understanding how it fits into your architecture.

What This Series Covers

Over the next posts, we’ll walk through:

  • Running Dapr locally
  • State management
  • Pub/Sub
  • Bindings
  • Observability
  • Building a real service in Go and .NET
  • Deploying to Kubernetes
  • Using Dapr with .NET Aspire (bonus)

Each part includes practical examples you can run yourself.

What You’ll Be Able to Do by the End

By the end of this series, you’ll know how to:

  • Build services that don’t depend on infrastructure SDKs
  • Run a multi‑service system locally with consistent behavior
  • Store state, publish events, and integrate with external systems
  • Observe cross‑service flows with zero instrumentation
  • Deploy the same patterns to Kubernetes
  • Build polyglot services that share the same architecture
  • Understand where Dapr fits, and where it doesn’t


Part 5 – Troubleshooting, Scaling, and Production Hardening

By this point in the series, you have a fully working, multi‑environment observability pipeline, deployed and reconciled entirely through GitOps:

  • Jaeger v2 running on the OpenTelemetry Collector
  • Applications emitting traces automatically via the OpenTelemetry Operator
  • Environment‑scoped Collectors and Instrumentation CRs
  • Argo CD managing everything through ApplicationSets and sync waves

This final part focuses on what matters most in real‑world environments: operability. Deploying Jaeger v2 is easy. Running it reliably at scale with predictable performance, clear failure modes, and secure communication is where engineering judgment comes in.

This guide covers the most important lessons learned from operating OpenTelemetry and Jaeger in production.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🩺 Troubleshooting: The Most Common Issues (and How to Fix Them)

1. “I don’t see any traces.”

This is the most common issue, and it almost always comes down to one of three things:

a. Wrong OTLP endpoint

Check the app’s environment variables (injected by the Operator):

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT

If these are missing, the Operator did not inject them, which usually means the Instrumentation CR has no exporter endpoint configured or was never applied in the workload's namespace.

Protocol mismatch also matters:

The .NET auto‑instrumentation exports OTLP over HTTP, so your Instrumentation CR should point at port 4318, not the gRPC port 4317.
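A minimal Instrumentation CR for this setup might look like the following sketch (the resource name is illustrative; the Collector service and namespaces match the ones used later in this post):

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation    # illustrative name
  namespace: apps
spec:
  exporter:
    # OTLP over HTTP, hence 4318; full service DNS name because the Collector lives in another namespace
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage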

b. Collector not listening on the expected ports

Verify:

kubectl get svc jaeger-inmemory-instance-collector -n monitoring

In this architecture, the Collector runs in the monitoring namespace, while instrumented workloads run in the apps namespace.

You must see:

  • 4318 (OTLP HTTP)
  • 4317 (OTLP gRPC)
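These ports come from the receiver section of the Collector config inside the OpenTelemetryCollector CR, which typically looks like this:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP (used by the .NET auto-instrumentation)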

c. Auto‑instrumentation not activated

For .NET, the Operator must inject:

  • DOTNET_STARTUP_HOOKS
  • CORECLR_ENABLE_PROFILING=1
  • CORECLR_PROFILER
  • CORECLR_PROFILER_PATH

If any of these are missing:

  • The annotation is wrong
  • The Instrumentation CR is missing
  • The Operator webhook failed to mutate the pod
  • The workload was deployed before the Operator (sync wave ordering issue)
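For reference, injection is triggered by an annotation on the pod template; a sketch with an illustrative workload name and image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api                 # illustrative workload
  namespace: apps
spec:
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
      annotations:
        # tells the Operator webhook to inject the .NET auto-instrumentation
        instrumentation.opentelemetry.io/inject-dotnet: "true"
    spec:
      containers:
        - name: checkout-api
          image: ghcr.io/example/checkout-api:latest   # illustrative image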

2. “Traces appear, but they’re incomplete.”

Common causes:

  • Missing propagation headers
  • Reverse proxies stripping traceparent
  • Sampling too aggressive
  • Instrumentation library not loaded

For .NET, ensure:

  • OTEL_PROPAGATORS=tracecontext,baggage
  • No middleware overwrites headers

3. “Collector is dropping spans.”

Check Collector logs:

kubectl logs deploy/jaeger-inmemory-instance-collector -n monitoring

Look for:

  • batch processor timeout
  • queue full
  • exporter failed

Fixes:

  • Increase batch processor size
  • Increase memory limits
  • Add more Collector replicas
  • Use a more performant storage backend
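The relevant knobs live in the Collector config; a sketch of the kind of tuning involved (values and the exporter endpoint are illustrative):

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80          # apply back-pressure before the pod is OOM-killed
    spike_limit_percentage: 20
  batch:
    send_batch_size: 2048         # larger batches mean fewer exporter round-trips
    timeout: 5s
exporters:
  otlp:
    endpoint: tempo.monitoring.svc.cluster.local:4317   # illustrative backend
    sending_queue:
      enabled: true
      queue_size: 5000            # buffer bursts instead of dropping spans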

📈 Scaling the Collector

The OpenTelemetry Collector is extremely flexible, but scaling it requires understanding its architecture.

Horizontal scaling

You can run multiple Collector replicas behind a Service. This works well when:

  • Apps send OTLP over gRPC (load‑balanced)
  • You use stateless exporters (e.g., Tempo, OTLP → another Collector)
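With the Operator in place, horizontal scaling is a small change to the OpenTelemetryCollector resource; a sketch (the API version depends on your Operator release):

apiVersion: opentelemetry.io/v1beta1   # v1alpha1 on older Operator releases
kind: OpenTelemetryCollector
metadata:
  name: jaeger-inmemory-instance
  namespace: monitoring
spec:
  mode: deployment   # replicas behind the Service share the OTLP load
  replicas: 3        # note: with the in-memory demo storage each replica keeps its own
                     # traces, so move to a shared backend before scaling out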

Vertical scaling

Increase CPU/memory when:

  • You use heavy processors (tail sampling, attributes filtering)
  • You export to slower backends (Elasticsearch, Cassandra)

Pipeline separation

For large systems, split pipelines:

  • Gateway Collectors – Receive traffic from apps
  • Aggregation Collectors – Apply sampling, filtering
  • Export Collectors – Write to storage

This isolates concerns and improves reliability.

🗄️ Choosing a Production Storage Backend

The demo uses memstore, which is perfect for local testing but not for production. Real deployments typically use:

1. Tempo (Grafana)

  • Highly scalable
  • Cheap object storage
  • Great for high‑volume traces
  • No indexing required

2. Elasticsearch

  • Mature
  • Powerful search
  • Higher operational cost

3. ClickHouse (via Jaeger or SigNoz)

  • Extremely fast
  • Efficient storage
  • Great for long retention

4. Cassandra

  • Historically used by Jaeger v1
  • Still supported
  • Operationally heavy

For most modern setups, Tempo or ClickHouse are the best choices.

🔐 Security Considerations

1. TLS everywhere

Enable TLS for:

  • OTLP ingestion
  • Collector → backend communication
  • Jaeger UI

2. mTLS for workloads

The Collector supports mTLS for OTLP:

  • Prevents spoofed telemetry
  • Ensures only trusted workloads send data

3. Network policies

Lock down:

  • Collector ports
  • Storage backend
  • Jaeger UI
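A sketch of a NetworkPolicy that only lets workloads in the apps namespace reach the Collector's OTLP ports (the pod label selector is illustrative and depends on how the Operator labels your Collector pods):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-from-apps
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector   # illustrative label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: apps
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318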

4. Secrets management

Use:

  • Kubernetes Secrets (encrypted at rest)
  • External secret stores (Vault, SSM, Azure Key Vault)

Never hardcode credentials in Collector configs.

🧪 Sampling Strategies

Sampling is one of the most misunderstood parts of tracing. The wrong sampling strategy can make your traces useless.

Head sampling (default)

  • Simple
  • Fast
  • Drops spans early
  • Good for high‑volume systems

Tail sampling

  • Makes decisions after seeing the full trace
  • Better for error‑focused sampling
  • More expensive
  • Requires dedicated Collector pipelines

Adaptive sampling

  • Dynamically adjusts sampling rates
  • Useful for spiky workloads

Best practice

Start with head sampling, then introduce tail sampling only if you need it.
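If you do introduce it, tail sampling is configured as a Collector processor in a dedicated pipeline (it ships with the contrib Collector distribution); a sketch that keeps every error trace plus a slice of everything else, with illustrative policy names and percentages:

processors:
  tail_sampling:
    decision_wait: 10s            # how long to hold spans while waiting for the full trace
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10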

🌐 Multi‑Cluster and Multi‑Environment Patterns

As your platform grows, you may need:

1. Per‑cluster Collectors, shared backend

Each cluster runs its own Collector, exporting to a central storage backend.

2. Centralized Collector fleet

Apps send OTLP to a global Collector layer.

3. GitOps per environment

Structure your repo like:

environments/
  dev/
  staging/
  prod/

This series includes only dev, but the structure supports adding staging and prod easily.

Each environment can have:

  • Different sampling
  • Different storage
  • Different Collector pipelines
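One way to wire this up is an ApplicationSet whose git generator discovers the environment directories; a sketch, not the manifest from the companion repo, with an illustrative repository URL:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: observability-environments
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/observability-gitops   # illustrative repo
        revision: main
        directories:
          - path: environments/*
  template:
    metadata:
      name: 'observability-{{path.basename}}'   # observability-dev, observability-staging, ...
    spec:
      project: default
      source:
        repoURL: https://github.com/example/observability-gitops
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: monitoring
      syncPolicy:
        automated:
          prune: true
          selfHeal: true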

🧭 Final Thoughts

Jaeger v2, OpenTelemetry, and GitOps form a powerful, modern observability stack. Across this series, you’ve built:

  • A Jaeger v2 deployment using the OpenTelemetry Collector
  • A .NET application emitting traces with zero code changes
  • A GitOps workflow that keeps everything declarative and self‑healing
  • A production‑ready understanding of scaling, troubleshooting, and hardening

This is the kind of architecture that scales with your platform, not against it. It’s simple where it should be simple, flexible where it needs to be flexible, and grounded in open standards.