Architecture, Observability, Platform Engineering

Part 5 – Troubleshooting, Scaling, and Production Hardening

By this point in the series, you have a fully working, multi‑environment observability pipeline, deployed and reconciled entirely through GitOps:

  • Jaeger v2 running on the OpenTelemetry Collector
  • Applications emitting traces automatically via the OpenTelemetry Operator
  • Environment‑scoped Collectors and Instrumentation CRs
  • Argo CD managing everything through ApplicationSets and sync waves

This final part focuses on what matters most in real‑world environments: operability. Deploying Jaeger v2 is easy. Running it reliably at scale with predictable performance, clear failure modes, and secure communication is where engineering judgment comes in.

This guide covers the most important lessons learned from operating OpenTelemetry and Jaeger in production.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🩺 Troubleshooting: The Most Common Issues (and How to Fix Them)

1. “I don’t see any traces.”

This is the most common issue, and it almost always comes down to one of three things:

a. Wrong OTLP endpoint

Check the app’s environment variables (injected by the Operator):

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT

If these are missing, your Instrumentation CR is not configured with an exporter.

Protocol mismatch also matters. The .NET auto‑instrumentation exports OTLP over HTTP/protobuf by default, so your Instrumentation CR should point to port 4318, not the gRPC port 4317.
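A minimal sketch of such an Instrumentation CR, assuming the Collector Service used in this series (names and namespace may differ in your repo):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation   # hypothetical name
  namespace: apps
spec:
  exporter:
    # .NET auto-instrumentation speaks OTLP over HTTP by default, hence port 4318
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
```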

b. Collector not listening on the expected ports

Verify:

kubectl get svc jaeger-inmemory-instance-collector -n monitoring

In this architecture, the Collector runs in the monitoring namespace while instrumented workloads run in the apps namespace, so endpoints must use the Collector's cross‑namespace Service DNS name (for example, jaeger-inmemory-instance-collector.monitoring.svc.cluster.local).

You must see:

  • 4318 (OTLP HTTP)
  • 4317 (OTLP gRPC)
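Those ports come straight from the Collector's OTLP receiver configuration. A typical receivers block looks like this (a sketch; your manifest may differ):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP
```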

c. Auto‑instrumentation not activated

For .NET, the Operator must inject:

  • DOTNET_STARTUP_HOOKS
  • CORECLR_ENABLE_PROFILING=1
  • CORECLR_PROFILER
  • CORECLR_PROFILER_PATH

If any of these are missing:

  • The annotation is wrong
  • The Instrumentation CR is missing
  • The Operator webhook failed to mutate the pod
  • The workload was deployed before the Operator (sync wave ordering issue)
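To see which of these variables actually reached the container, inspect a running pod directly (the deployment name here is hypothetical):

```shell
# List the injected instrumentation-related environment variables
kubectl exec -n apps deploy/my-dotnet-app -- env | grep -E 'CORECLR|DOTNET_STARTUP|OTEL_'
```

If the output is empty, the Operator webhook never mutated the pod, which points to one of the causes above.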

2. “Traces appear, but they’re incomplete.”

Common causes:

  • Missing propagation headers
  • Reverse proxies stripping traceparent
  • Sampling too aggressive
  • Instrumentation library not loaded

For .NET, ensure:

  • OTEL_PROPAGATORS=tracecontext,baggage
  • No middleware overwrites headers

3. “Collector is dropping spans.”

Check Collector logs:

kubectl logs deploy/jaeger-inmemory-instance-collector -n monitoring

Look for:

  • batch processor timeout
  • queue full
  • exporter failed

Fixes:

  • Increase batch processor size
  • Increase memory limits
  • Add more Collector replicas
  • Use a more performant storage backend
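For example, pairing a memory_limiter with a larger batch processor is a common first step. The values below are starting points, not recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # refuse data above 80% of the memory limit
    spike_limit_percentage: 20
  batch:
    send_batch_size: 2048       # larger batches reduce exporter pressure
    timeout: 5s
```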

📈 Scaling the Collector

The OpenTelemetry Collector is extremely flexible, but scaling it requires understanding its architecture.

Horizontal scaling

You can run multiple Collector replicas behind a Service. This works well when:

  • Apps send OTLP over gRPC (load‑balanced)
  • You use stateless exporters (e.g., Tempo, OTLP → another Collector)

Vertical scaling

Increase CPU/memory when:

  • You use heavy processors (tail sampling, attributes filtering)
  • You export to slower backends (Elasticsearch, Cassandra)

Pipeline separation

For large systems, split pipelines:

  • Gateway Collectors – Receive traffic from apps
  • Aggregation Collectors – Apply sampling, filtering
  • Export Collectors – Write to storage

This isolates concerns and improves reliability.
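A sketch of the gateway tier: it does minimal processing and forwards raw spans downstream over OTLP. The downstream Service name is hypothetical:

```yaml
# Gateway tier: receive from apps, forward raw spans to the aggregation tier
exporters:
  otlp:
    endpoint: aggregation-collector.monitoring.svc.cluster.local:4317  # hypothetical Service
    tls:
      insecure: true   # dev only; enable TLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```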

🗄️ Choosing a Production Storage Backend

The demo uses memstore, which is perfect for local testing but not for production. Real deployments typically use:

1. Tempo (Grafana)

  • Highly scalable
  • Cheap object storage
  • Great for high‑volume traces
  • No indexing required

2. Elasticsearch

  • Mature
  • Powerful search
  • Higher operational cost

3. ClickHouse (via Jaeger or SigNoz)

  • Extremely fast
  • Efficient storage
  • Great for long retention

4. Cassandra

  • Historically used by Jaeger v1
  • Still supported
  • Operationally heavy

For most modern setups, Tempo or ClickHouse are the best choices.
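As an illustration, swapping memstore for Tempo mostly means pointing an OTLP exporter at Tempo's distributor. The endpoint below is a placeholder for your own Tempo install:

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.tempo.svc.cluster.local:4317  # hypothetical
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```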

🔐 Security Considerations

1. TLS everywhere

Enable TLS for:

  • OTLP ingestion
  • Collector → backend communication
  • Jaeger UI

2. mTLS for workloads

The Collector supports mTLS for OTLP:

  • Prevents spoofed telemetry
  • Ensures only trusted workloads send data
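Both concerns are handled in the OTLP receiver's TLS settings. A sketch, assuming certificates are mounted at /certs (paths and filenames are illustrative); setting client_ca_file is what turns plain TLS into mTLS:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /certs/server.crt
          key_file: /certs/server.key
          client_ca_file: /certs/ca.crt  # require client certs => mTLS
```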

3. Network policies

Lock down:

  • Collector ports
  • Storage backend
  • Jaeger UI

4. Secrets management

Use:

  • Kubernetes Secrets (encrypted at rest)
  • External secret stores (Vault, SSM, Azure Key Vault)

Never hardcode credentials in Collector configs.

🧪 Sampling Strategies

Sampling is one of the most misunderstood parts of tracing. The wrong sampling strategy can make your traces useless.

Head sampling (default)

  • Simple
  • Fast
  • Drops spans early
  • Good for high‑volume systems

Tail sampling

  • Makes decisions after seeing the full trace
  • Better for error‑focused sampling
  • More expensive
  • Requires dedicated Collector pipelines
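A minimal tail-sampling sketch that keeps every error trace plus a 10% baseline (policy names and percentages are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # wait this long for a trace to complete
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```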

Adaptive sampling

  • Dynamically adjusts sampling rates
  • Useful for spiky workloads

Best practice

Start with head sampling, then introduce tail sampling only if you need it.

🌐 Multi‑Cluster and Multi‑Environment Patterns

As your platform grows, you may need:

1. Per‑cluster Collectors, shared backend

Each cluster runs its own Collector, exporting to a central storage backend.

2. Centralized Collector fleet

Apps send OTLP to a global Collector layer.

3. GitOps per environment

Structure your repo like:

environments/
  dev/
  staging/
  prod/

This series includes only dev, but the structure supports adding staging and prod easily.

Each environment can have:

  • Different sampling
  • Different storage
  • Different Collector pipelines
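One way to wire this up is an Argo CD ApplicationSet with a Git directory generator, producing one Application per environment directory. The repository URL is a placeholder:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: observability
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/observability-gitops  # hypothetical repo
        revision: main
        directories:
          - path: environments/*
  template:
    metadata:
      name: 'observability-{{path.basename}}'   # observability-dev, observability-prod, ...
    spec:
      project: default
      source:
        repoURL: https://github.com/example/observability-gitops
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: monitoring
```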

🧭 Final Thoughts

Jaeger v2, OpenTelemetry, and GitOps form a powerful, modern observability stack. Across this series, you’ve built:

  • A Jaeger v2 deployment using the OpenTelemetry Collector
  • A .NET application emitting traces with zero code changes
  • A GitOps workflow that keeps everything declarative and self‑healing
  • A production‑ready understanding of scaling, troubleshooting, and hardening

This is the kind of architecture that scales with your platform, not against it. It’s simple where it should be simple, flexible where it needs to be flexible, and grounded in open standards.

Architecture, Cloud Native, Observability

Part 1 – Why Jaeger v2, OpenTelemetry, and GitOps Belong Together

Modern distributed systems generate a staggering amount of telemetry. Logs, metrics, and traces flow from dozens or hundreds of independently deployed services. Teams want deep visibility without drowning in operational overhead. They want consistency without slowing down delivery. And they want observability that scales with the system, not against it.

This is where Jaeger v2, OpenTelemetry, and GitOps converge into a clean, modern, future‑proof model.

This series walks through a complete, working setup that combines:

  • Jaeger v2, built on the OpenTelemetry Collector
  • OpenTelemetry auto‑instrumentation, with a focus on .NET
  • Argo CD, managing everything declaratively through GitOps
  • A multi‑environment architecture, with dev/staging/prod deployed through ApplicationSets

Before we dive into YAML, pipelines, and instrumentation, it’s worth understanding why these technologies fit together so naturally and why they represent the future of platform‑level observability.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🧭 The Shift to Jaeger v2: Collector‑First Observability

Jaeger v1 was built around a bespoke architecture: agents, collectors, query services, and storage backends. It worked well for its time, but it wasn’t aligned with the industry’s move toward OpenTelemetry as the standard for telemetry data.

Jaeger v2 changes that.

What’s new in Jaeger v2

  • Built on the OpenTelemetry Collector
  • Accepts OTLP as the ingestion protocol
  • Consolidates components into a simpler deployment
  • Integrates Jaeger’s query and UI directly into the Collector
  • Aligns with the OpenTelemetry ecosystem instead of maintaining parallel infrastructure

In practice, Jaeger v2 is no longer a standalone tracing pipeline.
It is a distribution of the OpenTelemetry Collector, with Jaeger’s query and UI components integrated into the same deployment.

This reduces operational complexity and brings tracing into the same ecosystem as metrics and logs, all flowing through the same Collector pipeline.

🌐 OpenTelemetry: The Universal Instrumentation Layer

OpenTelemetry has become the de facto standard for collecting telemetry across languages and platforms. Instead of maintaining language‑specific SDKs, exporters, and agents, teams can rely on a unified model:

  • One protocol (OTLP)
  • One collector pipeline
  • One set of instrumentation libraries
  • One ecosystem of processors, exporters, and extensions

For application teams, this means:

  • Less vendor lock‑in
  • Less custom instrumentation
  • More consistency across services

For platform teams, it means:

  • A single collector pipeline to operate
  • A single place to apply sampling, filtering, and routing
  • A consistent deployment model across environments

And with the OpenTelemetry Operator, you can enable auto‑instrumentation, especially for languages like .NET, without touching application code. The Operator injects the right environment variables, startup hooks, and exporters automatically.
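Opting a workload in is then a single pod annotation. In this sketch, the value "true" resolves to the namespace's sole Instrumentation CR; you can also reference one by name:

```yaml
# Pod template of a hypothetical Deployment: opt in to .NET auto-instrumentation
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-dotnet: "true"
```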

🚀 Why GitOps (Argo CD) Completes the Picture

Observability components are critical infrastructure. They need to be:

  • Versioned
  • Auditable
  • Reproducible
  • Consistent across environments

GitOps provides exactly that.

With Argo CD:

  • The Collector configuration lives in Git
  • The Instrumentation settings live in Git
  • The Jaeger UI and supporting components live in Git
  • The applications live in Git
  • The environment‑specific overrides live in Git

Argo CD continuously ensures that the cluster matches what’s declared in the repository. If someone changes a Collector config manually, Argo CD corrects it. If a deployment drifts, Argo CD heals it. If you want to roll out a new sampling policy, you commit a change and let Argo CD sync it.

Git becomes the single source of truth for your entire observability stack.

🏗️ How These Pieces Fit Together

Here’s the high‑level architecture this series will build:

  • OpenTelemetry Collector (Jaeger v2)
    • Receives OTLP traffic
    • Processes and exports traces
    • Hosts the Jaeger v2 query and UI components
  • Applications
    • Auto‑instrumented using OpenTelemetry agents
    • Emit traces to the Collector via OTLP
  • Argo CD
    • Watches the Git repository
    • Applies Collector, Instrumentation, and app manifests
    • Uses ApplicationSets to generate per‑environment deployments
    • Enforces ordering with sync waves
    • Ensures everything stays in sync

This architecture is intentionally simple. It’s designed to be:

  • Easy to deploy
  • Easy to understand
  • Easy to extend into production patterns
  • Easy to scale across environments and clusters

🎯 What You’ll Learn in This Series

Over the next four parts, we’ll walk through:

Part 2 – Deploying Jaeger v2 with the OpenTelemetry Collector

A working Collector configuration, including receivers, processors, exporters, and the Jaeger UI.

Part 3 – Auto‑instrumenting .NET with OpenTelemetry

How to enable tracing in a .NET application without modifying code, using the OpenTelemetry .NET auto‑instrumentation agent.

Part 4 – Managing Everything with Argo CD (GitOps)

How to structure your repo, define Argo CD Applications, and sync the entire observability stack declaratively.

Part 5 – Troubleshooting, Scaling, and Production Hardening

Sampling strategies, storage backends, multi‑cluster patterns, and common pitfalls.

🧩 Why This Matters

Observability is no longer optional. It’s foundational. But the tooling landscape has been fragmented for years. Jaeger v2, OpenTelemetry, and GitOps represent a convergence toward:

  • Standardisation
  • Operational simplicity
  • Developer autonomy
  • Platform consistency

This series is designed to give you a practical, reproducible path to adopting that model, starting with the simplest working setup and building toward production‑ready patterns.

You can find the full configuration for this part — including the Collector manifests and Argo CD setup — in the GitHub repository.