Architecture, Observability, Platform Engineering

Part 5 – Troubleshooting, Scaling, and Production Hardening

By this point in the series, you have a fully working, multi‑environment observability pipeline, deployed and reconciled entirely through GitOps:

  • Jaeger v2 running on the OpenTelemetry Collector
  • Applications emitting traces automatically via the OpenTelemetry Operator
  • Environment‑scoped Collectors and Instrumentation CRs
  • Argo CD managing everything through ApplicationSets and sync waves

This final part focuses on what matters most in real‑world environments: operability. Deploying Jaeger v2 is easy. Running it reliably at scale with predictable performance, clear failure modes, and secure communication is where engineering judgment comes in.

This guide covers the most important lessons learned from operating OpenTelemetry and Jaeger in production.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🩺 Troubleshooting: The Most Common Issues (and How to Fix Them)

1. “I don’t see any traces.”

This is the most common issue, and it almost always comes down to one of three things:

a. Wrong OTLP endpoint

Check the app’s environment variables (injected by the Operator):

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT

If these are missing, your Instrumentation CR is not configured with an exporter.

A protocol mismatch also matters: the .NET auto‑instrumentation agent exports OTLP over HTTP/protobuf by default, so your Instrumentation CR's exporter endpoint should target port 4318, not the gRPC port 4317.
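As a sketch, here is what the exporter section of the Instrumentation CR looks like when pointed at this series' Collector Service (the Service and namespace names come from Part 2; adjust them to your cluster):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-dotnet
  namespace: apps
spec:
  exporter:
    # .NET auto-instrumentation speaks OTLP over HTTP/protobuf, hence port 4318
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
```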

b. Collector not listening on the expected ports

Verify:

kubectl get svc jaeger-inmemory-instance-collector -n monitoring

In this architecture, the Collector runs in the monitoring namespace, while instrumented workloads run in the apps namespace.

You must see:

  • 4318 (OTLP HTTP)
  • 4317 (OTLP gRPC)

c. Auto‑instrumentation not activated

For .NET, the Operator must inject:

  • DOTNET_STARTUP_HOOKS
  • CORECLR_ENABLE_PROFILING=1
  • CORECLR_PROFILER
  • CORECLR_PROFILER_PATH

If any of these are missing:

  • The annotation is wrong
  • The Instrumentation CR is missing
  • The Operator webhook failed to mutate the pod
  • The workload was deployed before the Operator (sync wave ordering issue)

2. “Traces appear, but they’re incomplete.”

Common causes:

  • Missing propagation headers
  • Reverse proxies stripping traceparent
  • Sampling too aggressive
  • Instrumentation library not loaded

For .NET, ensure:

  • OTEL_PROPAGATORS=tracecontext,baggage
  • No middleware overwrites headers

3. “Collector is dropping spans.”

Check Collector logs:

kubectl logs deploy/jaeger-inmemory-instance-collector -n monitoring

Look for:

  • batch processor timeout
  • queue full
  • exporter failed

Fixes:

  • Increase batch processor size
  • Increase memory limits
  • Add more Collector replicas
  • Use a more performant storage backend
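The first two fixes translate into processor settings in the Collector config. A sketch with illustrative starting values (not universal recommendations — tune against your actual span volume):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # soft limit as a share of container memory
    spike_limit_percentage: 20
  batch:
    send_batch_size: 8192       # larger batches reduce exporter round-trips
    timeout: 5s
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter should run first
      exporters: [otlp]
```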

📈 Scaling the Collector

The OpenTelemetry Collector is extremely flexible, but scaling it requires understanding its architecture.

Horizontal scaling

You can run multiple Collector replicas behind a Service. This works well when:

  • Apps send OTLP over HTTP (each request is load‑balanced independently; long‑lived gRPC connections can pin to a single replica unless you add client‑side or L7 load balancing)
  • You use stateless exporters (e.g., Tempo, OTLP → another Collector)

Vertical scaling

Increase CPU/memory when:

  • You use heavy processors (tail sampling, attributes filtering)
  • You export to slower backends (Elasticsearch, Cassandra)

Pipeline separation

For large systems, split pipelines:

  • Gateway Collectors – Receive traffic from apps
  • Aggregation Collectors – Apply sampling, filtering
  • Export Collectors – Write to storage

This isolates concerns and improves reliability.
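A gateway‑tier Collector in this pattern does little more than receive and forward. A minimal sketch (the aggregation‑tier Service name is hypothetical; TLS is disabled here only for brevity):

```yaml
# Gateway tier: receive from apps, forward downstream without heavy processing
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp:
    # hypothetical Service name for the aggregation tier
    endpoint: aggregation-collector.monitoring.svc.cluster.local:4317
    tls:
      insecure: true   # enable TLS between tiers in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```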

🗄️ Choosing a Production Storage Backend

The demo uses memstore, which is perfect for local testing but not for production. Real deployments typically use:

1. Tempo (Grafana)

  • Highly scalable
  • Cheap object storage
  • Great for high‑volume traces
  • No indexing required

2. Elasticsearch

  • Mature
  • Powerful search
  • Higher operational cost

3. ClickHouse (via Jaeger or SigNoz)

  • Extremely fast
  • Efficient storage
  • Great for long retention

4. Cassandra

  • Historically used by Jaeger v1
  • Still supported
  • Operationally heavy

For most modern setups, Tempo or ClickHouse are the best choices.

🔐 Security Considerations

1. TLS everywhere

Enable TLS for:

  • OTLP ingestion
  • Collector → backend communication
  • Jaeger UI

2. mTLS for workloads

The Collector supports mTLS for OTLP:

  • Prevents spoofed telemetry
  • Ensures only trusted workloads send data

3. Network policies

Lock down:

  • Collector ports
  • Storage backend
  • Jaeger UI
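A NetworkPolicy that admits OTLP traffic only from the apps namespace might look like this sketch (the pod label is an assumption — check the labels the Operator puts on your Collector pods):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-from-apps
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector  # assumed label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: apps
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
```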

4. Secrets management

Use:

  • Kubernetes Secrets (encrypted at rest)
  • External secret stores (Vault, SSM, Azure Key Vault)

Never hardcode credentials in Collector configs.
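The Collector supports `${env:VAR}` substitution in its config, so credentials can be injected from a Kubernetes Secret via an environment variable instead of being committed to Git. A sketch (the backend URL, header, and variable name are illustrative):

```yaml
exporters:
  otlphttp:
    endpoint: https://tempo.example.com:4318   # illustrative backend URL
    headers:
      # resolved at startup from an env var sourced from a Kubernetes Secret
      Authorization: "Bearer ${env:TEMPO_API_TOKEN}"
```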

🧪 Sampling Strategies

Sampling is one of the most misunderstood parts of tracing. The wrong sampling strategy can make your traces useless.

Head sampling (default)

  • Simple
  • Fast
  • Drops spans early
  • Good for high‑volume systems

Tail sampling

  • Makes decisions after seeing the full trace
  • Better for error‑focused sampling
  • More expensive
  • Requires dedicated Collector pipelines

Adaptive sampling

  • Dynamically adjusts sampling rates
  • Useful for spiky workloads

Best practice

Start with head sampling, then introduce tail sampling only if you need it.
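When you do reach for tail sampling, the Collector's tail_sampling processor makes per‑trace decisions after buffering. A sketch that keeps all error traces and 10% of the rest (values are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the trace is (likely) complete
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR] # always keep traces containing an error span
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```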

🌐 Multi‑Cluster and Multi‑Environment Patterns

As your platform grows, you may need:

1. Per‑cluster Collectors, shared backend

Each cluster runs its own Collector, exporting to a central storage backend.

2. Centralized Collector fleet

Apps send OTLP to a global Collector layer.

3. GitOps per environment

Structure your repo like:

environments/
  dev/
  staging/
  prod/

This series includes only dev, but the structure supports adding staging and prod easily.

Each environment can have:

  • Different sampling
  • Different storage
  • Different Collector pipelines
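One way to realize these per‑environment differences is a Kustomize overlay per environment that patches a shared base; a sketch of a hypothetical prod overlay (file names are illustrative):

```yaml
# environments/prod/kustomization.yaml (hypothetical overlay)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # e.g. a strategic-merge patch lowering sampling_percentage for prod
  - path: collector-sampling-patch.yaml
```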

🧭 Final Thoughts

Jaeger v2, OpenTelemetry, and GitOps form a powerful, modern observability stack. Across this series, you’ve built:

  • A Jaeger v2 deployment using the OpenTelemetry Collector
  • A .NET application emitting traces with zero code changes
  • A GitOps workflow that keeps everything declarative and self‑healing
  • A production‑ready understanding of scaling, troubleshooting, and hardening

This is the kind of architecture that scales with your platform, not against it. It’s simple where it should be simple, flexible where it needs to be flexible, and grounded in open standards.

.NET, Observability, OpenTelemetry

Part 3 – Auto‑Instrumenting .NET with OpenTelemetry

In Part 2, we deployed Jaeger v2 using the OpenTelemetry Collector and exposed the Jaeger UI. Now it’s time to generate real traces without modifying application code or rebuilding container images.

This part shows how to use the OpenTelemetry Operator to inject the .NET auto‑instrumentation agent automatically. This approach is fully declarative, GitOps‑friendly, and ideal for platform teams who want consistent instrumentation across many services.

All manifests, ApplicationSets, code, and configuration used in this series are available in the companion GitHub repository.

🧠 How Operator‑Managed .NET Auto‑Instrumentation Works

The OpenTelemetry Operator can automatically:

  • Inject the .NET auto‑instrumentation agent into your pod
  • Mount the agent files
  • Set all required environment variables
  • Configure OTLP exporters
  • Apply propagators
  • Ensure consistent agent versions across workloads

This means:

  • No Dockerfile changes
  • No manual environment variables
  • No code changes
  • No per‑service configuration drift

Instrumentation becomes a cluster‑level concern, not an application‑level burden.

📦 Defining the .NET Instrumentation Resource

To enable .NET auto‑instrumentation, create an Instrumentation CR:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-dotnet
  namespace: apps
spec:
  exporter:
    # .NET auto-instrumentation speaks OTLP over HTTP/protobuf, hence port 4318
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
  dotnet:
    # pin a specific version in production instead of :latest
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:latest

This tells the Operator:

  • Manage the lifecycle of the agent declaratively
  • Use the official .NET auto‑instrumentation agent
  • Inject it into workloads in this namespace (or those that opt‑in)

Commit this file to Git and let Argo CD sync it.

🏗️ Instrumenting a .NET Application (No Image Changes Required)

To instrument a .NET application, you simply annotate the Deployment:

metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-dotnet: "true"

That’s it.

The Operator will:

  • Inject the agent
  • Mount the instrumentation files
  • Set all required environment variables
  • Configure the OTLP exporter
  • Enrich traces with Kubernetes metadata

Your Deployment YAML stays clean and simple.

📁 Example .NET Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-demo-dotnet
  namespace: apps
  annotations:
    instrumentation.opentelemetry.io/inject-dotnet: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dev-demo-dotnet
  template:
    metadata:
      labels:
        app: dev-demo-dotnet
    spec:
      containers:
        - name: dev-demo-dotnet
          image: demo-dotnet:latest
          ports:
            - containerPort: 8080

Notice what’s missing:

  • No agent download
  • No Dockerfile changes
  • No environment variables
  • No profiler configuration

The Operator handles everything.

🔬 What the Operator Injects (Real Example)

Here is a simplified version of the actual mutated pod from your cluster. This shows exactly what the Operator adds:

initContainers:
  - name: opentelemetry-auto-instrumentation-dotnet
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:latest
    command: ["cp", "-r", "/autoinstrumentation/.", "/otel-auto-instrumentation-dotnet"]

Injected environment variables

env:
  - name: CORECLR_ENABLE_PROFILING
    value: "1"
  - name: CORECLR_PROFILER
    value: "{918728DD-259F-4A6A-AC2B-B85E1B658318}"
  - name: CORECLR_PROFILER_PATH
    value: /otel-auto-instrumentation-dotnet/linux-x64/OpenTelemetry.AutoInstrumentation.Native.so
  - name: DOTNET_STARTUP_HOOKS
    value: /otel-auto-instrumentation-dotnet/net/OpenTelemetry.AutoInstrumentation.StartupHook.dll
  - name: DOTNET_ADDITIONAL_DEPS
    value: /otel-auto-instrumentation-dotnet/AdditionalDeps
  - name: DOTNET_SHARED_STORE
    value: /otel-auto-instrumentation-dotnet/store
  - name: OTEL_DOTNET_AUTO_HOME
    value: /otel-auto-instrumentation-dotnet
  - name: OTEL_SERVICE_NAME
    value: dev-demo-dotnet
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318

Kubernetes metadata enrichment

- name: OTEL_RESOURCE_ATTRIBUTES
  value: k8s.container.name=dev-demo-dotnet,...

Volume for instrumentation files

volumes:
  - name: opentelemetry-auto-instrumentation-dotnet
    emptyDir:
      sizeLimit: 200Mi

This is the Operator doing exactly what it was designed to do:
injecting a complete, production‑grade instrumentation layer without touching your application code.

🚀 Deploying the Instrumented App

Once the Instrumentation CR and Deployment are committed:

  1. Argo CD syncs the changes
  2. The Operator mutates the pod
  3. The .NET agent is injected
  4. The app begins emitting OTLP traces

Check the pod:

kubectl get pods -n apps

You’ll see:

  • An init container
  • A mounted instrumentation volume
  • Injected environment variables

🔍 Verifying That Traces Are Flowing

1. Port‑forward the Jaeger UI

kubectl -n monitoring port-forward svc/jaeger-inmemory-instance-collector 16686:16686

Open:

http://localhost:16686

2. Generate traffic

kubectl -n apps port-forward svc/dev-demo-dotnet 8080:8080
curl http://localhost:8080/

3. Check the Jaeger UI

You should now see:

  • Service: dev-demo-dotnet
  • HTTP server spans
  • Outgoing calls (if any)
  • Full trace graphs

If you see traces, the Operator‑managed pipeline is working end‑to‑end.

🧪 Troubleshooting Common Issues

No traces appear

  • Ensure the Deployment has the annotation
  • Ensure the Instrumentation CR is in the same namespace
  • Check Operator logs for mutation errors
  • Verify the Collector’s OTLP ports (4317/4318)

App restarts repeatedly

  • The Operator may be injecting into a non‑.NET container
  • Ensure your image is .NET 8+

Traces appear but missing context

  • The Operator sets tracecontext,baggage automatically
  • Ensure no middleware strips headers

🧭 What’s Next

With Jaeger v2 deployed and .NET auto‑instrumentation fully automated, you now have a working observability pipeline that requires:

  • No code changes
  • No image modifications
  • No per‑service configuration

In Part 4, we’ll take this setup and make it fully declarative using Argo CD:

  • Repo structure
  • ArgoCD Applications
  • Sync strategies
  • Drift correction
  • Multi‑component GitOps workflows

This is where the system becomes operationally robust.