By this point in the series, you have a fully working, multi‑environment observability pipeline, deployed and reconciled entirely through GitOps:
- Jaeger v2 running on the OpenTelemetry Collector
- Applications emitting traces automatically via the OpenTelemetry Operator
- Environment‑scoped Collectors and Instrumentation CRs
- Argo CD managing everything through ApplicationSets and sync waves
This final part focuses on what matters most in real‑world environments: operability. Deploying Jaeger v2 is easy. Running it reliably at scale with predictable performance, clear failure modes, and secure communication is where engineering judgment comes in.
This guide covers the most important lessons learned from operating OpenTelemetry and Jaeger in production.
All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.
🩺 Troubleshooting: The Most Common Issues (and How to Fix Them)
1. “I don’t see any traces.”
This is the most common issue, and it almost always comes down to one of three things:
a. Wrong OTLP endpoint
Check the app’s environment variables (injected by the Operator):
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
If these are missing, your Instrumentation CR is not configured with an exporter.
Protocol mismatch also matters: .NET auto‑instrumentation exports OTLP over HTTP (http/protobuf) by default, so your Instrumentation CR should point to port 4318 (OTLP HTTP), not 4317 (gRPC).
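For reference, a minimal Instrumentation CR sketch with the exporter and propagators set explicitly. The endpoint assumes the Collector Service and namespaces used throughout this series, and the CR name dotnet-instrumentation is illustrative; adjust both to your setup:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation   # illustrative name
  namespace: apps
spec:
  exporter:
    # .NET auto-instrumentation speaks OTLP over HTTP, hence port 4318
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
```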
b. Collector not listening on the expected ports
Verify:
kubectl get svc jaeger-inmemory-instance-collector -n monitoring
In this architecture, the Collector runs in the monitoring namespace while instrumented workloads run in the apps namespace, so exporter endpoints must use the cross‑namespace service DNS name (jaeger-inmemory-instance-collector.monitoring.svc.cluster.local).
You must see:
- 4318 (OTLP HTTP)
- 4317 (OTLP gRPC)
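Those ports come straight from the OTLP receiver block in the Collector configuration; a minimal excerpt of what it should contain:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP
```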
c. Auto‑instrumentation not activated
For .NET, the Operator must inject:
- DOTNET_STARTUP_HOOKS
- CORECLR_ENABLE_PROFILING=1
- CORECLR_PROFILER
- CORECLR_PROFILER_PATH
If any of these are missing:
- The annotation is wrong (see the example below)
- The Instrumentation CR is missing
- The Operator webhook failed to mutate the pod
- The workload was deployed before the Operator (sync wave ordering issue)
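If the annotation is the culprit, this is the shape it should take on the workload's pod template. The value references the Instrumentation CR by name, so the dotnet-instrumentation value here is illustrative and must match whatever your CR is actually called:

```yaml
spec:
  template:
    metadata:
      annotations:
        # value is the name (or namespace/name) of the Instrumentation CR
        instrumentation.opentelemetry.io/inject-dotnet: "dotnet-instrumentation"
```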
2. “Traces appear, but they’re incomplete.”
Common causes:
- Missing propagation headers
- Reverse proxies stripping traceparent
- Sampling too aggressive
- Instrumentation library not loaded
For .NET, ensure:
- OTEL_PROPAGATORS=tracecontext,baggage
- No middleware overwrites headers
3. “Collector is dropping spans.”
Check Collector logs:
kubectl logs deploy/jaeger-inmemory-instance-collector -n monitoring
Look for:
- batch processor timeout
- queue full
- exporter failed
Fixes:
- Increase the batch processor's batch size (see the sketch after this list)
- Increase memory limits
- Add more Collector replicas
- Use a more performant storage backend
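The first two fixes map directly onto the batch and memory_limiter processors. A hedged sketch of that part of the Collector config; the numbers are illustrative starting points, not benchmarked recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # soft cap relative to the container memory limit
    spike_limit_percentage: 20
  batch:
    send_batch_size: 2048       # larger batches reduce per-export overhead
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter should run first
      exporters: [otlp]
```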
📈 Scaling the Collector
The OpenTelemetry Collector is extremely flexible, but scaling it requires understanding its architecture.
Horizontal scaling
You can run multiple Collector replicas behind a Service. This works well when:
- Apps send OTLP over gRPC (load‑balanced)
- You use stateless exporters (e.g., Tempo, OTLP → another Collector)
Vertical scaling
Increase CPU/memory when:
- You use heavy processors (tail sampling, attributes filtering)
- You export to slower backends (Elasticsearch, Cassandra)
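If your Collector is managed through the Operator's OpenTelemetryCollector CR (consistent with the jaeger-inmemory-instance-collector Service name used in this series), both knobs live on that resource. A sketch with illustrative values:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: jaeger-inmemory-instance
  namespace: monitoring
spec:
  mode: deployment
  replicas: 3                  # horizontal scaling
  resources:                   # vertical scaling
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      memory: 1Gi
  # config: ... (the existing Collector pipeline config stays as it is)
```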
Pipeline separation
For large systems, split pipelines:
- Gateway Collectors – Receive traffic from apps
- Aggregation Collectors – Apply sampling, filtering
- Export Collectors – Write to storage
This isolates concerns and improves reliability.
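A sketch of what the gateway tier's config might look like: it only receives OTLP from apps and forwards everything to the next tier. The otel-aggregation Service name is hypothetical:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

exporters:
  otlp:
    endpoint: otel-aggregation.monitoring.svc.cluster.local:4317
    tls:
      insecure: true   # replace with real TLS settings in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```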
🗄️ Choosing a Production Storage Backend
The demo uses memstore, which is perfect for local testing but not for production. Real deployments typically use:
1. Tempo (Grafana)
- Highly scalable
- Cheap object storage
- Great for high‑volume traces
- No indexing required
2. Elasticsearch
- Mature
- Powerful search
- Higher operational cost
3. ClickHouse (via Jaeger or SigNoz)
- Extremely fast
- Efficient storage
- Great for long retention
4. Cassandra
- Historically used by Jaeger v1
- Still supported
- Operationally heavy
For most modern setups, Tempo or ClickHouse are the best choices.
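Swapping the backend is mostly an exporter change. For example, pointing the Collector at a Tempo distributor over OTLP might look like the excerpt below; the Service name and namespace are placeholders:

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.tempo.svc.cluster.local:4317
    tls:
      insecure: true   # enable proper TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```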
🔐 Security Considerations
1. TLS everywhere
Enable TLS for:
- OTLP ingestion
- Collector → backend communication
- Jaeger UI
2. mTLS for workloads
The Collector supports mTLS for OTLP:
- Prevents spoofed telemetry
- Ensures only trusted workloads send data
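On the receiving side, mTLS is configured on the OTLP receiver itself. A sketch assuming certificates are mounted into the Collector pod at /etc/otel/tls; paths and filenames are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/tls/tls.crt
          key_file: /etc/otel/tls/tls.key
          client_ca_file: /etc/otel/tls/ca.crt   # requiring client certs is what enables mTLS
```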
3. Network policies
Lock down:
- Collector ports
- Storage backend
- Jaeger UI
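A minimal NetworkPolicy sketch that only lets pods in the apps namespace reach the Collector's OTLP ports. The pod selector label is what the OpenTelemetry Operator typically applies, but verify it against your Collector pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-otlp-from-apps
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: apps
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318
```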
4. Secrets management
Use:
- Kubernetes Secrets (encrypted at rest)
- External secret stores (Vault, SSM, Azure Key Vault)
Never hardcode credentials in Collector configs.
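The Collector supports ${env:...} substitution, so credentials can come from environment variables populated by a Kubernetes Secret instead of appearing in the config itself. A sketch with an illustrative exporter and variable name:

```yaml
exporters:
  otlphttp:
    endpoint: https://traces.example.internal:4318
    headers:
      # BACKEND_BASIC_AUTH is injected into the Collector pod from a Secret (e.g., via secretKeyRef)
      Authorization: "Basic ${env:BACKEND_BASIC_AUTH}"
```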
🧪 Sampling Strategies
Sampling is one of the most misunderstood parts of tracing. The wrong sampling strategy can make your traces useless.
Head sampling (default)
- Simple
- Fast
- Drops spans early
- Good for high‑volume systems
Tail sampling
- Makes decisions after seeing the full trace
- Better for error‑focused sampling
- More expensive
- Requires dedicated Collector pipelines
Adaptive sampling
- Dynamically adjusts sampling rates
- Useful for spiky workloads
Best practice
Start with head sampling, then introduce tail sampling only if you need it.
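If you do reach for tail sampling, it lives in a dedicated processor. A sketch that keeps every trace containing an error plus a 10% baseline of everything else; policy names and percentages are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s           # how long to buffer spans before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```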
🌐 Multi‑Cluster and Multi‑Environment Patterns
As your platform grows, you may need:
1. Per‑cluster Collectors, shared backend
Each cluster runs its own Collector, exporting to a central storage backend.
2. Centralized Collector fleet
Apps send OTLP to a global Collector layer.
3. GitOps per environment
Structure your repo like:
environments/
dev/
staging/
prod/
This series includes only dev, but the structure supports adding staging and prod easily.
Each environment can have:
- Different sampling
- Different storage
- Different Collector pipelines
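Wiring that layout into Argo CD typically means pointing an ApplicationSet's git directory generator at environments/*. A sketch with placeholder repository details:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: observability-environments
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/your-org/your-observability-repo.git   # placeholder
        revision: main
        directories:
          - path: environments/*
  template:
    metadata:
      name: "observability-{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/your-observability-repo.git   # placeholder
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: monitoring
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```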
🧭 Final Thoughts
Jaeger v2, OpenTelemetry, and GitOps form a powerful, modern observability stack. Across this series, you’ve built:
- A Jaeger v2 deployment using the OpenTelemetry Collector
- A .NET application emitting traces with zero code changes
- A GitOps workflow that keeps everything declarative and self‑healing
- A production‑ready understanding of scaling, troubleshooting, and hardening
This is the kind of architecture that scales with your platform, not against it. It’s simple where it should be simple, flexible where it needs to be flexible, and grounded in open standards.