Architecture, Observability, Platform Engineering

Part 5 – Troubleshooting, Scaling, and Production Hardening

By this point in the series, you have a fully working, multi‑environment observability pipeline, deployed and reconciled entirely through GitOps:

  • Jaeger v2 running on the OpenTelemetry Collector
  • Applications emitting traces automatically via the OpenTelemetry Operator
  • Environment‑scoped Collectors and Instrumentation CRs
  • Argo CD managing everything through ApplicationSets and sync waves

This final part focuses on what matters most in real‑world environments: operability. Deploying Jaeger v2 is easy. Running it reliably at scale with predictable performance, clear failure modes, and secure communication is where engineering judgment comes in.

This guide covers the most important lessons learned from operating OpenTelemetry and Jaeger in production.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🩺 Troubleshooting: The Most Common Issues (and How to Fix Them)

1. “I don’t see any traces.”

This is the most common issue, and it almost always comes down to one of three things:

a. Wrong OTLP endpoint

Check the app’s environment variables (injected by the Operator):

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_TRACES_ENDPOINT

If these are missing, your Instrumentation CR is not configured with an exporter.

Protocol mismatch also matters: the .NET auto‑instrumentation exports OTLP over HTTP (http/protobuf), so your Instrumentation CR endpoint should target port 4318, not the gRPC port 4317.
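
If the endpoint is missing or wrong, the fix belongs on the Instrumentation CR. A minimal sketch using the Collector Service and namespaces from this series (the CR name is illustrative):

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: dotnet-instrumentation
  namespace: apps
spec:
  exporter:
    # OTLP over HTTP, which the .NET auto-instrumentation uses by default
    endpoint: http://jaeger-inmemory-instance-collector.monitoring.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage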

b. Collector not listening on the expected ports

Verify:

kubectl get svc jaeger-inmemory-instance-collector -n monitoring

In this architecture, the Collector runs in the monitoring namespace, while instrumented workloads run in the apps namespace, so endpoints must reference the Service across namespaces.

You must see:

  • 4318 (OTLP HTTP)
  • 4317 (OTLP gRPC)
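
If either port is missing, check the receivers block in the Collector configuration. A minimal OTLP receiver that exposes both ports looks like this (a sketch of just the receiver section, not the full pipeline):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318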

c. Auto‑instrumentation not activated

For .NET, the Operator must inject:

  • DOTNET_STARTUP_HOOKS
  • CORECLR_ENABLE_PROFILING=1
  • CORECLR_PROFILER
  • CORECLR_PROFILER_PATH

If any of these are missing:

  • The annotation is wrong
  • The Instrumentation CR is missing
  • The Operator webhook failed to mutate the pod
  • The workload was deployed before the Operator (sync wave ordering issue)
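
For reference, injection is triggered by an annotation on the workload's pod template (an abridged sketch; the value can also be the name of a specific Instrumentation CR):

spec:
  template:
    metadata:
      annotations:
        # Ask the Operator webhook to inject the .NET auto-instrumentation
        instrumentation.opentelemetry.io/inject-dotnet: "true"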

2. “Traces appear, but they’re incomplete.”

Common causes:

  • Missing propagation headers
  • Reverse proxies stripping traceparent
  • Sampling too aggressive
  • Instrumentation library not loaded

For .NET, ensure:

  • OTEL_PROPAGATORS=tracecontext,baggage
  • No middleware overwrites headers

3. “Collector is dropping spans.”

Check Collector logs:

kubectl logs deploy/jaeger-inmemory-instance-collector -n monitoring

Look for:

  • batch processor timeout
  • queue full
  • exporter failed

Fixes:

  • Increase batch processor size
  • Increase memory limits
  • Add more Collector replicas
  • Use a more performant storage backend
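
As a concrete starting point, the batch and memory_limiter processors are tuned directly in the Collector config; the numbers below are illustrative, not recommendations:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    send_batch_size: 8192
    timeout: 5s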

📈 Scaling the Collector

The OpenTelemetry Collector is extremely flexible, but scaling it requires understanding its architecture.

Horizontal scaling

You can run multiple Collector replicas behind a Service. This works well when:

  • Apps send OTLP through the Service over gRPC or HTTP (keep in mind that gRPC connections are long‑lived, so a standard Service balances per connection, not per request)
  • You use stateless exporters (e.g., Tempo, OTLP → another Collector)

Vertical scaling

Increase CPU/memory when:

  • You use heavy processors (tail sampling, attributes filtering)
  • You export to slower backends (Elasticsearch, Cassandra)
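
With the OpenTelemetry Operator, both dimensions are set on the OpenTelemetryCollector CR. A sketch using the instance name from this series (replica count and resource figures are placeholders):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: jaeger-inmemory-instance
  namespace: monitoring
spec:
  mode: deployment
  replicas: 3            # horizontal scaling
  resources:             # vertical scaling
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      memory: 1Gi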

Pipeline separation

For large systems, split pipelines:

  • Gateway Collectors – Receive traffic from apps
  • Aggregation Collectors – Apply sampling, filtering
  • Export Collectors – Write to storage

This isolates concerns and improves reliability.
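
The tiers are chained by having each Collector export OTLP to the next one. A gateway‑tier sketch (the aggregation Service name is hypothetical):

exporters:
  otlp:
    # Forward everything to the aggregation tier for sampling and filtering
    endpoint: otel-aggregation.monitoring.svc.cluster.local:4317
    tls:
      insecure: true   # replace with real certificates outside of a demo

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]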

🗄️ Choosing a Production Storage Backend

The demo uses memstore, which is perfect for local testing but not for production. Real deployments typically use:

1. Tempo (Grafana)

  • Highly scalable
  • Cheap object storage
  • Great for high‑volume traces
  • No indexing required

2. Elasticsearch

  • Mature
  • Powerful search
  • Higher operational cost

3. ClickHouse (via Jaeger or SigNoz)

  • Extremely fast
  • Efficient storage
  • Great for long retention

4. Cassandra

  • Historically used by Jaeger v1
  • Still supported
  • Operationally heavy

For most modern setups, Tempo or ClickHouse are the best choices.
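
Switching backends is largely an exporter change in the Collector pipeline. For example, shipping traces to Tempo over OTLP could look like this (the endpoint and queue settings are assumptions, not values from this series):

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.tempo.svc.cluster.local:4317
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true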

🔐 Security Considerations

1. TLS everywhere

Enable TLS for:

  • OTLP ingestion
  • Collector → backend communication
  • Jaeger UI

2. mTLS for workloads

The Collector supports mTLS for OTLP:

  • Prevents spoofed telemetry
  • Ensures only trusted workloads send data
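
On the Collector's OTLP receiver, TLS and mTLS share the same tls block; adding client_ca_file is what upgrades server‑side TLS to mutual TLS. A sketch assuming certificates are mounted from a Secret:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key
          client_ca_file: /certs/ca.crt   # require and verify client certificates (mTLS)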

3. Network policies

Lock down:

  • Collector ports
  • Storage backend
  • Jaeger UI
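
A minimal NetworkPolicy sketch that only admits OTLP traffic from the apps namespace (the label selectors are assumptions; match them to what your Operator actually sets):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: collector-otlp-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: apps
      ports:
        - protocol: TCP
          port: 4317
        - protocol: TCP
          port: 4318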

4. Secrets management

Use:

  • Kubernetes Secrets (encrypted at rest)
  • External secret stores (Vault, SSM, Azure Key Vault)

Never hardcode credentials in Collector configs.

🧪 Sampling Strategies

Sampling is one of the most misunderstood parts of tracing. The wrong sampling strategy can make your traces useless.

Head sampling (default)

  • Simple
  • Fast
  • Drops spans early
  • Good for high‑volume systems

Tail sampling

  • Makes decisions after seeing the full trace
  • Better for error‑focused sampling
  • More expensive
  • Requires dedicated Collector pipelines

Adaptive sampling

  • Dynamically adjusts sampling rates
  • Useful for spiky workloads

Best practice

Start with head sampling, then introduce tail sampling only if you need it.
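
If you do adopt tail sampling, it is its own processor with explicit policies. A sketch that keeps every error trace plus a 10% probabilistic baseline (values are illustrative):

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10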

🌐 Multi‑Cluster and Multi‑Environment Patterns

As your platform grows, you may need:

1. Per‑cluster Collectors, shared backend

Each cluster runs its own Collector, exporting to a central storage backend.

2. Centralized Collector fleet

Apps send OTLP to a global Collector layer.

3. GitOps per environment

Structure your repo like:

environments/
  dev/
  staging/
  prod/

This series includes only dev, but the structure supports adding staging and prod easily.

Each environment can have:

  • Different sampling
  • Different storage
  • Different Collector pipelines

🧭 Final Thoughts

Jaeger v2, OpenTelemetry, and GitOps form a powerful, modern observability stack. Across this series, you’ve built:

  • A Jaeger v2 deployment using the OpenTelemetry Collector
  • A .NET application emitting traces with zero code changes
  • A GitOps workflow that keeps everything declarative and self‑healing
  • A production‑ready understanding of scaling, troubleshooting, and hardening

This is the kind of architecture that scales with your platform, not against it. It’s simple where it should be simple, flexible where it needs to be flexible, and grounded in open standards.

Cloud Native, GitOps, Platform Engineering

Part 4 – Building a Scalable, Multi‑Environment GitOps Architecture with Argo CD

In the previous parts of this series, we focused on instrumentation, the OpenTelemetry Operator, and the Collector. Now we shift gears and build the GitOps architecture that will manage everything: platform components, workloads, and environment‑specific configuration, all in a clean, scalable, production‑ready way.

This is where the project becomes a real platform.

All manifests, ApplicationSets, and configuration used in this series are available in the companion GitHub repository.

🎯 What We’re Building in Part 4

By the end of this part, you will have:

  • A multi‑environment GitOps structure
  • A clean separation between:
    • Platform components (cert-manager, OTel Operator, Collector)
    • Application workloads (demo-dotnet)
  • A split ApplicationSet model:
    • One ApplicationSet for Helm‑based platform components
    • One ApplicationSet for plain‑YAML platform components
    • One ApplicationSet for application workloads
  • Matrix generators that produce environment‑named instances, e.g. dev-cert-manager, dev-collector, dev-demo-dotnet; adding another environment such as staging would yield staging-cert-manager, staging-collector, staging-demo-dotnet
  • Sync waves to enforce deterministic ordering
  • Namespace isolation per environment
  • Environment‑specific overrides via environments/{{.environment}}/values/

This is the architecture used by real platform teams running GitOps at scale.

📁 Repository Structure

A clean repo structure makes GitOps easier. This series uses:

argocd/
  app-of-apps.yaml
  applicationset-platform-helm.yaml
  applicationset-platform.yaml
  applicationset-apps.yaml

platform/
  cert-manager/
  opentelemetry-operator/
  collector/

apps/
  demo-dotnet/

environments/
  dev/
    values/
      platform-values.yaml
      apps-values.yaml

This structure is intentionally simple, scalable, and DRY.

🌱 The App-of-Apps Root

Argo CD starts with a single root Application:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<your-repo>
    targetRevision: main
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: false
      selfHeal: false
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true

This Application discovers and applies all ApplicationSets inside argocd/.

🏗️ Splitting Platform Components

Not all platform components are created equal:

  • cert-manager → Helm chart
  • opentelemetry-operator → Helm chart
  • collector → plain YAML

Trying to force everything through Helm creates errors and unnecessary complexity.
So we split the platform into two ApplicationSets:

1. applicationset-platform-helm.yaml

Manages cert-manager + OTel Operator.

2. applicationset-platform.yaml

Manages the Collector.

This keeps the repo clean and avoids forcing plain manifests through Helm templating.

🧬 Matrix Generators: env × component

Each ApplicationSet uses a matrix generator:

  • One list defines environments (dev, staging, prod)
  • One list defines components (e.g., cert-manager, operator, collector)

This series includes only a dev environment, but the structure supports adding staging and prod without structural changes.

Argo CD multiplies them:

(dev × cert-manager)
(dev × operator)
(dev × collector)
(staging × cert-manager)
...

This produces a clean, predictable set of Applications per environment.
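
The complete ApplicationSets are in the companion repository; the essential shape of the matrix generator is roughly this (abridged, with Go templating enabled so {{.environment}} and {{.component}} resolve; the destination namespace mapping is illustrative):

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-helm
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - environment: dev
          - list:
              elements:
                - component: cert-manager
                - component: opentelemetry-operator
  template:
    metadata:
      name: '{{.environment}}-{{.component}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/<your-repo>
        targetRevision: main
        path: platform/{{.component}}
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{.component}}'   # illustrative; use your own namespace scheme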

⏱️ Sync Waves: Ordering Matters

Platform components must deploy in the correct order:

Wave  Component
0     cert-manager, opentelemetry-operator
1     collector
3     workloads

This ensures:

  • CRDs exist before the Operator starts
  • The Collector exists before workloads send telemetry
  • Workloads deploy last
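
The ordering itself comes from the argocd.argoproj.io/sync-wave annotation. For example, placing the Collector in wave 1 (the companion repo shows exactly where the annotation is attached, whether on the generated Application or on the component's manifests):

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"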

🌍 Environment-Specific Overrides

Each environment has its own values files, for example:

environments/dev/values/platform-values.yaml
environments/staging/values/platform-values.yaml
environments/prod/values/platform-values.yaml

Only dev is included in this series, but the pattern scales to additional environments easily.

This keeps platform definitions DRY while allowing environment‑specific behaviour.
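
For the Helm‑based components, one way to wire these in is to reference the environment's values file from the ApplicationSet template. This assumes your Argo CD version allows value files outside the chart path within the same repository; older versions need the multi‑source $values approach instead:

      source:
        repoURL: https://github.com/<your-repo>
        targetRevision: main
        path: platform/{{.component}}
        helm:
          valueFiles:
            # Relative to the chart path, resolving to environments/<env>/values/
            - ../../environments/{{.environment}}/values/platform-values.yaml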

🚀 The Result

By the end of Part 4, you have:

  • A fully declarative, multi‑environment GitOps architecture
  • Clean separation of platform vs apps
  • Deterministic ordering via sync waves
  • Environment‑specific overrides
  • Namespace isolation
  • A scalable pattern for adding new apps or environments

This is the foundation for everything that follows in the series.