Distributed Tracing: The Missing Piece of Your Observability Stack

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

When Logs and Metrics Aren't Enough

You have great dashboards. Your log aggregation is solid. But when a user reports "the checkout page is slow," you still spend 30 minutes jumping between services trying to find the bottleneck.

That's the gap distributed tracing fills.

What Tracing Actually Shows You

A trace is a complete picture of a single request as it flows through your system:

User Request → API Gateway → Auth Service → Product Service → DB     → Cache → Response
               5ms           12ms           45ms              120ms    3ms
                                                              ^
                                                              This is your bottleneck

Without tracing, you'd see:

  • API Gateway: latency looks fine
  • Auth Service: latency looks fine
  • Product Service: latency is HIGH but why?

With tracing, you see the exact DB query inside Product Service that's taking 120ms.
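To make that concrete, here is a minimal sketch of how a bottleneck can be picked out of a single trace. The `Span` class, the span names, and the durations are illustrative assumptions: the totals are chosen so that each span's own time matches the figures in the diagram, and child spans are assumed to run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span; real spans also carry IDs, timestamps and attributes."""
    name: str
    parent: Optional[str]   # name of the parent span, None for the root
    duration_ms: float      # total wall time, including child spans


def self_times(spans):
    """Return {span name: time spent in the span itself, excluding its children}."""
    child_total = {}
    for s in spans:
        if s.parent is not None:
            child_total[s.parent] = child_total.get(s.parent, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.name, 0.0) for s in spans}


def bottleneck(spans):
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


# Assumed durations, chosen so the self times match the diagram.
trace = [
    Span("api_gateway", None, 185.0),
    Span("auth_service", "api_gateway", 12.0),
    Span("product_service", "api_gateway", 168.0),
    Span("db_query", "product_service", 120.0),
    Span("cache_lookup", "product_service", 3.0),
]

print(bottleneck(trace))
```

Product Service looks slow from the outside because its total includes the DB query. Subtracting child time is what points at the query rather than the service.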

Getting Started with OpenTelemetry

OpenTelemetry is the vendor-neutral standard for instrumentation. Here's a minimal setup:

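# Python example with Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# `app` and `db` stand in for your own Flask app and Flask-SQLAlchemy instance;
# the module name `myapp` is a placeholder.
from myapp import app, db

# Setup: name the service so its spans are identifiable in the backend
# ("product-service" is a placeholder)
provider = TracerProvider(resource=Resource.create({"service.name": "product-service"}))
provider.add_span_processor(
    # Spans are batched and sent over OTLP/gRPC to the collector
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument inbound requests, outbound HTTP calls, and DB queries
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
# Flask-SQLAlchemy 3.x needs an application context to resolve db.engine
with app.app_context():
    SQLAlchemyInstrumentor().instrument(engine=db.engine)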

That's it. Three auto-instrumentations (inbound Flask requests, outbound HTTP calls made with requests, and SQLAlchemy queries) cover roughly 80% of what you need.

Custom Spans for the Other 20%

Auto-instrumentation gives you HTTP calls and DB queries. Add custom spans for business logic:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order):
    # Parent span for the whole operation; the nested spans become its children
    with tracer.start_as_current_span("process_order") as span:
        # Attribute values must be str, bool, int, or float (convert Decimal totals first)
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        with tracer.start_as_current_span("validate_inventory"):
            validate_inventory(order.items)

        with tracer.start_as_current_span("charge_payment"):
            charge_payment(order.payment_method, order.total)

        with tracer.start_as_current_span("send_confirmation"):
            send_email(order.customer_email)

Sampling Strategy

You can't trace every request in production. Well, you can, but your bill will be astronomical.

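# otel-collector-config.yaml
# Use one of these two processors in a given pipeline. Placing
# probabilistic_sampler ahead of tail_sampling would discard 90% of traces
# before the error and latency policies ever see them.
processors:
  # Option A: head sampling, decided up front
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of requests

  # Option B: tail sampling, decided once the trace has been collected.
  # A trace is kept if any policy below matches.
  tail_sampling:
    decision_wait: 10s  # How long to wait for a trace's spans before deciding
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 5% of everything else
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}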

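Tail sampling is the key. Because the decision is made after the trace completes, you can keep every trace that contains an error or takes longer than one second, and only 5% of the boring ones. The cost is that the collector has to buffer each trace's spans in memory until it decides, and all spans of a trace must reach the same collector instance.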
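The decision logic behind those three policies can be sketched in a few lines of Python. This is an illustration of the idea, not the collector's implementation: the `CompletedTrace` class, the hash-based 5% selection, and the handling of a trace at exactly 1000 ms are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedTrace:
    """Hypothetical summary of a finished trace, as a tail sampler might see it."""
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_fraction(trace_id: str, percentage: float) -> bool:
    """Deterministically keep roughly `percentage` percent of trace IDs by hashing them."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return bucket < percentage * 100


def should_keep(trace: CompletedTrace) -> bool:
    """A trace is kept if any policy matches."""
    if trace.has_error:                        # policy "errors"
        return True
    if trace.duration_ms >= 1000:              # policy "slow-requests" (boundary is an assumption)
        return True
    return keep_fraction(trace.trace_id, 5)    # policy "default"
```

Hashing the trace ID instead of calling a random number generator means the same trace always gets the same answer, which is useful when more than one component has to agree on the decision.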

The Three Queries That Matter

Once you have tracing data, three queries cover most day-to-day debugging:

1. "Show me the slowest traces in the last hour"
   → Finds performance regressions

2. "Show me traces with errors, grouped by service"
   → Finds which service is failing

3. "Show me traces for user X's request at time T"
   → Reproduces specific customer issues
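Every tracing backend has its own query language, so the exact syntax will differ. As a backend-neutral sketch, here are the same three queries written as plain Python over an in-memory list. The `TraceSummary` record, its fields, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TraceSummary:
    """Hypothetical flattened trace record, one row per trace."""
    trace_id: str
    service: str          # service the trace is attributed to
    user_id: str
    start: datetime
    duration_ms: float
    error: bool


def slowest(traces, now, window=timedelta(hours=1), limit=10):
    """Query 1: slowest traces that started within `window` before `now`."""
    recent = [t for t in traces if now - window <= t.start <= now]
    return sorted(recent, key=lambda t: t.duration_ms, reverse=True)[:limit]


def errors_by_service(traces):
    """Query 2: number of failed traces per service."""
    return Counter(t.service for t in traces if t.error)


def for_user_at(traces, user_id, at, tolerance=timedelta(minutes=1)):
    """Query 3: traces for one user that started within `tolerance` of time `at`."""
    return [t for t in traces if t.user_id == user_id and abs(t.start - at) <= tolerance]


# Made-up sample data for illustration.
now = datetime(2024, 1, 1, 12, 0)
traces = [
    TraceSummary("t1", "checkout", "alice", now - timedelta(minutes=5), 2400.0, False),
    TraceSummary("t2", "payments", "bob", now - timedelta(minutes=20), 310.0, True),
    TraceSummary("t3", "payments", "alice", now - timedelta(hours=3), 9000.0, True),
]

print([t.trace_id for t in slowest(traces, now)])
print(errors_by_service(traces))
print([t.trace_id for t in for_user_at(traces, "alice", now - timedelta(minutes=5))])
```

The third query only works if spans carry a user identifier, which is one reason to attach business attributes to your spans.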

Common Mistakes

  1. Not propagating trace context: If service A calls service B without forwarding the trace context (the W3C traceparent header), service B starts a new trace and you get broken traces
  2. Over-sampling in production: Start at 1-5% and increase as needed
  3. Not adding business context: Attributes such as user.id and order.id on spans are what make traces searchable and actually useful
  4. Ignoring async operations: Queues break trace propagation unless you explicitly pass the context along with each message
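Mistakes 1 and 4 come down to the same thing: the trace context has to travel with the request or message. The sketch below shows what that looks like on the wire using the W3C `traceparent` format, with only the standard library. In a real service you would normally let OpenTelemetry's propagators (`inject` and `extract` in `opentelemetry.propagate`) do this. The message envelope and the helper names here are hypothetical.

```python
import json
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent_header):
    """Continue the caller's trace: same trace ID and flags, fresh span ID."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        return new_traceparent()   # missing or malformed context starts a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def publish(payload, traceparent):
    """Producer side: embed the context in the message (hypothetical envelope format)."""
    return json.dumps({"headers": {"traceparent": traceparent}, "body": payload})


def consume(message):
    """Consumer side: read the context back and continue the same trace."""
    envelope = json.loads(message)
    incoming = envelope.get("headers", {}).get("traceparent", "")
    return child_traceparent(incoming), envelope["body"]


producer_ctx = new_traceparent()
consumer_ctx, body = consume(publish({"order_id": 42}, producer_ctx))
print(producer_ctx.split("-")[1] == consumer_ctx.split("-")[1])
```

If the producer leaves the header out, `consume` falls back to a fresh trace ID. That fallback is the broken trace from mistake 1: both halves are recorded, but nothing links them.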

If you want AI-powered trace analysis that automatically finds bottlenecks, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
