Tracing
Tracking requests and messages as they flow through distributed services to understand system behavior.
Distributed tracing records the requests and messages flowing through your system, even across multiple services. Given a trace ID, you can see which operation published an event, what that event triggered, and how long each step took.
A trace is a group of spans sharing the same trace ID. Each span represents a single operation: an HTTP request, a database query, processing a message, or publishing one. Spans have a start time, end time, key/value attributes, and a parent span ID that connects them into a tree.
In event-driven systems, tracing is especially valuable. When a single user action triggers a chain of events across multiple services, it's hard to understand what happened without seeing the full picture. Tracing connects all those steps into one timeline.
The standard tool for this is OpenTelemetry, which provides a vendor-neutral API for recording spans and exporting them to backends like Jaeger, AWS X-Ray, or Google Cloud Trace. In Go, starting a new span is straightforward:
ctx, span := otel.Tracer("").Start(ctx, "ProcessOrder")
defer span.End()
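The context carries the trace through function calls. When you pass it to an instrumented Pub/Sub publisher, the publisher copies the trace ID and the current span ID into the message metadata. The subscriber reads them back, so the spans it creates join the same trace. This only works if both sides are instrumented: without it, the trace stops at the publisher.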
The trade-off is data volume. Tracing every request generates a lot of data, so sampling (recording only a fraction of traces) is used to keep costs manageable. Even with sampling, tracing is one of the most effective tools for understanding how distributed systems behave in production.
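One common way to sample is to derive the decision from the trace ID itself, so every service that sees the same trace makes the same choice without coordination. The sketch below illustrates that idea with a hypothetical `ShouldSample` function. It is not the exact algorithm of any particular library, and the 10% ratio is an arbitrary example value.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// ShouldSample decides whether to keep a trace, based only on its ID.
// The same trace ID and ratio always produce the same decision.
// Hypothetical helper for this sketch.
func ShouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Interpret the last 8 bytes as a 63-bit number and keep the trace
	// if it falls below ratio * 2^63. This assumes trace IDs are
	// uniformly random, so roughly `ratio` of all traces are kept.
	x := binary.BigEndian.Uint64(traceID[8:]) >> 1
	threshold := uint64(ratio * (1 << 63))
	return x < threshold
}

func main() {
	rng := rand.New(rand.NewSource(42)) // fixed seed for a repeatable demo
	const total = 100000
	sampled := 0
	for i := 0; i < total; i++ {
		var id [16]byte
		binary.BigEndian.PutUint64(id[:8], rng.Uint64())
		binary.BigEndian.PutUint64(id[8:], rng.Uint64())
		if ShouldSample(id, 0.1) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of %d traces\n", sampled, total)
}
```

With a ratio of 0.1, about one in ten traces is kept. Because the decision depends only on the trace ID, a sampled trace stays complete across services instead of losing random spans.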
References
- Increasing Cohesion in Go with Generic Decorators — Discusses cross-cutting concerns like logging, metrics, and tracing that need to happen for each request. Shows how generic decorators can handle these operations without polluting business logic.
- Introducing Watermill - Go event-driven applications library — Mentions distributed tracing as one of the essential middleware features for modern message-driven services, alongside metrics, poison queues, and retrying.
- The Go libraries that never failed us: 22 libraries you need to know — Includes a section on metrics and tracing, recommending opencensus-go for adding tracing to HTTP endpoints, gRPC endpoints, and SQL databases using middleware and decorator patterns.
- Shipping an AI Agent that Lies to Production: Lessons Learned — Describes using tracing to understand how the AI system works in production. The team reused existing tracing infrastructure by adding spans and propagating them, showing how tracing reveals which parts run concurrently.
- Distributed Transactions in Go: Read Before You Try — Notes that observability, including metrics and tracing, is essential for event-driven systems, especially when monitoring the queue of waiting messages.
- Event-Driven Architecture: The Hard Parts — Emphasizes that observability is essential for debugging async systems. Recommends having tracing and correlation IDs in place before adopting event-driven architecture, as debugging without them is almost impossible.
- When you shouldn't use frameworks in Go — Discusses wrapping metrics and tracing libraries behind internal interfaces. Since tracing calls are scattered across the codebase, hiding them behind your own interface makes it much easier to replace the implementation later.
- OpenTelemetry Concepts