Observability has become crucial for modern applications. Distributed tracing, in particular, is essential for diagnosing issues within complex transactions. OpenTelemetry has emerged as the de facto standard in this field. This article explores the fundamentals of distributed tracing, its implementation, and its impact.
Distributed Traces: An Essential Pillar
In a context where applications are increasingly complex, it is crucial to be able to observe each transaction in detail to quickly identify problems and ensure continuous evolution of the platform. Among the three main pillars of observability—logs, metrics, and traces—traces are often less known but equally essential.
Unlike logs, traces allow tracking the performance of transactions in production and identifying contention points. They also provide the ability to follow synchronous transactions (such as HTTP or SQL requests) or asynchronous ones (such as JMS or Kafka messages).
History and Evolution
In 2010, Google published a case study on Dapper [1], their distributed tracing infrastructure. This platform addressed a pressing need: correlating different transactions with each other and facilitating their end-to-end visualization in increasingly distributed and complex platforms.
Thanks to this work, open-source specifications and projects came into being. Initially, one of Dapper’s creators, Ben Sigelman, created the open-source project OpenTracing, which was later integrated into the OpenTelemetry project.
The ecosystem continues to grow. Beyond the major cloud providers offering their solutions, there are many open-source or proprietary products that cover all or part of the scope: Jaeger, Grafana, or Elastic APM.
A summary can be found at the end of this article.
Some Basic Principles
The various mechanisms related to traces revolve around the following concepts:
- SPANS
- TRACES
- Context Propagation
SPANS
SPANS represent a single operation such as a method call, a REST call, or a database query. Each SPAN is identified by a unique identifier (SPAN_ID) and can have a parent-child relationship with another SPAN (e.g., a method called by another method), expressed through a PARENT_SPAN_ID field.
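To make this parent-child relationship concrete, here is a minimal, illustrative Java sketch. It deliberately does not use the OpenTelemetry API: the `Span` record and the ID generation are simplified assumptions, meant only to show how each SPAN carries its own identifier and, optionally, that of its parent.

```java
import java.util.UUID;

public class SpanModel {
    // Simplified span: a real OpenTelemetry span also carries a TRACE_ID,
    // timestamps, attributes, and a status.
    record Span(String spanId, String parentSpanId, String operation) {}

    static String newId() {
        // Illustrative ID generation; real span IDs are 8-byte hex values.
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    public static void main(String[] args) {
        Span parent = new Span(newId(), null, "GET /orders");          // root span
        Span child = new Span(newId(), parent.spanId(), "SELECT 1");  // child span

        // The child references its parent through PARENT_SPAN_ID.
        System.out.println(child.parentSpanId().equals(parent.spanId())); // true
    }
}
```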
TRACES
Traces are a collection of SPANS that represent an end-to-end transaction.

Context Propagation
How can operations across different applications and processes be correlated? To address this question, the identifiers described above (TRACE_ID and SPAN_ID) are automatically propagated from one service to the next.

In this way, the called services automatically have the information needed to link them back to the calling services. To ensure the portability of this information, the various tools and infrastructures rely on the W3C Trace Context standard [2].
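For illustration, the W3C Trace Context standard defines a `traceparent` HTTP header that concatenates a version, the TRACE_ID, the parent SPAN_ID, and trace flags. The small stdlib-only Java sketch below (a hypothetical helper, not part of any OpenTelemetry API) shows how a called service could read these fields; the sample IDs reuse those visible in the log excerpt later in this article.

```java
public class TraceParentParser {
    // Parses a W3C traceparent header: version-traceid-parentid-flags
    // e.g. 00-8b277041692baa8167de5c67977d6571-13ff9e44be450b8e-01
    static String[] parse(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("Malformed traceparent: " + traceparent);
        }
        return parts; // [version, trace-id, parent span-id, trace-flags]
    }

    public static void main(String[] args) {
        String header = "00-8b277041692baa8167de5c67977d6571-13ff9e44be450b8e-01";
        String[] fields = parse(header);
        System.out.println("TRACE_ID = " + fields[1]);
        System.out.println("PARENT_SPAN_ID = " + fields[2]);
    }
}
```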
The Architecture
The typical architecture to implement looks like this:

In green, you will find the modules necessary for processing, storing, and displaying traces. All application modules (shown here in blue) send TRACES and SPANS to the OpenTelemetry collector. This collector then stores the data in the database (shown here as Grafana Tempo).
Implementation and Data Broadcast
TRACES and SPANS are created, distributed, and sent by OpenTelemetry, which supports many languages and platforms [3]: Java, Node.js, Python, Go, or Rust.
Two transmission protocols are available: HTTP and gRPC.
When integrating tracing into your programs, three approaches are possible: using the API directly (and coding these features yourself), using a library for a "zero code" approach, or agent-based instrumentation.
Depending on the programming languages, you have several options available.
Each option has its advantages and disadvantages, which need to be evaluated based on your technical and organizational context.
The use of a library offers a quicker start and is essential for native compilation (Java with GraalVM, Go, or Rust).
Instrumentation, on the other hand, involves installing an agent that is passed as an argument to the application at startup.
An example in Java:
$ export JAVA_OPTS=-javaagent:/opt/agent/javaagent.jar
$ java $JAVA_OPTS -jar myapp.jar
In my opinion, this option provides more flexibility in the software lifecycle.
Indeed, beyond the fact that it works “out of the box,” it allows you to decouple the lifecycle of the functional deliverable from that of the agent. For example, if the agent needs to be updated, there is no requirement to re-deliver the business deliverable.
In both cases, you will still need to configure, through environment variables, the properties needed to transmit SPANS and TRACES:
OTEL_SERVICE_NAME: api-gateway
OTEL_EXPORTER_OTLP_ENDPOINT: http://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL: grpc
OTEL_RESOURCE_ATTRIBUTES: "source=agent"
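For instance, with the agent-based approach, the same configuration can be supplied as plain environment variables before starting the application. This is a sketch: the service name, endpoint, and agent path below are the illustrative values used above.

```shell
# OpenTelemetry exporter configuration (values from the example above)
export OTEL_SERVICE_NAME=api-gateway
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_RESOURCE_ATTRIBUTES="source=agent"

# The application is then started with the agent attached, e.g.:
#   java -javaagent:/opt/agent/javaagent.jar -jar myapp.jar
```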
Aggregation
You might have noticed the URL mentioned in the above configuration. Indeed, all observability-related streams, particularly traces, pass through a collector. OpenTelemetry provides a default one. There are other implementations as well, such as Grafana Alloy.

This tool can be compared to an ETL (Extract, Transform, Load) [4]. It provides an interface for all observability-related sources, transforms and aggregates the data, and finally sends it to a backend (e.g., a database).
Here is a deliberately simplistic configuration to collect traces and store them in the Grafana Tempo database:
# (1)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
# (2)
processors:
  batch:
# (3)
exporters:
  otlp:
    endpoint: tempo:4317
# (4)
extensions:
  health_check:
  pprof:
  zpages:
# (5)
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
Explanation:
1. Declaration of the interfaces where data is received from all applications
2. Declaration of the actions to be performed
3. Declaration of the output interfaces (here, the database accessible via the URL tempo)
4. Declaration of various extensions
5. Declaration of the pipeline to be executed
Now, let's imagine we want to go further. For example, to systematically exclude certain URLs (e.g., /health or /metrics), this can be done in the collector through tail sampling:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    decision_cache:
      sampled_cache_size: 500
    policies:
      [{
        name: test-policy-9,
        type: string_attribute,
        string_attribute: {key: http.url, values: [\/health, \/metrics], enabled_regex_matching: true, invert_match: true}
      }]
Once the processor is declared, you just need to add it in the pipeline declaration to activate it:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlp]
For more details on the available options, you can consult the official documentation and the GitHub repository.
Storage
TRACES and SPANS are automatically sent to and aggregated on a platform that allows for storing, visualizing, and correlating all this information. Among the solutions available on the market, the open-source Grafana platform covers the entire spectrum.
For instance, the Tempo database allows for the storage of traces, and Grafana enables querying and visualizing those traces.
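As a sketch, traces stored in Tempo can then be queried from Grafana with TraceQL, Tempo's query language. The service name below is the illustrative api-gateway used earlier in this article, and the duration threshold is an arbitrary example.

```
{ resource.service.name = "api-gateway" && duration > 500ms }
```

Such a query would return only the slow traces of that service, which is typically the starting point of a performance analysis.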
Display
The added value of OpenTelemetry, and of observability in general, becomes evident when all the data is presented in a report. Below is an example with Grafana. Through what is known as a "Flame Graph," you can break down all method calls as well as the various transactions between the different services, whether through REST API calls, database queries, or asynchronous messaging.
Unlike APM (Application Performance Monitoring) solutions, which are used only in dedicated environments (e.g., environments dedicated to performance testing), this display provides a snapshot of transactions executed in production.

To optimize the display and provide more ease and flexibility in error analysis, it’s also possible to correlate logs and traces. To do this, it’s necessary to add the SPAN_ID and TRACE_ID identifiers to the logs.
For example:
{"sequenceNumber":0,"timestamp":1719910676210,"nanoseconds":210150314,"level":"INFO","threadName":"kafka-binder-health-1","loggerName":"org.apache.kafka.clients.NetworkClient","context":{"name":"default","birthdate":1719907122576,"properties":{"applicationName":"easypay-service","instance":"easypay-service:9a2ac3f0-c41e-4fcd-8688-123993f1d5db"}},"mdc": {"trace_id":"8b277041692baa8167de5c67977d6571","trace_flags":"01","span_id":"13ff9e44be450b8e"},"message":"[Consumer clientId=consumer-null-1, groupId=null] Node -1 disconnected.","throwable":null}
Thus, in the visualization tool, we can link traces and logs to have more contextual information.
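With the Java agent, these identifiers are injected into the logging MDC. A possible Logback configuration (a sketch; the MDC keys trace_id and span_id match those visible in the log sample above) then only needs to reference them in the output pattern:

```xml
<!-- logback.xml: append the current trace and span IDs to every log line -->
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
  <encoder>
    <pattern>%d{ISO8601} %-5level [%X{trace_id}/%X{span_id}] %logger{36} - %msg%n</pattern>
  </encoder>
</appender>
```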
Compatibility
OpenTelemetry is the de facto standard for transmitting observability-related data, particularly traces.
The main solutions on the market are compatible with it.
If you wish to further enhance observability on your platforms, here is a non-exhaustive overview of market solutions that include trace management in their portfolios. They are all compatible with OpenTelemetry.
Note: solutions that do not offer their own agent or collector rely on the OpenTelemetry ones.
Performance
A common question concerns the impact on performance. It is undeniable: depending on the environment, the technology, and the number of SPANS per transaction, it can affect the expected quality of service to varying degrees.
To meet execution time requirements, it is necessary to conduct performance tests as early as possible to measure the impact. Based on the results, it may be necessary to perform trace sampling.
With sampling, only a percentage of requests (e.g., 10%) is processed. Two sampling mechanisms exist: "head sampling," applied at the agent or application level, and "tail sampling," applied in the collector. The former is mandatory if your application handles high volumes, as it avoids the creation of SPANS, the network calls, and the storage altogether.
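As a sketch, with the Java agent, head sampling can be enabled declaratively through the standard OTEL_TRACES_SAMPLER environment variables; the 10% ratio below matches the example above.

```shell
# Keep roughly 10% of traces, honoring the caller's sampling decision
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.10
```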
At Worldline, we conducted several tests on performance impacts. On Java applications, on average, we encounter an impact of up to 15% on response times. Like with logs, this must be weighed against the benefits that traces can bring in error analysis.
Conclusion
You will note that OpenTelemetry offers many features. Among these, traces provide a complementary view to logs and a more complete vision of platforms. However, they should not be confused: they will not replace logs. They highlight distributed transactions and provide more contextual information about an error.
Moreover, one of their main advantages is that they are easy to adopt during development. For example, in Java, adding an agent at JVM startup captures traces without modifying the code. This mechanism is very useful for observing applications more finely, whether they are cloud native or legacy.
Finally, if you want to explore all aspects related to observability in more detail, I conducted a workshop on this topic with my colleague David PEQUEGNOT. It is available at the following address: https://worldline.github.io/observability-workshop . You will find the source code of the examples presented in this article.
1. https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
2. https://www.w3.org/TR/trace-context/
3. https://opentelemetry.io/docs/languages/
4. https://en.wikipedia.org/wiki/Extract,_transform,_load