Distributed Tracing aims to provide better observability into distributed systems and microservices for purposes of performance monitoring and troubleshooting issues.
Distributed Tracing
Distributed Tracing aims to provide better observability into distributed systems and microservices for purposes of performance monitoring and troubleshooting issues.
Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.
Source: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
How it works in a nutshell
Distributed Tracing works by collecting the various entry and exit points and useful intermediate data and metrics done by a request until the final response is served to the requesting end. Some Distributed Tracing systems collect this information fully automatic while some other require manual instrumentation of code.
When entering a system, the request is usually assigned a unique Trace ID. This ID is then propagated to any participating systems. Information gathered this way is sent to some sort of backend collecting the data. The collector then aggregates the data via the Trace ID, thus showing the full request as it passed through the distributed system.
Metrics usually included are request time, latency, errors, status codes, etc. but not limited to this.
Open Source implementations:
Several Open Source implementations for Distributed Tracing exist:
-
A single distribution of libraries for metrics and distributed tracing with minimal overhead that allows you to export data to multiple backends.
-
Vendor-neutral APIs and instrumentation for distributed tracing
-
Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data.
-
Jaeger, inspired by Dapper and OpenZipkin, is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems
There is also a W3 working group aiming to standardize context propagation across various Distributed Tracing systems:
Because Distributed Tracing is crucial for application performance monitoring, most APM vendors adopted it in one way or another. Notable APM vendors offering Distributed Tracing are AppDynamics, DynaTrace, Instana, Lightstep or New Relic.