OpenTelemetry was born from the need to deal with today's increasingly complex and distributed architectures.
The number of services running in a modern architecture has exploded in recent years, so much so that developers and Ops teams needed new standards for collecting the metrics, logs, and traces required to meet their performance objectives.
This guide provides a deep dive into the subject of OpenTelemetry. If you're on the lookout for a tool to analyze your OpenTelemetry data, you can get a free recommendation by completing our questionnaire.
This resource covers the following topics:
OpenTelemetry (OTEL) is an open-source Observability framework composed of several tools, APIs, and SDKs, operating under the Cloud Native Computing Foundation (CNCF). It was formed through a merger of the OpenTracing and OpenCensus projects to help collect telemetry data (logs, traces, and metrics) from apps and services, and forward it to tools that analyze performance and behavior. OTEL supports many modern programming languages, such as Java, .NET, C++, Go, Python, Node.js, PHP, Ruby, Rust, Swift, and Erlang, as well as frameworks built on them, such as NestJS.
To become more agile, companies are moving away from monolithic architectures to microservice-based architectures. However, a more distributed architecture makes it difficult for developers and Ops teams to understand the dependencies between services during outages. Observability, which OpenTelemetry enables, is an approach to instrumentation for gathering actionable data on these services and systems and identifying issues faster.
Operations teams should:
Example: Let’s say you have three applications interacting with an API service that sends requests to a database. When the Ops team gets an alert that the database is down, they need to check metrics like CPU utilization and memory usage, and how requests are reaching the database. Then, they’ll check the logs related to the API service and any application-related logs and traces. With the help of these three pillars of Observability, which are traces, metrics, and logs, the Ops team should be able to find out why the database is down and fix the issue accordingly.
However, at one point, there was no standardized data format for sending data to the Observability back-end. To bring about standardization, two open-source projects were created by the cloud community: OpenTracing, which standardized an API for distributed tracing, and OpenCensus, which provided libraries for collecting metrics and traces.
And so, OpenTelemetry was formed by merging these two projects.
Telemetry is used in multiple fields such as meteorology, medicine, and software development. In software development, telemetry refers to an automated process of collecting data about different systems and applications remotely and transmitting that data to a location that analyzes the performance of said systems and applications. This has multiple layers, such as APIs, canonical implementations, SDKs, and infrastructure data.
Traces, metrics, and logs are telemetry data types (the three pillars of Observability). You can use OpenTelemetry to instrument, generate, collect, and export this telemetry data from your application or infrastructure to understand what is going on. Each type of data source has a specific role in application and infrastructure monitoring, which we cover in-depth below.
When a request is made by a user, application, or service, a trace gives a high-level picture of what is happening. In microservice-based architectures, there are many interdependent services, and most operations involve multiple hops through multiple services.
In such cases, tracing allows for visibility into the end-to-end health of an application.
Different components help with tracing in OpenTelemetry:
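To make the idea concrete, here is a minimal conceptual sketch of the tracing data model: a trace is a tree of spans that share one trace ID, with each span recording a timed operation. The class and function names below are illustrative, not the real OpenTelemetry API.

```python
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation within a trace (illustrative, not the OTEL API)."""
    name: str
    trace_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: float | None = None

    def finish(self) -> None:
        self.end = time.time()


def start_trace(root_name: str) -> Span:
    """Begin a new trace: a fresh trace_id and a root span."""
    return Span(name=root_name, trace_id=uuid.uuid4().hex)


def start_child(parent: Span, name: str) -> Span:
    """A child span inherits the parent's trace_id, forming the tree."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


# One request hopping through two services:
root = start_trace("GET /checkout")
db = start_child(root, "SELECT orders")
db.finish()
root.finish()
```

Because every span carries the same `trace_id`, a backend can reassemble the end-to-end path of the request across service boundaries.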
Logs contain high-level or detailed data, such as:
These logs are written to log files, containing every event. Whereas metrics show “what” happened, logs show “why” it happened.
Ops teams usually determine how much log data to collect, for what periods, and at what level of detail, balancing what they need for troubleshooting against the cost of storing the logs. The Ops team can set up appropriate retention periods for log data. While it’s not easy to read all the logs when an incident happens, OpenTelemetry does help collect and send these logs to different tools for faster analysis.
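One thing that makes logs easier to analyze alongside traces is stamping each log record with a trace identifier. The sketch below does this with Python's standard `logging` module; the hard-coded `trace_id` is a stand-in, since in practice it would come from the active span's context.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach a trace identifier to every record (trace_id is illustrative)."""

    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # exposed to the formatter below
        return True


buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))

logger = logging.getLogger("api-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.addFilter(TraceContextFilter(trace_id="4bf92f3577b34da6"))

logger.error("database connection refused")
print(buffer.getvalue().strip())
# ERROR trace=4bf92f3577b34da6 database connection refused
```

With records tagged this way, an Ops team can jump from a failing trace straight to the log lines produced during that request.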
Metrics provide statistical information about a system, application, or service: the measurement itself, the time it was captured, and any associated metadata.
With the help of metrics, you can understand key measurements for your infrastructure or application, such as:
And, can do so across different dimensions, such as:
With the help of metrics, your Ops team can monitor Key Performance Indicators (KPIs) to get a high-level picture of how systems or applications are performing. Logs can also be analyzed to generate additional metrics, such as average response time or average bytes transferred, and raw data points can be aggregated into summary values such as count, sum, average, and percentage.
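As a small illustration of deriving aggregate metrics from raw request records, the snippet below computes a count, an average response time, and an error percentage. The record layout is made up for the example.

```python
# Hypothetical request records, as a log pipeline might parse them.
requests = [
    {"path": "/checkout", "status": 200, "ms": 120},
    {"path": "/checkout", "status": 500, "ms": 340},
    {"path": "/login",    "status": 200, "ms": 80},
    {"path": "/login",    "status": 200, "ms": 100},
]

# Aggregate the raw data points into summary metrics.
count = len(requests)
avg_response_ms = sum(r["ms"] for r in requests) / count
error_rate = sum(1 for r in requests if r["status"] >= 500) / count * 100

print(f"count={count} avg={avg_response_ms:.0f}ms errors={error_rate:.0f}%")
# count=4 avg=160ms errors=25%
```

Real metric pipelines do the same arithmetic continuously over time windows and across dimensions such as service, endpoint, or region.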
APIs enable different software components to communicate with each other using a set of definitions and protocols. In OpenTelemetry, APIs define the data types and operations for generating and correlating telemetry data such as tracing, metrics, and logs.
SDKs define configuration, data processing, and exporting concepts. They also define the requirements for the implementation of any language-specific API.
Collectors receive, process, and export telemetry data to an Observability backend that stores and analyzes it. They can ingest and emit data in multiple formats, such as OTLP (OpenTelemetry Protocol) or the formats used by Prometheus, Jaeger, etc.
Collectors provide two deployment methods: as an agent running alongside each service instance (for example, as a sidecar or host daemon), or as a standalone gateway service that receives data from many agents.
Collectors are made up of three kinds of components: receivers, which ingest telemetry; processors, which transform it; and exporters, which send it onward.
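A minimal collector configuration wires these components into a pipeline: receive OTLP, batch it, and export it onward. This is a hedged sketch; the endpoints are placeholders for illustration.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlphttp:
    # Placeholder backend address for illustration.
    endpoint: https://observability-backend.example.com:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Swapping the backend is a matter of changing the `exporters` section; the instrumented applications are untouched.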
Exporters decouple the instrumentation from the backend and provide the functionality to send telemetry to consumers. You can change the backend tool (Prometheus, Jaeger, Zipkin, etc.) without changing anything in the instrumentation, choosing from a wide variety of open-source and third-party tools.
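The decoupling works because instrumentation talks to an abstract exporter interface rather than a concrete backend. A plain-Python sketch of that pattern (class names are made up, not the OpenTelemetry API):

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SpanExporter(ABC):
    """The interface instrumentation depends on."""

    @abstractmethod
    def export(self, spans: list[dict]) -> None: ...


class ConsoleExporter(SpanExporter):
    def export(self, spans: list[dict]) -> None:
        for span in spans:
            print("exported:", span["name"])


class InMemoryExporter(SpanExporter):
    """Stands in for a Jaeger/Zipkin/Prometheus-style backend."""

    def __init__(self):
        self.received: list[dict] = []

    def export(self, spans: list[dict]) -> None:
        self.received.extend(spans)


def instrumented_work(exporter: SpanExporter) -> None:
    # Instrumentation only knows the interface, never the backend.
    exporter.export([{"name": "GET /checkout"}])


backend = InMemoryExporter()
instrumented_work(backend)  # swap in ConsoleExporter() without touching the code above
```

Changing the backend means constructing a different `SpanExporter`; `instrumented_work` never changes, which is exactly the property exporters give you.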
Developer and operations teams used to spend significant amounts of time collecting telemetry data and trying to understand the different formats in which it was collected. This made troubleshooting any issue related to the underlying infrastructure or application complex. OpenTelemetry brought forth a standard for collecting and sending telemetry data on infrastructure or applications in complex (i.e., microservice-based) architectures, freeing developer and operations teams to spend their time preventing incidents and resolving issues.
OpenTelemetry does not provide the actual observability back-end. Ops teams are responsible for exporting data to the many available open-source and proprietary analysis tools.
Ultimately, OpenTelemetry helps developers and Ops teams by providing a pluggable architecture so that formats and additional protocols can be added easily.
Instrumentation is key. You can pick between automatic and manual instrumentation.
With automatic instrumentation, you’ll need to add dependencies and then configure the OpenTelemetry instrumentation. These language-specific dependencies add the OpenTelemetry API and SDK capabilities; exporter dependencies may also be required.
Different configuration options are available such as:
With manual instrumentation, you’ll need to import the OpenTelemetry API and SDK, configure the API by creating a tracer, configure the SDK if you are building a service process, create telemetry data such as traces, metrics, and logs, and then export that data to backend tools such as Prometheus, Jaeger, or Zipkin.
OTEL transmits data using a wire protocol known as OTLP (OpenTelemetry Protocol). The collector can run as a proxy or sidecar to service instances, or on a separate host, and can be configured to export the data to analysis tools, whether open-source (Jaeger, Prometheus) or proprietary (AppDynamics, Datadog).