What is OpenTelemetry?

Louis-Victor Jadavji
August 2, 2022

Introduction to OpenTelemetry (OTEL)

OpenTelemetry was born from the need to deal with today's increasingly complex and distributed architectures

The number of services running in a modern architecture has exploded in recent years. So much so that developers and Ops teams needed new standards to collect metrics, logs, and traces to meet their performance objectives.

This guide provides a deep dive into the subject of OpenTelemetry. If you're on the lookout for a tool to analyze your OpenTelemetry data, you can get a free recommendation by completing our questionnaire.

What is OpenTelemetry?

OpenTelemetry (OTEL) is an open-source Observability framework composed of several tools, APIs, and SDKs, operating under the Cloud Native Computing Foundation (CNCF). It was formed through a merger of the OpenTracing and OpenCensus projects to help collect telemetry data (logs, traces, and metrics) from apps and services and forward it to tools that analyze performance and behavior. OTEL supports many modern programming languages and frameworks, such as Java, .NET, C++, Go, Python, Node.js, PHP, Ruby, Rust, Swift, Erlang, and NestJS.

Why OpenTelemetry? 

To become more agile, companies are moving away from monolithic architectures to microservice-based architectures. However, a more distributed architecture does make it difficult for developers and Ops teams to understand the dependencies between services during outages. Observability, and therefore OpenTelemetry, is an approach to instrumentation for gathering actionable data on these services and systems and identifying issues faster.

Operations teams should:

  1. know what data to collect;
  2. collect it quickly to address availability and performance issues of applications or infrastructure; and,
  3. have the ability to correlate metrics, logs, and traces to meet team objectives for operational excellence.

Example: Let’s say you have three applications interacting with an API service that sends requests to a database. When the Ops team gets an alert that the database is down, they need to check metrics like CPU utilization and memory usage, as well as how requests are reaching the database. Then, they’ll check the logs related to the API service and any application-related logs and traces. With the help of these three pillars of Observability (traces, metrics, and logs), the Ops team should be able to find out why the database is down and fix the issue accordingly.

However, at one point, there was no standardized data format for sending data to the Observability back-end. To bring about standardization, two open-source projects were created by the cloud community:

  1. OpenTracing: Provides vendor-neutral APIs and instrumentation for distributed tracing. Developers need to implement their own libraries to meet the specification.
  2. OpenCensus: Provides a set of libraries for various languages that allow for collecting application metrics and traces, and the transferring of data to any one of the supported Observability back-ends.

And so, OpenTelemetry was formed by merging these aforementioned open-source projects.

What is telemetry data?

Telemetry is used in multiple fields such as meteorology, medicine, and software development. In software development, telemetry refers to an automated process of collecting data about different systems and applications remotely and transmitting that data to a location that analyzes the performance of said systems and applications. This has multiple layers, such as APIs, canonical implementations, SDKs, and infrastructure data. 

Traces, metrics, and logs are telemetry data types (i.e., the “golden triangle”). You can use OpenTelemetry to instrument, generate, collect, and export this telemetry data from your application or infrastructure to understand what is going on. Each type of data source has a specific role in application and infrastructure monitoring, which we cover in-depth below.

Traces

When a request is made by a user, application, or service, a trace gives a high-level picture of what is happening. In microservice-based architectures, there will be many interdependent services, and most operations involve multiple hops through multiple services.

In such cases, tracing allows for visibility into the end-to-end health of an application.

Different components help with tracing in OpenTelemetry:

  • Tracer Provider: As the name implies, it creates/provides tracers, which is the first step. Resource and exporter initialization is included in the tracer provider’s one-time initialization. The lifecycle of the tracer provider matches the application's lifecycle in most cases. In some language SDKs, a Global Tracer Provider is initialized instead. 
  • Tracer: Tracers create spans, which contain information about a request. Similar to a global tracer provider, a global tracer is also initialized in some languages. 
  • Trace Exporter: Trace exporters export/send traces to a consumer which can be an OpenTelemetry collector or any backend (open source or third-party tool). 
  • Trace Context: This contains the metadata about trace spans, and correlates between spans across services and process boundaries.
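
To make these pieces concrete, here is a minimal sketch using the OpenTelemetry Python SDK. It assumes the opentelemetry-api and opentelemetry-sdk packages are installed; the span and attribute names are purely illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Tracer Provider: initialized once, typically at application startup.
provider = TracerProvider()

# Trace Exporter: here spans are printed to the console; in production this
# would usually point at an OpenTelemetry collector or a backend.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Tracer: obtained from the (global) tracer provider and used to create spans.
tracer = trace.get_tracer("example.instrumentation")

# Spans: each records information about one unit of work in a request;
# the trace context links the child span below to its parent.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.method", "GET")  # illustrative attribute
    with tracer.start_as_current_span("query-database"):
        pass
```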

Logs

Logs contain high-level or detailed data, such as:

  • the timestamp of when the event occurred;
  • the name of the system logging that information, severity, application name, log message, etc.; and,
  • structured, unstructured, or plain text data.

These logs are written to log files, containing every event. Whereas metrics show “what” happened, logs show “why” it happened.

Ops teams usually determine how much log data to collect, for what periods, and at what level of detail, by balancing what they need to solve issues and storage costs for the logs. The Ops team can set up appropriate retention periods for log data. While it’s not easy to read all the logs when an incident happens, OpenTelemetry does help collect and send these logs to different tools for faster analysis.
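
As an illustration of the fields listed above, here is a small sketch using Python's standard logging module; the logger name, fields, and messages are hypothetical. OpenTelemetry's log support is designed to bridge records like these from existing logging libraries to an exporter, rather than replace them.

```python
import logging

# The format string roughly mirrors the fields described above:
# timestamp, name of the system doing the logging, severity, and the message.
logging.basicConfig(
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("api-service")  # hypothetical service name

# A plain-text event.
logger.info("request received")

# An event with extra structured context attached to the log record;
# handlers or exporters can pick these attributes up downstream.
logger.error("database connection failed", extra={"db_host": "db-1", "retries": 3})
```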

Metrics

Metrics provide statistical information about a system, application, or service. Each metric captures the measurement itself, the time it was captured, and any associated metadata.

With the help of metrics, you can understand key measurements for your infrastructure or application, such as:

  • CPU usage
  • Memory usage
  • Network in
  • Network out
  • Load average

And, can do so across different dimensions, such as:

  • Hostname
  • Application name
  • Service
  • Time Period

With the help of metrics, your Ops team can monitor Key Performance Indicators (KPIs) to get a high-level picture of how systems or applications are performing. Logs can also be analyzed to generate additional metrics, such as average response time or average bytes transferred, and data can be aggregated into summary metrics such as counts, sums, averages, and percentages.
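
As a rough sketch, here is how a metric with dimensions (attributes) could be recorded using the OpenTelemetry Python SDK; the instrument name, attribute names, and values are illustrative.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Periodically export metrics to the console; in production this would point
# at a collector or a monitoring backend instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("example.metrics")

# A counter instrument; each measurement carries its dimensions as attributes.
request_counter = meter.create_counter(
    "app.requests", unit="1", description="Number of requests handled"
)
request_counter.add(1, {"host.name": "web-1", "service.name": "checkout"})
```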

Components of OpenTelemetry

APIs

APIs enable different software components to communicate with each other using a set of definitions and protocols. In OpenTelemetry, APIs define the data types and operations for generating and correlating telemetry data such as tracing, metrics, and logs. 

SDK

SDKs define configuration, data processing, and exporting concepts. They also define the requirements for the implementation of any language-specific API. 
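
A short sketch of this split, assuming the Python implementation: instrumentation code depends only on the API, whose calls are effectively no-ops until an SDK is configured; wiring up the SDK once at startup turns them into real telemetry.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# --- Library / application code: depends only on the API ---
tracer = trace.get_tracer("my.library")  # illustrative instrumentation name

def do_work():
    with tracer.start_as_current_span("do-work"):
        pass  # business logic would go here

# --- Application entry point: configure the SDK once ---
# Until this point, the tracer above is effectively a no-op; once the provider
# is set, the same instrumentation produces real, exportable spans.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

do_work()  # now emits a span through the configured exporter
```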

Collector

Collectors receive, process, and export telemetry data. They require a backend to receive and store the data, and they support multiple data formats, such as OTLP (OpenTelemetry Protocol) and formats used by Prometheus, Jaeger, etc.

Collectors provide two deployment methods:

  • Agent: A collector instance runs with the application or on the same host as the application as a binary, sidecar, or daemon set.
  • Gateway: One or more collector instances run as standalone services in clusters, data centers, or regions.  

Collectors are made up of:

  • receivers that receive data that can be push or pull-based;
  • processors that process the received data; and,
  • exporters that export/send the data which can be push or pull-based.

Exporters help decouple the instrumentation from the backend and provide the functionality to send telemetry to consumers. You can change the backend tool (e.g., Prometheus, Jaeger, or Zipkin) without changing anything in the instrumentation, and you have a wide variety of open-source or third-party tools to choose from.
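
For example, assuming the opentelemetry-exporter-otlp package is installed and an OpenTelemetry collector is listening at the (hypothetical) endpoint below, only the exporter wiring changes; the instrumentation code stays the same.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans to a collector over OTLP/gRPC (the endpoint is hypothetical).
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# The instrumentation itself is unchanged; only the exporter differs.
tracer = trace.get_tracer("example.instrumentation")
with tracer.start_as_current_span("handle-request"):
    pass
```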

Benefits of OpenTelemetry

Developer and operations teams used to spend significant amounts of time collecting telemetry data and trying to understand the different formats in which data is collected. This made troubleshooting any issue related to the underlying infrastructure or application complex. OpenTelemetry brought forth a standard for collecting and sending telemetry data on infrastructure or applications from complex (i.e., microservice-based) architectures. This helped developer and operations teams free up their time to prevent incidents and resolve issues.

How does OpenTelemetry work?

  1. OpenTelemetry collects telemetry data in multiple steps and exports it to a back-end system using a specialized protocol called the OpenTelemetry Protocol (OTLP).
  2. OTEL tells different system components, on both the infrastructure and application side, what data to gather and how to gather it by instrumenting the code with APIs.
  3. It then collects the data using SDKs, sends it for processing, and exports it.
  4. Finally, it enriches the data using multi-source contextualization, which helps reduce errors during analysis.

OpenTelemetry does not provide the actual observability back-end. Ops teams are responsible for exporting data to the many available open-source and proprietary analysis tools. 

Ultimately, OpenTelemetry helps developers and Ops teams by providing a pluggable architecture so that formats and additional protocols can be added easily.

Instrumentation is key. You can pick between automatic and manual instrumentation. 

Automatic Instrumentation 

With this choice, you’ll need to add dependencies and then configure the OpenTelemetry instrumentation. These language-specific dependencies will add the OpenTelemetry API and SDK capabilities. Exporter dependencies may also be required.

Different configuration options are available such as:

  • Data Source-specific Configuration
  • Exporter Configuration
  • Propagator Configuration
  • Resource Configuration  
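
As an illustration for Python, assuming the opentelemetry-distro and opentelemetry-exporter-otlp packages are installed, the agent can be attached and configured entirely through environment variables, without touching application code; the service name and endpoint below are hypothetical.

```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install   # adds instrumentation for libraries it detects

OTEL_SERVICE_NAME=checkout-service \
OTEL_TRACES_EXPORTER=otlp \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```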

Manual Instrumentation

With this choice, you’ll need to:

  1. import the OpenTelemetry API and SDK;
  2. configure the OpenTelemetry API by creating a tracer;
  3. configure the OpenTelemetry SDK if you are building a service process;
  4. create telemetry data such as traces, metrics, and logs; and,
  5. export the data to backend tools such as Prometheus, Jaeger, or Zipkin.
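
A brief sketch of that flow using the Python API and SDK; the span names and the simulated failure are illustrative, and the console exporter could be swapped for an OTLP, Jaeger, or Zipkin exporter.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode

# Configure the SDK once; swap the exporter to ship data to a real backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.service")  # hypothetical service name

with tracer.start_as_current_span("process-order") as span:
    span.add_event("order validated")  # attach a point-in-time event
    try:
        raise RuntimeError("payment gateway timeout")  # simulated failure
    except RuntimeError as exc:
        span.record_exception(exc)                 # record the exception on the span
        span.set_status(Status(StatusCode.ERROR))  # mark the span as failed
```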

OTEL defines a wire protocol known as OTLP. The collector that receives it can run as a proxy or sidecar alongside service instances, or on a separate host, and it can be configured to export the data to analysis tools, whether open-source tools such as Jaeger and Prometheus or proprietary platforms such as AppDynamics and Datadog.
