We all understand that proper data analytics is crucial to the success of an organization. But what if your analytics could do more than help you troubleshoot current problems? Splunk is building a future where data analytics proactively solves problems before they occur.
Data is essential to success and innovation for modern organizations. However, no commercial vendor has an effective single instrument or tool to collect data from all of an organization’s applications.
An open source framework fills that gap: OpenTelemetry. By providing a common format of instrumentation across all services, OpenTelemetry enables DevOps and IT groups to better understand system behavior and performance.
Last week, Splunk’s Spiros Xanthos joined us on Dev Interrupted to explain OpenTelemetry - and to understand OpenTelemetry, we first need to understand Observability.
Observability is the practice of measuring the state of a system by its outputs. The concept is used to describe and understand how self-regulating systems operate. Increasingly, organizations are adding observability to distributed IT systems to understand and improve their performance and to enable teams to answer a multitude of questions about these systems’ behavior.
Managing distributed systems is challenging because of their high number of interdependent parts, which increases the number and types of potential failures. As a result, diagnosing a problem from the current state of a distributed system is harder than it is in a more conventional system.
“It’s very, very difficult to reason about a problem when it happens. Most of the issues we’re facing are, let’s say, ‘unknown, unknowns’ because of the many, many, many, failure patterns you can encounter.” - Spiros Xanthos, from the Dev Interrupted Podcast at 3:02
Observability is well suited to handle this complexity. It allows for greater control over complex modern systems and makes their behavior easier to understand. Teams can more easily identify broken links in a complex environment and trace them back to their cause.
For example, Observability allows developers to approach system failures in a more exploratory fashion by asking questions like “Why is X broken?” or “What is causing latency right now?”
Telemetry data is the output collected from system sources in observability. This output provides a view of the relationships and dependencies within a distributed system. Often called “the three pillars of observability”, telemetry data consists of three primary classes: logs, metrics, and traces.
Logs are text records of events that happened at a particular time; a metric is a numeric value measured over an interval of time, and a trace represents the end-to-end journey of a request through a distributed system.
Individually, logs, metrics, and traces serve different purposes, but together they provide the comprehensive detailed insights needed to understand and troubleshoot distributed systems.
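To make the three signal types concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API; the class names, field names, and sample values are all hypothetical and chosen only to mirror the definitions above. It shows how a trace yields the end-to-end time of a request and how a log can be tied back to the step that produced it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogRecord:
    """A text record of an event that happened at a particular time."""
    timestamp_ms: int
    message: str


@dataclass
class MetricPoint:
    """A numeric value measured over an interval of time."""
    name: str
    value: float
    interval_start_ms: int
    interval_end_ms: int


@dataclass
class Span:
    """One step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # name of the calling span, if any


def trace_duration_ms(spans: List[Span], trace_id: str) -> int:
    """End-to-end time of one request: earliest start to latest end."""
    own = [s for s in spans if s.trace_id == trace_id]
    if not own:
        raise ValueError(f"no spans recorded for trace {trace_id!r}")
    return max(s.end_ms for s in own) - min(s.start_ms for s in own)


def logs_within(span: Span, logs: List[LogRecord]) -> List[LogRecord]:
    """Logs whose timestamp falls inside the span's time window."""
    return [l for l in logs if span.start_ms <= l.timestamp_ms <= span.end_ms]


# Hypothetical data for a single checkout request.
spans = [
    Span("t1", "checkout", start_ms=0, end_ms=250),
    Span("t1", "charge-card", start_ms=40, end_ms=210, parent="checkout"),
]
logs = [LogRecord(timestamp_ms=205, message="card declined")]
metric = MetricPoint("http.requests", 1.0, interval_start_ms=0, interval_end_ms=1000)
```

Taken alone, the metric only says that one request happened in the interval. Combined with the trace and the log, it becomes possible to see which step of which request went wrong.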
OpenTelemetry is used to collect telemetry data from distributed systems in order to troubleshoot, debug and manage applications and their host environment. In addition, it offers an easy way for IT and developer teams to instrument their code base for data collection and make adjustments as an organization grows. For more information, Splunk has an in-depth look at OpenTelemetry.
“In terms of activity, it is the second most active project in CNCF (Cloud Native Computing Foundation), the foundation that essentially started with Kubernetes. So it’s only second to Kubernetes and it’s pretty much supported by every vendor in the industry. And of course, ourselves at Splunk are big supporters of the project. And we also rely on it for data collection.” - from the Dev Interrupted Podcast at 16:47
In the two years since its announcement, OpenTelemetry has become highly successful.
On the Dev Interrupted podcast, Spiros discussed how, in his role as the VP of Observability and IT Ops at Splunk, he has seen OpenTelemetry grow to become an industry standard that Splunk relies upon for data collection. He highlighted three key benefits of OpenTelemetry:
Before OpenTelemetry existed, collecting telemetry data from applications was significantly harder. Selecting the right instrumentation mix was a challenge, and vendors locked companies into contracts that made changes difficult when they became necessary. Instrumentation solutions were also generally inconsistent across applications, which caused significant problems when trying to get a holistic understanding of an application’s performance. OpenTelemetry, by contrast, offers a consistent path to capture telemetry data and transmit it without changing instrumentation. This has created a de facto standard for observability on cloud-native apps, enabling IT teams and developers to spend more time creating value with new app features instead of struggling to understand their instrumentation.
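The idea of transmitting data "without changing instrumentation" can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions, not OpenTelemetry's actual API: `Exporter`, `InMemoryExporter`, `ConsoleExporter`, `Telemetry`, and `handle_request` are hypothetical names. The point is that application code is instrumented once against a common interface, and the destination is chosen separately.

```python
from typing import Dict, List, Protocol


class Exporter(Protocol):
    """Anything that can receive a telemetry record (hypothetical interface)."""

    def export(self, record: Dict[str, object]) -> None: ...


class InMemoryExporter:
    """Keeps records in a list, standing in for one backend."""

    def __init__(self) -> None:
        self.records: List[Dict[str, object]] = []

    def export(self, record: Dict[str, object]) -> None:
        self.records.append(record)


class ConsoleExporter:
    """Prints records, standing in for a different backend."""

    def export(self, record: Dict[str, object]) -> None:
        print(record)


class Telemetry:
    """The only object application code talks to."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, value: float) -> None:
        self._exporter.export({"name": name, "value": value})


def handle_request(telemetry: Telemetry) -> str:
    # Instrumented once; this function never learns which backend is in use.
    telemetry.record("requests.handled", 1)
    return "ok"


# Switching backends changes this wiring only, never handle_request itself.
backend = InMemoryExporter()
result = handle_request(Telemetry(backend))
```

Replacing `InMemoryExporter()` with `ConsoleExporter()` sends the same record elsewhere while `handle_request` stays untouched, which is the property that removes vendor lock-in at the instrumentation layer.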
Prior to OpenTelemetry, there were two paths to achieving observability, OpenTracing and OpenCensus, and organizations had to choose between them. OpenTelemetry merges the code of these two options, giving us the best of both worlds. Plus, with OpenTelemetry’s backwards compatibility with OpenTracing and OpenCensus, there are minimal switching costs and no risk in switching.
With OpenTelemetry, developers can view application usage and performance data from any device or web browser. Now, it’s easy and convenient to track and analyze observability data in real time.
The main benefit of OpenTelemetry, however, is having the knowledge and observability you need to achieve your business goals. By consolidating system telemetry data, we can evaluate whether systems are functioning properly and understand whether issues are compromising performance. Then, it’s easy to fix the root causes of problems, often even before service is interrupted. Altogether, OpenTelemetry results in both improved reliability and increased stability for business processes.
With increasingly complex systems spread across distributed environments, it can be difficult to manage performance. Analysis of telemetry data allows teams to bring coherence to multi-layered ecosystems. This makes it far easier to observe system behavior and address performance issues. The net result is greater efficiency in identifying and resolving incidents, better service reliability, and reduced downtime.
OpenTelemetry is the key to getting a handle on your telemetry, allowing the comprehensive visibility you need to improve your observability practices. It provides tools to collect data from across your technology stack, without getting bogged down in tool-specific deliberations. Ultimately, it helps facilitate the healthy performance of your applications and vastly improves business outcomes.
Listen here if you want a deeper dive into the topics of OpenTelemetry and Observability, and how Splunk leverages them.