A good SRE will tell you your service is never down. A great SRE will tell you that’s not what you should be measuring. In fact, they’ll tell you their job is customer service.

Site Reliability Engineering (SRE) has grown immensely popular, with many of the world’s largest tech companies, like Netflix, LinkedIn, and Airbnb, employing SRE teams to keep their systems reliable and scalable.

Along the way, SREs have become some of the most sought-after engineers in tech.

The role is traditionally understood as ensuring that services are reliable and unbroken, but reliability and uptime aren’t perfect metrics. Perhaps what organizations should be asking themselves is what their customers think of their service. 

Wandering down to your engineering department and asking your SRE team about customer satisfaction is a good place to start. 

Their answer just might surprise you. 

History of SRE

In practice, Site Reliability Engineering has been around for a while. In the past its functions were covered by roles with names like production ops, disaster recovery, testing, or monitoring. The rise of cloud computing created a need for more engineers in production. The complexity only grew as more organizations transitioned from monolithic infrastructures to distributed microservices.

Modern Site Reliability Engineering originated at Google in 2003 with the work of Benjamin Treynor, who is seen as the “father” of what we now simply call SRE. Treynor, who coined the term, was a software engineer placed in charge of running a production team. With the goal of making Google’s website as reliable and serviceable as possible, he asked that his team spend half their time on operations tasks so they could better understand software in production. This team would become the first-ever SRE team.

“Ben Treynor said, I'm paraphrasing, ‘[SRE] is essentially like throwing a software engineer at an operations problem’, right? Because you come from that developer mindset, that design and, you know, you think about all of these things. So think about it as a developer but apply it to an operational type of problem.” - Brian Murphy on the Dev Interrupted podcast at 4:26

Why not uptime?

So why shouldn't you be too concerned about your uptime metrics? In reality, SRE can mean different things to different teams, but at its core, it’s about making sure your service is reliable. After all, it’s right there in the name.

Because of this, many people assume that uptime is the most valuable metric for SRE teams. That is flawed logic.

For instance, an app can be “up,” but if it’s incredibly slow or its users don’t find it practically useful, then the app might as well be down. Simply keeping the lights on isn’t good enough, and uptime alone doesn’t account for things like degradation or pages that fail to load.

It may sound counterintuitive, but SRE teams are in the customer service business. Customer happiness is the most important metric to pay attention to. If your service is running well and your customers are happy, then your SRE team is doing a good job. If your service is up and your customers aren’t happy, then your SRE team needs to reevaluate.

A more holistic approach is to view your service in terms of health. 

The Four Golden Signals

As defined by Google, the four golden signals of SRE are latency, traffic, errors, and saturation. If these can be managed effectively, then you probably have a healthy system.
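The four golden signals (latency, traffic, errors, and saturation) can all be derived from ordinary request logs. Below is a minimal sketch of what that computation looks like; the `Request` record and field names are hypothetical, and a real deployment would pull these numbers from a monitoring stack rather than compute them by hand:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float  # how long the request took -> latency
    status: int         # HTTP status code -> errors

def golden_signals(requests, window_seconds, cpu_busy_seconds):
    """Summarize the four golden signals over one observation window."""
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Latency: median and tail (p99) request duration
    p50 = durations[n // 2]
    p99 = durations[min(n - 1, int(n * 0.99))]
    # Traffic: requests per second over the window
    rps = n / window_seconds
    # Errors: fraction of requests that returned a 5xx status
    error_rate = sum(1 for r in requests if r.status >= 500) / n
    # Saturation: how "full" the service is (CPU utilization here)
    saturation = cpu_busy_seconds / window_seconds
    return {
        "latency_p50_ms": p50,
        "latency_p99_ms": p99,
        "traffic_rps": rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }

# Example: four requests observed in a 60-second window
reqs = [Request(120, 200), Request(80, 200),
        Request(950, 500), Request(60, 200)]
signals = golden_signals(reqs, window_seconds=60, cpu_busy_seconds=21)
```

Tracking these numbers per window is exactly the kind of baseline measurement discussed below: once you know what "healthy" looks like, drift in any one signal is an early warning.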

Establishing system health

“The best way to get started is just measuring stuff, you know, just getting the baseline of what's healthy, what's not healthy, what looks like health, and then you can start working from there.” - Brian Murphy on the Dev Interrupted podcast at 10:49

It can be difficult to know whether or not your organization should consider forming an SRE team, or what your next steps are if you’ve already made the decision. 

Again, think of your decision in terms of a holistic approach, not just your uptime. If you have high uptime, that’s fantastic, but what you should be establishing is a benchmark. 

Using the four golden signals to guide you, establish what you think a healthy system should look like and set your benchmark. Keep measuring over time and you will begin to see the areas that are good or require more work. 

These measures will help inform all of your future decisions. Perhaps your organization is ready to roll out new features or make choices around expanding your service. 

Critically, the health you establish provides insights into customer happiness. If things look good you probably have happy customers. 

Internal customers

When done right, SREs aren’t just making customers happy; they’re making the lives of developers easier too. Nothing is worse than having to stop because there’s a problem in production. Good SRE teams can shield dev teams by focusing on major hotspots.

If the fires are being managed before they are out of control, it allows developers to keep pushing out features. It even gives them the freedom to keep breaking things, if necessary!

When things do break, or require a slowdown, a dialogue can occur. A good SRE understands that the developer who wrote a piece of code understands it better than anyone. The model for good internal customer service is an SRE who brings in a developer, gives them ownership of the code they created, and offers to help them fix it.

Happy customers are the best customers

Whether you already have an SRE team or are thinking about forming one, remember to think beyond the engineering - think about the customer. 

Ask yourself if your customers are happy and if you would describe your service as healthy. Remember to think about your own teams as well; your developers will thank you for it.

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is based on an episode of Dev Interrupted - the go-to podcast for engineering leaders.

Dev Interrupted features expert guests from around the world to explore strategy and day-to-day topics ranging from dev team metrics to accelerating delivery. With new guests every week from Google to small startups, the Dev Interrupted Podcast is a fresh look at the world of software engineering and engineering management.

Listen and subscribe on your streaming service of choice today.

 

To fight the wars of the future the US Air Force tasked a small group of software engineers with a simple job - revolutionize the way the military thinks about software development.

The group tasked with this not-so-tiny problem came to call themselves “Kessel Run” after the famed smuggling route used by Han Solo in Star Wars.

Since starting in 2017, the team at Kessel Run has expanded to include over 1,300 people across multiple locations, helping build, test, deliver, operate and maintain cloud-based infrastructure and warfighting software. These applications are used by airmen worldwide and represent the future of warfare.

That’s because the wars of the future will be fought with software and system architecture as much as any other weapon.

 

What is Kessel Run? 

Han Solo smuggles DevOps in the Department of Defense.

“[Kessel Run] was kicked off about four years ago as a way to prove that the Department of Defense didn't have to be terrible at building and delivering software, regardless of being within the world's largest bureaucracy.” - Adam Furtado, on the Dev Interrupted podcast at 1:35

As an Air Force organization, Kessel Run delivers a wide variety of mission capabilities to warfighters around the world, utilizing industry best practices around DevOps and Agile. At the time of its inception, it represented such a radical departure from the normal way of thinking within the Department of Defense (DoD) that people joked it would have to be “smuggled” into the DoD.

That’s how Kessel Run came to earn its name - a scruffy team outfitted with a mission to upend a stodgy and cumbersome bureaucracy. 

A shift in thinking needed to start with culture. The team at Kessel Run decided to bring a startup-like mentality to the behemoth that is the federal government, with a goal of introducing modern software methodologies at scale. Pockets within the DoD were practicing things like continuous delivery, but prior to Kessel Run, attempts to adopt modern software principles had largely failed. Warfighters weren’t getting the capabilities or tools they needed.

Problem Solvers

One of the biggest institutional problems that Kessel Run was tasked with improving was the Air Force’s Air Operations Centers. Spread across twenty-two locations around the world, these organizations manage all the details involved in fighting an air war. Everything from strategy, to planning, to tasking aircraft to perform certain actions, to providing real-time intelligence data and feedback, is handled at Air Operations Centers.

The challenge was modernizing these centers while maintaining operational readiness and current hardware - much of which was 20 to 30 years old. All of the hardware across these locations came with its own integrated software, built from various third party sources over decades. 

To tackle this challenge the team at Kessel Run applied the principles of Gall’s Law, which states that all complex systems that work evolved from simpler systems that worked. 

By starting small and focusing on rapidly achievable solutions, they began to see the network effects of their actions. Small, precise fixes can have tremendous impacts on an organization and are less prone to failure than attempting systematic change overnight. 

 “So we knew that using Gall’s Law in history, that we needed to start small, in order to make this work. We couldn't just have a big bang approach to replace this entire system. Right? You did that by chipping away at some core parts of the system from a user functionality perspective.” - Adam Furtado, on the Dev Interrupted podcast at 13:31

Practical Success

One of the first small changes achieved by Kessel Run was with the Air Force’s air refueling program.

A remarkable acrobatic feat performed at more than 20,000 feet above ground, at speeds close to 400 miles per hour, replenishing the fuel of an aircraft is dangerous but necessary work. Every day, fighter jets and bombers rendezvous with fuel tankers to perform air-to-air refueling before continuing on with their missions.

Optimizing the details of such a delicate dance would be difficult, but the folks at Kessel Run believed they could do it. First, they needed software engineers. One of the problems of developing software in the federal government is a lack of engineers; or rather, a lack of engineers who can be found in-house. Historically speaking, the government outsources everything to contractors.

Scrounging the Air Force for active-duty software engineers scattered across separate programs, Kessel Run was able to stitch together its own homegrown software engineering team.

With their mission in hand, they set to work building an initial application nicknamed “Jigsaw” to improve the air refueling process. By optimizing every aspect of the process, from timing to altitude, Jigsaw became an enormous success. Within a year of implementation, the Air Force was saving $12.8 million a month on fuel.

Refueling Jets is expensive.

 

 

Tiny, targeted successes like these continued. But Kessel Run was up against more than just inefficient programs. 

A New Way of Thinking

Changing company culture is notoriously difficult. Changing culture inside the world’s largest bureaucracy is as hard as it gets. 

The most difficult problem that Kessel Run had to tackle wasn’t the lack of software developers, the difficulty of integrating third-party software applications, or figuring out how to optimize and build combat applications; it was how to communicate with their peers in the DoD.

Part of the difficulty was due to the security implications of such work. The production environments are all on classified systems, making things like cloud implementation and tooling availability difficult.  

However, navigating the business side of the DoD was always the most challenging part. Over the past 30 years, the government has spent over a billion dollars trying to update its systems to provide the best capabilities possible to warfighters, in order to prepare for a war that may never happen.

Until Kessel Run, the government didn’t have much to show for their efforts. A perception existed that new software methodologies and practices were just the next iteration of technologies that overpromised and underdelivered. It took a lot of trust to explain that doing something in a more agile way or using DevOps, would actually reduce risk and increase success for the organization.

“The problem we have is we go and talk about how deployment frequency is going to buy down risk for us. That sounds counterintuitive to everybody in the world, particularly in a military environment, where they're like, ‘What do you mean? Change is scary. I don't change stuff.’ So we're having these kind of counterintuitive conversations around why moving to this way of working is less risky and increases our chances of success.” - Adam Furtado, on the Dev Interrupted podcast at 6:39

Solving that problem came down to nothing more than old fashioned relationship building. It took years of evangelism and continued success, but eventually Kessel Run started to win the approval of the right people in the right places. 

Proof is in the Pudding

From starting as an organization with only five software engineers to expanding into a program that currently has over 1,300 people, Kessel Run has proven itself to be an ingenious concept: bring startup culture to an old organization in need of modern ways of thinking.

Government has never been a place that attracted top technology talent, but with Kessel Run, that is changing. They provide access to the newest technologies, competing with some of the best companies in the industry.

They do have one ace up their sleeve when it comes to hiring: fighter jets. And those are pretty cool. 

If you want to learn more about the history and story of Kessel Run, consider listening to the Dev Interrupted podcast featuring Adam Furtado, Kessel Run’s Chief of Platform. 

Dev Interrupted is a weekly podcast featuring a wide array of software engineering leaders and experts, exploring topics from dev team metrics to accelerating delivery. 

______________________________________________________________________________________________________________________________________

If you haven’t already joined the best developer discord out there, WYD?

Look, I know we talk about it a lot but we love our developer discord community. With over 2000 members, the Dev Interrupted Discord Community is the best place for Engineering Leaders to engage in daily conversation. No salespeople allowed. Join the community >>

To call Google a titan of the tech industry would be an understatement. Their name has become synonymous with the internet itself. The very act of retrieving information from the internet - the core functionality of the internet and its most basic purpose - is known simply as “Googling” something. On their road to becoming the web’s biggest search engine and a moniker for the internet itself, Google also pioneered much of what it takes to grow a company at scale. 

On the Dev Interrupted Podcast, Google senior engineers Hyrum Wright and Titus Winters shared their lessons learned from programming at Google with LinearB Co-Founder and COO Dan Lines. Both engineers have a deep understanding of the principles behind software development: Hyrum is semi-famous as the "Hyrum" of Hyrum's Law, while Titus is responsible for 250 million lines of code that over 12,000 developers work on.

But what lessons can we take from their interview - and their book Software Engineering at Google: Lessons Learned from Programming Over Time? How can we apply those lessons to our own projects? I’ve pulled out the core takeaways from their interview and condensed them so that any developer or company, be they responsible for 2,000 lines of code or 2,000,000, can learn something from Google’s roadmap. 

 

Why listen to Google

In spite of their enormous success and scale, Google doesn’t pretend to have all the answers; this lack of presumption is exactly the thinking that has made them the titan they are. Previous success is no guarantee of future success and no one understands this better than Google.

“One thing that Google is very good at is not accepting how everyone else does it as the one true way.” - Titus Winters, on the Dev Interrupted podcast at 20:35

It’s easy to assume that events and conclusions are foregone, or that one event naturally follows another. Yet this is rarely the case. Most of the time people are working towards a specific outcome, and it’s not until later that the outcome is apparent contextually. This is true in life and software development. 

Google has spent the past couple decades approaching everything they do as trial and error, learning what does and does not work, and trying to institutionalize the things that do work. This is not a straight path.

This mindset is obvious in Hyrum and Titus’ interview. Titus uses the analogy of Lewis and Clark to explain how the software development process at Google has unfolded. 

“They say Lewis and Clark explored the Louisiana Purchase, by which we mean, they took one path out and one path back, which is not exactly mapping, but in a similar way, we're trying to give an exploration/trip-report/map.” - Titus Winters, on the Dev Interrupted podcast at 6:17

He’s admitting that Google doesn’t have all of the answers; that Google’s path isn’t the only path to success; that other paths may be superior to Google’s; and that their path may not work for every business. But, with decades of experience, Google has learned a thing or two along the way, and maybe, just maybe, we can all learn something from the path they have trailblazed.  

After all, there aren’t many companies in the world that have hundreds of millions of lines of code. 

The 3 Pillars of Software Engineering at Google

“Anyone can change a line of code or change 10 lines of code. But how about changing 10,000 lines of code or 100,000 lines of code in a reasonable time?” - Hyrum Wright, on the Dev Interrupted Podcast at 17:15

With so much code to manage, Google has made maintaining their code a strategic goal. Code must be fresh and able to sustain changes to the code base for business or technical reasons. To best allow for this, they have identified 3 concepts that they believe are core to software engineering. 

1. Time

All of the hardest problems that software engineers have to deal with, like version skew, backward compatibility, issues with data storage, dependency management, upgrades, and many more, are problems created by time. Once dev teams realize this, it will change their perspective on how best to write code.

For instance, if you are going to retire your code within one or two years, then you probably don’t need to worry much about making changes or upgrades to it in the future. But if you are writing code that will still be in use five or ten years from now, then you may want to approach it differently.

If dev teams want their code base to last, they need to think about constructing code so that it can sustain changes within an organization’s lifespan. This fundamental realization allows time to peacefully coexist with the second pillar. 

2. Scale

How your system scales is closely tied to time. Scaling isn’t a new problem, and Google has been at the forefront of pushing scale for its entire existence. For instance, email existed before Gmail and search engines existed before Google, but Google’s brilliance was its ability to scale these technologies better than its competitors. It’s the root of their success.

To beat their competitors they adopted a mindset of scaling as a process - a continual evolution. 

As a company grows, all of its operations expand, and that continued expansion requires more resources, which beget even more resources still. This means operational growth cannot be allowed to occur superlinearly, because if it does, the company will eventually consume all of its resources just maintaining the status quo.

The key takeaway is to make sure your codebase and software both scale sublinearly; that way, if your codebase doubles or triples, you won’t need six times as many engineers just to maintain your systems. (Sublinear scaling refers to team growth that occurs more slowly than the growth in the number of services a company supports. Superlinear growth is the opposite, with team growth outpacing the number of supported services.)
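The arithmetic behind that takeaway can be sketched with a toy model. The exponent values and service counts below are purely illustrative assumptions, not figures from the episode or the book; the point is only to show how the exponent drives required team growth:

```python
def team_growth_factor(old_services, new_services, exponent):
    """Toy model: how much the team must grow when the number of
    supported services grows, assuming operational cost scales as
    services ** exponent.

    exponent == 1.0 -> linear: team grows 1:1 with services
    exponent <  1.0 -> sublinear: shared tooling amortizes the work
    exponent >  1.0 -> superlinear: coordination overhead compounds
    """
    return (new_services / old_services) ** exponent

# Tripling the number of supported services (100 -> 300):
linear = team_growth_factor(100, 300, 1.0)       # 3x the team
sublinear = team_growth_factor(100, 300, 0.5)    # ~1.7x the team
superlinear = team_growth_factor(100, 300, 1.5)  # ~5.2x the team
```

Under the superlinear curve, each round of growth demands disproportionately more engineers, which is exactly the "consume all of its resources maintaining the status quo" trap described above.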

3. Trade-offs and Costs

After taking into consideration the best practices around time and scale, what is left is good decision making. Just as Hyrum and Titus note in their book, “in software engineering, as in life, good choices lead to good outcomes.”

However, no organization has perfect data on which to base every decision, and therefore must strive to make the best decisions they can with the most data possible. People need insights into what an organization finds impactful.

For instance, if an engineer spends a week on a project, it should probably be a project the organization considers a priority. Because if it is not, then no matter how perfect the code, it probably wasn’t the best use of the engineer’s time. Brain power should be devoted to the most difficult problems, not where a semicolon should be placed. The cost of incorrectly evaluating trade-offs is failing the first two pillars.

 

Coming Home

While Hyrum and Titus may not be Lewis and Clark reborn, they have a valuable story to tell about trial and error in the Information Age.

How a company scales is likely to define how it differentiates itself from its competitors and whether or not it will be successful. A company that can scale sublinearly will thrive, all others will stagnate as victims of their own success. 

Minding the principles of these modern-day explorations into the wilderness of code will help any organization keep an eye on the most valuable resource: time. But remember just as Lewis and Clark found one path forward, they didn’t find the only path forward. 

We can all learn something from Google, but never forget the path forward is your own.


For more lessons in scaling and growth, consider registering for INTERACT, Dev Interrupted's biggest event yet. The interactive, community-driven, digital conference takes place September 30th. Designed by engineering leaders, for engineering leaders, INTERACT will feature 10 speakers, 100s of engineers and engineering leaders, and is totally free.


 

<Register Now>

______________________________________________________________________________________________________________________________________


 

Darren Murph, Head of Remote at GitLab

The office of the 20th century is a testament to design. A great deal of thought goes into the layout of a building: How are the offices laid out? Where are the elevators located? Where will teams meet? But the focus on co-located office space is quickly becoming a relic of the past. To meet the challenges of the 21st century, GitLab's Head of Remote Darren Murph is pushing organizations to put just as much thought into their remote work structure as they would into an office building.

For many companies, the transition to this mindset is difficult. They've shifted into remote work out of necessity but maintain the 20th-century ‘office-first’ mindset. While this is passable and can work, it doesn't take full advantage of the key benefits of a virtual atmosphere.

To take advantage of the shifting dynamics, GitLab is using their own platform to consolidate all of their virtual collaboration. Providing a single source of truth, GitLab has designed the virtual version of a central hallway where all work is funneled. This breaks down organizational silos and enables the GitLab team to collaborate with maximum efficiency by making sure that everything is as visible and as transparent as possible for everyone in the organization.

A company’s ‘central hallway’ is going to look different from organization to organization, but the takeaway for all remote organizations and engineering leaders should be the importance of de-siloing information across your organization. This will encourage virtual collaboration and boost creativity. 

Meetings that Support Remote Culture

A Chief People Officer once asked Darren, “How do we make our meetings better?” His response? “Make them harder to have.” 

Darren believes that you should have as few meetings as possible because people deserve to be able to focus on their work. From this belief flows the practice of using tools like Slack or Microsoft Teams to gather consensus asynchronously, then reserving synchronous time for meetings where only decisions are made or important status updates are shared.

This has the effect of focusing a team’s attention, which is important as teams become distributed around the globe and time zones become a greater issue. It's far too easy for your entire day to be consumed by meetings with teams across your organization, with people coming online in various time zones to fill your calendar. Instead, the focus should remain on having critical day-to-day functions performed asynchronously, with meetings taking a back seat.

In addition to focusing an organization's efforts, being thoughtful about structuring remote work also reduces meeting fatigue. We’ve all experienced being on Zoom or other video conferencing software continuously throughout the day. Not only is it inefficient and distracting, but it can lower your company morale and leave you exhausted and feeling like you didn't accomplish anything during the day.

Darren’s ideas may have seemed radical just a couple of years ago. But he and the folks at GitLab are pioneering - and thriving - in today’s remote environment. The office of the 21st century is undoubtedly going to be virtual, so remember to put as much rigor and thought into your virtual work structure as you would if you were designing a building. 

To learn more about how GitLab and other companies transitioned to remote work, check out Dev Interrupted's Remote Work Panel on August 11, from 9-10am PST.

Image showing four leaders of remote work from Dev Interrupted

Interested in learning more about how to implement remote work best practices at your organization?

Join us tomorrow, August 11, from 9am-10am PST for a panel discussion with some of tech’s foremost remote work experts. This amazing lineup features:

Dan Lines, COO of LinearB, will be moderating a discussion with our guests on how they lead their teams remotely, how the current workplace is changing, and what's next as the pandemic continues to change

 

Don't miss the event afterparty hosted in discord from 10-10:30am with event speakers Chris and Shweta, as well as LinearB team members Dan Lines and Conor Bronsdon.

______________________________________________________________________________________________________________________________________


Dev Interrupted: The New Faces of Engineering Leadership

 

 

 

Shweta Saraf, Senior Director of Engineering at Equinix
Shweta Saraf, the Senior Director of Engineering at Equinix, has a particularly interesting remote work story: she experienced a fully remote acquisition during the pandemic.

Her former employer - Packet - was acquired by Equinix, a huge company with more than 30,000 employees and over 200 locations around the world. Suddenly, the small team at Packet who were experts at remote work found themselves in the position of trying to onboard not just themselves at a new company, but onboard an organization of 30,000+ to the principles, structure, and best practices of a fully remote work environment. Because Equinix is the largest data center company in the world, operating data centers and office hubs all over the globe, the switch to remote work had to be as seamless and efficient as possible.

One of the key areas where Equinix first looked to be more efficient was its meeting practices. They began what they refer to as ‘Better Way Wednesdays’ (their name for a best practice also utilized by Shopify) as a way to better inform employees and leadership. These meetings, paired with a monthly business memo, capture the state of the business along with key achievements, challenges, and blockers, and give senior leaders KPIs and metrics.

This practice made it possible to cut down on the number of weekly status meetings where the same information is passed on in different formats or through different levels of abstraction. The investment paid off immediately. Teams found that the ‘Better Way’ meetings would often take only an hour but would save tons of time across the board. It also had the added benefit of reducing Zoom fatigue. More focus time and better communication were realized through a single meeting shift.

The biggest change Equinix implemented was asynchronous communication, because of the many time zones involved and the number of people all over the world, including engineering teams. Rather than restrict productivity to a specific set of time zones, async communication gives employees the agency to be held accountable for completing their work on their own time, meaning it is no longer necessary to pull employees on separate continents onto the same Zoom call if the same information can be shared in a chat app.

However, for companies with a strong office culture, where ceremonies happen in-office, adapting to fully remote work can be a learning process. With Packet's experience aiding the transition for Equinix, a cross-pollination of ideas took place. Employees from both companies found themselves questioning former agile ceremonies, such as stand-ups and retros, asking whether these could be done asynchronously, or whether they required a meeting at all. The merger resulted in an easier working environment for everyone.

Equinix, a company of tens of thousands of employees and hundreds of locations, transitioned to remote work successfully during the pandemic not because it was remote-friendly, but because it adopted a remote-first mindset, meaning that a developer on the other side of the globe could participate meaningfully and not feel left out. While not every company underwent an acquisition during the pandemic, Equinix's journey to a fully-remote organization is a familiar story for many tech companies this year. To learn more about Equinix and how other companies transitioned to remote work, check out Dev Interrupted's Remote Work Panel on August 11, from 9-10am PST.

Interested in learning more about how to implement remote work best practices at your organization?

Join us tomorrow, August 11, from 9am-10am PST for a panel discussion with some of tech’s foremost remote work experts. This amazing lineup features:

Dan Lines, COO of LinearB, will be moderating a discussion with our guests on how they lead their teams remotely, how the current workplace is changing, and what's next as the pandemic continues to change

Don't miss the event afterparty hosted in Discord from 10-10:30am with event speakers Chris and Shweta, as well as LinearB team members Dan Lines and Conor Bronsdon.

______________________________________________________________________________________________________________________________________

If you haven’t already joined the best developer Discord out there, WYD?

Look, I know we talk about it a lot, but we love our developer Discord community. With over 2,000 members, the Dev Interrupted Discord Community is the best place for Engineering Leaders to engage in daily conversation. No salespeople allowed. Join the community >>

Dev Interrupted: The New Faces of Engineering Leadership

INTERACT is the biggest thing Dev Interrupted has done yet.

An interactive, community-driven, digital conference on September 30th - by engineering leaders, for engineering leaders. 1 day, 10 speakers, 100s of engineers and engineering leaders, all free.


Why attend?


Speaker Lineup

Sessions and more information will be announced over the upcoming weeks - register now to save your spot.

🎉The #InteractDI Afterparty: Hosted by Dzone🎉

Immediately following the event, join our growing engineering leadership community for an afterparty in Discord hosted by Dzone. Click here to join.

______________________________________________________________________________________________________________________________________

This event wouldn’t be possible without our Presenting Sponsor, LinearB, and our Premier Partners, DZone and daily.dev.

Presenting Sponsor

About LinearB

Metrics alone don't improve dev teams. Software Delivery Intelligence (SDI) helps dev teams continuously improve by turning insight into action. Unlike top-down engineering metrics tools which become shelf-ware, LinearB is a dev-first platform and provides value to every member of the team. Development organizations using LinearB's SDI cut their Cycle Time in half after only 90 days. The result for developers is less bureaucracy, fewer interruptions, and more time to build. The result for teams is fewer process bottlenecks and accelerated delivery. Activate Software Delivery Intelligence for your dev team in 5 minutes at linearb.io.

Premier Partners

About DZone

DZone.com is one of the world’s largest online communities and a leading publisher of knowledge resources for software developers. Every day, hundreds of thousands of developers come to DZone.com to read about the latest technology trends and learn new technologies, methodologies, and best practices through shared knowledge. 

About daily.dev

daily.dev is the fastest-growing online community for developers to stay updated on the best developer news, supercharging developers’ knowledge and empowering better software.

 

About Dev Interrupted

Dev Interrupted is the premier community for software engineering leadership and continuous improvement. We publish our podcast weekly, maintain a Discord Community of more than 2,000 engineering leaders, host monthly events, including our new quarterly INTERACT conference, publish articles, create videos, and much more. 

 

Everyone loves free stuff, right?

Even better when that free stuff is both fun and valuable. That’s what Dev Interrupted’s upcoming event - The New Leaders of Remote Work - by engineering leaders, for engineering leaders - is all about.

The New Leaders of Remote Work
Four remote engineering leaders join Dev Interrupted to share their insights and perspectives on the future of remote work - and how to successfully hire, onboard, and work remotely.

 

Join us from 9am-10am PST on August 11th for another great panel discussion with:

Dan Lines, COO of LinearB, will be moderating a discussion with our guests on how they lead their teams remotely, how the current workplace is changing, and what's next as the pandemic continues to evolve.

Want to learn from the new leaders of remote work? Then this livestreamed Dev Interrupted Panel is the event for you.

<Register here>

We're excited for the future and very thankful to have you on this journey with us. You can always reach me for feedback (or site bug reports!) via our developer Discord community or on our Twitter.

Thanks for everything -

Conor Bronsdon

Community & Content Lead, Dev Interrupted

______________________________________________________________________________________________________________________________________

Who says developers and dev team leads can't be the life of the party? Come see for yourself why the Dev Interrupted Discord and Podcast have emerged as the go-to community for engineering leaders. Our developer Discord has grown from 0 to 1,300 engineering leaders in just the past 8 months, and our community is attracting the attention of CTOs and VPs at companies like Netflix, GitHub, and GitLab.

We're incredibly honored that so many people are finding this community valuable, and we’re excited to keep the party rolling with our newest launch: devinterrupted.com. Whether you're a developer, a dev team lead, a CTO, or a VP of Engineering, we want you to feel welcome.

We've built the Dev Interrupted website to serve as the central hub for our growing community. Our mission is to create the most active and engaged community of engineering leaders possible - and to give our community the opportunity to contribute and learn more.

The Dev Interrupted Site Includes:

As members of our community, we also want to give you some visibility into what's on the horizon for Dev Interrupted.

On Our Roadmap:



We all understand that proper data analytics is crucial to the success of an organization. But what if your analytics could do more than help you troubleshoot current problems? Splunk is building a future where data analytics proactively solves problems before they occur.

Data is essential to success and innovation for modern organizations, yet no commercial vendor offers a single instrument that can collect data from all of an organization’s applications.

There is, however, an open source framework that can: OpenTelemetry. By providing a common format of instrumentation across all services, OpenTelemetry enables DevOps and IT groups to better understand system behavior and performance.

Last week, Splunk’s Spiros Xanthos joined us on Dev Interrupted to explain OpenTelemetry - and to understand OpenTelemetry, we first need to understand Observability. 

 

What is Observability? 

Observability is the practice of inferring the internal state of a system from its outputs; the term comes from control theory, where it describes how self-regulating systems operate. Increasingly, organizations are adding observability to distributed IT systems to understand and improve their performance, and to enable teams to answer a multitude of questions about those systems’ behavior.

Managing distributed systems is challenging because of their many interdependent parts, which multiply the number and types of potential failures. Compared with a conventional monolithic system, it is much harder to reason about the current state of a distributed system.

“It’s very, very difficult to reason about a problem when it happens. Most of the issues we’re facing are, let’s say, ‘unknown, unknowns’ because of the many, many, many, failure patterns you can encounter.” - Spiros Xanthos, from the Dev Interrupted Podcast at 3:02

Observability is well suited to handle this complexity. It allows for greater control over complex modern systems and makes their behavior easier to understand. Teams can more easily identify broken links in a complex environment and trace them back to their cause.

For example, Observability allows developers to approach system failures in a more exploratory fashion by asking questions like “Why is X broken?” or “What is causing latency right now?”

What is OpenTelemetry?

Telemetry data is the output collected from system sources in observability. This output provides a view of the relationships and dependencies within a distributed system. Often called “the three pillars of observability”, telemetry data consists of three primary classes: logs, metrics, and traces. 

A log is a text record of an event that happened at a particular time; a metric is a numeric value measured over an interval of time; and a trace represents the end-to-end journey of a request through a distributed system.

Individually, logs, metrics, and traces serve different purposes, but together they provide the comprehensive detailed insights needed to understand and troubleshoot distributed systems.
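To make the three pillars concrete, here is a small, dependency-free Python sketch of the telemetry a single request might emit. The field names and values are illustrative, not any real telemetry schema:

```python
import time
import uuid

# Toy illustration (not the OpenTelemetry API): the three telemetry
# classes emitted while serving one request, tied together by a trace id.
trace_id = uuid.uuid4().hex

# A log: a text record of an event at a point in time.
log = {"ts": time.time(), "trace_id": trace_id,
       "msg": "cache miss for key user:42"}

# A metric: a numeric value measured over an interval.
metric = {"name": "http.request.duration_ms", "value": 37.5,
          "interval_s": 60}

# A trace: the end-to-end journey of the request, as a tree of spans.
trace = {"trace_id": trace_id, "spans": [
    {"name": "GET /users/42", "duration_ms": 37.5, "parent": None},
    {"name": "db.query",      "duration_ms": 21.0, "parent": "GET /users/42"},
]}
```

Note how the shared trace id is what lets a backend stitch the log and the spans back into one picture of the request.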

OpenTelemetry is used to collect telemetry data from distributed systems in order to troubleshoot, debug, and manage applications and their host environment. In addition, it offers an easy way for IT and developer teams to instrument their code base for data collection and to make adjustments as an organization grows. For more information, Splunk has an in-depth look at OpenTelemetry.
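To show what instrumenting a code path for tracing looks like, here is a minimal, dependency-free sketch of span-based instrumentation. This is a toy stand-in kept self-contained on purpose, not the actual OpenTelemetry API (in Python, real instrumentation goes through `opentelemetry.trace.get_tracer` and `start_as_current_span`):

```python
import time
from contextlib import contextmanager

# Collected span records; a real SDK would export these to a backend.
spans = []

@contextmanager
def span(name):
    """Record the name and wall-clock duration of the wrapped operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"name": name,
                      "duration_ms": (time.perf_counter() - start) * 1000})

def handle_request():
    # Nesting spans mirrors the call structure of the request.
    with span("handle_request"):
        with span("db.query"):
            time.sleep(0.01)  # simulate a database call

handle_request()
# spans now holds one record per operation; the inner span finishes first.
```

The value of the real thing over this toy is exactly the consistency argument below: every service emits spans in one common format, regardless of vendor.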

Benefits of OpenTelemetry

“In terms of activity, it is the second most active project in CNCF (Cloud Native Computing Foundation), the foundation that essentially started with Kubernetes. So it’s only second to Kubernetes and it’s pretty much supported by every vendor in the industry. And of course, ourselves at Splunk are big supporters of the project. And we also rely on it for data collection.” -- from the Dev Interrupted Podcast at 16:47

In the two years since OpenTelemetry was announced, it has become highly successful.

On the Dev Interrupted podcast, Spiros discussed how, in his role as VP of Observability and IT Ops at Splunk, he has seen OpenTelemetry grow into an industry standard that Splunk relies on for data collection. He highlighted three key benefits of OpenTelemetry:

    1. Consistency

      Prior to the existence of OpenTelemetry, collecting telemetry data from applications was significantly harder. Selecting the right instrumentation mix was difficult, and vendors locked companies into contracts that made changes costly when they became necessary. Instrumentation was also generally inconsistent across applications, causing significant problems when trying to get a holistic understanding of an application’s performance. OpenTelemetry, by contrast, offers a consistent path to capture telemetry data and transmit it without changing instrumentation. This has created a de-facto standard for observability on cloud-native apps, enabling IT and developers to spend more time creating value with new app features instead of struggling to understand their instrumentation.

    2. Simpler Choice

      Prior to OpenTelemetry, organizations had to choose between two paths to observability: OpenTracing or OpenCensus. OpenTelemetry merges the code of these two options, giving us the best of both worlds. And because OpenTelemetry is backwards compatible with both OpenTracing and OpenCensus, switching carries minimal cost and no risk.

    3. Streamlined Observability

      With OpenTelemetry developers can view application usage and performance data from any device or web browser. Now, it’s easy and convenient to track and analyze observability data in real-time.

However, the main benefit of OpenTelemetry is having the knowledge and observability you need to achieve your business goals. By consolidating system telemetry data, we can evaluate whether systems are functioning properly and understand whether issues are compromising performance. Then it’s easy to fix the root causes of problems, often even before service is interrupted. Altogether, OpenTelemetry results in both improved reliability and increased stability for business processes.
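As a toy illustration of why consolidating telemetry matters, the snippet below joins logs to the slowest trace via a shared trace id to surface a likely root cause. All of the data and field names here are made up for the example:

```python
# Hypothetical consolidated telemetry from two requests.
traces = [
    {"trace_id": "a1", "name": "GET /checkout", "duration_ms": 2400},
    {"trace_id": "b2", "name": "GET /health",   "duration_ms": 3},
]
logs = [
    {"trace_id": "a1", "level": "ERROR", "msg": "payment gateway timeout"},
    {"trace_id": "b2", "level": "INFO",  "msg": "ok"},
]

# Find the slowest trace, then pull every error logged under it.
slowest = max(traces, key=lambda t: t["duration_ms"])
root_cause = [entry["msg"] for entry in logs
              if entry["trace_id"] == slowest["trace_id"]
              and entry["level"] == "ERROR"]
```

With logs and traces in separate, unlinked tools, this two-line join becomes a manual hunt across dashboards; the shared id is what makes root-cause analysis fast.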

Why OpenTelemetry is the Future

With increasingly complex systems spread across distributed environments, it can be difficult to manage performance. Analysis of telemetry data allows teams to bring coherence to multi-layered ecosystems. This makes it far easier to observe system behavior and address performance issues. The net result is greater efficiency in identifying and resolving incidents, better service reliability, and reduced downtime.

OpenTelemetry is the key to getting a handle on your telemetry, allowing the comprehensive visibility you need to improve your observability practices. It provides tools to collect data from across your technology stack, without getting bogged down in tool-specific deliberations. Ultimately, it helps facilitate the healthy performance of your applications and vastly improves business outcomes.

Listen here if you want a deeper dive into OpenTelemetry and Observability - and how Splunk leverages them.
