The typical model of business needs rethinking. Traditionally, businesses run in a rather industrial structure, almost militaristic. There are layers upon layers of management, with large gaps between the people who do the work and those who control the strategy. While this can work well in certain sectors, like manufacturing, it’s not ideal for a more innovative company.

So we talked to Bob Ritchie, VP of Software at SAIC, about an alternative way to structure business: the team-of-teams model. In this model, the leadership of the company creates smaller teams that manage themselves. And instead of presenting specific targets, the leadership gives each team a problem to solve. That can range from managing customer service to making a new product.

“A top heavy and top-down micro-management ecosystem is just not what resonates today with knowledge work and thought work that an art form like software development is,” Bob says. “So the team of teams model presents a different concept. Instead of having this hierarchical command and control, the leadership strategy pivots to creating an environment where there’s a shared vision and a shared mission.” -On the Dev Interrupted Podcast at 5:10

With more autonomy, teams are happier, more productive and work much more efficiently. But what do companies need to do to switch to this model?

Give autonomy through a shared vision

The first step is to make sure that the leadership team has a clear vision. What are you trying to achieve? This needs to be simple and summarize the ultimate aim of the company. Once you have that vision, everything else can begin to fall into place. You can allow teams to find their own way to an answer, which might be a solution you never would’ve dreamt of. Just make sure to give each team a set budget.

“Teams are granted a level of autonomy that then lets them define and discover their own purpose in where they fit in that vision,” Bob says. “Oftentimes it then provides invaluable feedback on how that vision needs to be altered based on what they’re seeing as opposed to that historical: I’m-just-being-told model.” -On the Dev Interrupted Podcast at 5:47

This autonomy is key to the team-of-teams model. When you give creative and innovative people freedom to explore a problem, they’re much more likely to find a novel approach.

Give problems, not tasks

When you’ve brought together bright minds and talent, there’s no need to set specific tasks. You simply give the team a goal: a problem to solve. With small teams, they can easily organize themselves and make sure that they’re working productively. They might not solve it how you originally intended, but it’ll get solved.

“The Team of Teams model gives you that flexibility and I’m not telling you what to do, I’m giving you a problem to solve,” Bob says. “When it comes to execution in a dynamic landscape, Team of Teams is almost always better.”

Sure, in some situations like the medical world, there’s a definite correct answer. Things must happen in a set way. But Bob adds: 

“In the software world, I can’t think of a case where anyone knows the right answer … To say definitively: Build me exactly this in exactly this time and this will be your guaranteed result.” -On the Dev Interrupted Podcast at 19:18

Keep only four levels of hierarchy

But if you’re only going to give people objectives, and not set tasks, you need to make sure that individual employees are never more than four steps away from the CEO. Too many layers in between the worker and the CEO causes problems. So if you start to get too many levels, it’s time to start breaking your teams down into smaller groups.

“There has to be that cohesion of vision and purpose, and as you add layers between the individual contributors on the team to that CEO’s vision, you start to dilute the messaging,” Bob says. “So when I say: ‘there’s a problem, go solve it.’ They have a frame of mind and you know what our organization is striving towards … It really prevents that communication breakdown.” -On the Dev Interrupted Podcast at 15:39

Invest in your teams

Once you have your teams set up and can trust them to get on with a task, it’s time to start investing in them. Train them up. Help them grow as individuals and workers. Do that, and the whole team will improve.

“The foundational responsibility of leaders is to create an environment where your teams can thrive,” Bob says. “So I think continual learning is such an important dimension … If I don’t have the opportunity at work to find some level of mastery in a craft, I’m going to seek an opportunity where I can go get that.”

This is another reason why the old model doesn’t work. It makes people cogs in the machine, who don’t get those opportunities to master their craft and feel fulfilled.

“If you’re not, as a leader, investing in those teams to stay as sharp as possible, you’re doing a disservice to your teams. Eventually, your team skill sets are going to erode,” Bob says. “Carve out time for your folks to not only have access to content, but actually immerse in it.” -On the Dev Interrupted Podcast at 20:50

Let teams self-police

When teams are set up correctly, and have a good mix of skills, they’ll choose their own leaders. Perhaps through a vote. They’ll also often decide among themselves whether someone needs more training or needs to leave the team for good.

“The team self-polices to some degree. So if something gets escalated, it’s only in the cases where the team hasn’t been able to self-adjudicate,” Bob explains on the Dev Interrupted Podcast at 8:44.

Electing a leader also helps when someone wants to step back from that role for a time or give someone else a chance to prove themselves. All these things are easier in the team-of-teams model.

Stop looking for the perfect person

Another advantage of this model is that you don’t need to be looking for someone with all the skills. It’s often much easier to find an individual that slots neatly into a team, or five people that form a new team, than to find that one perfect person.

“Maybe it’s not the perfect person, but it’s a perfect fit on this team because of personalities and principles and values,” Bob says. “Even if they don’t become that perfect person that I was looking for, they’re still going to be a valuable contributor to that team.” -On the Dev Interrupted Podcast at 32:31

It also makes it much easier to consider people who might need a little training but whom you can develop into much stronger candidates. This opens up the pool of talent you have available to you.

Hear the full talk

This advice only scratches the surface of how organizations can make their business more efficient and productive. You can find out much more about the team-of-teams model and how it applies to business by listening to our podcast.

Flow can mean many things but when it comes to workflow it usually refers to that feeling, discussed by Mihaly Csikszentmihalyi, when you enter a state of intense focus and lose yourself in an activity. 

Video games are a great example. They take advantage of this feeling to keep you immersed, which is why it’s so easy for gamers to “lose time” and just get wrapped up. The same feeling usually drives your most productive and best work.

When you manage developers, their workflow should be treasured and valued. That’s why, to improve developer focus, it’s vital to avoid weighing them down with minor interruptions or non-urgent pings. 

“Flow is characterized as this experience where the task that you're doing is perfectly matched to the skills that you have.” -Katie Wilde on the Dev Interrupted Podcast at 7:51

1. Acknowledge that it takes 23 minutes for devs just to get into flow

Did you know that it takes 23 minutes to get into a flow state? For some people it takes even longer. That means that for every question, disruption, email, and interruption that you or your coworkers are subjected to, it could be half an hour of productivity down the drain. We talked to Katie Wilde, VP of Engineering at Ambassador Labs, about how she manages workflow.

“Say you got a Slack ping, and you're like, ‘oh, I'll just ask a question.’ How long does it take you to find the thread again? What's that total interrupt time? It's 23 minutes…that's been measured.” -on the Dev Interrupted Podcast at 11:11

2. Defrag dev calendars

Some interruptions are unavoidable but many of them aren’t. Planning your calendar in a way that works around the needs and workflows of your team is necessary to maximize everyone's productivity. 

For instance, scheduling meetings on days when weekly meetings already occur can help preserve focus time by not disrupting other working days. 

Devs need to communicate with their managers on what times they have available away from normal workflow and then it’s up to engineering leaders to plan around those schedules. As a dev leader, you have to look at your devs’ calendars, not your own, and react accordingly. 

“If you're a manager, when you're scheduling, don't look at your calendar, and then find a time and then see where you can slot the engineer in…look at the engineer's calendar and see, where can you tack the meeting on that it is after another meeting, or it is maybe at the start of the day, the end of the day… and ask them!” -Katie Wilde on the Dev Interrupted Podcast at 12:31
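Katie's scheduling rule can be sketched in a few lines of Python. This is a minimal illustration, not a real calendar integration: the `Meeting` type, the `adjacent_slots` helper, and the 9:00 to 17:00 workday are all assumptions made for the example. It proposes slots that sit right after an existing meeting or at the edges of the day, so the engineer's longer free blocks stay in one piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meeting:
    start: int  # minutes since midnight
    end: int


def adjacent_slots(meetings, duration, day_start=9 * 60, day_end=17 * 60):
    """Return candidate (start, end) slots that avoid splitting a focus block.

    Candidates are the start of the day, the end of the day, and the time
    immediately after each existing meeting. A candidate is dropped if it
    overlaps an existing meeting or falls outside the working day.
    """
    starts = {day_start, day_end - duration}
    starts.update(m.end for m in meetings)

    slots = []
    for start in sorted(starts):
        end = start + duration
        if start < day_start or end > day_end:
            continue
        overlaps = any(start < m.end and m.start < end for m in meetings)
        if not overlaps:
            slots.append((start, end))
    return slots


def fmt(minutes):
    """Format minutes since midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


# Hypothetical engineer calendar: stand-up 10:00-10:30, planning 14:00-15:00.
calendar = [Meeting(10 * 60, 10 * 60 + 30), Meeting(14 * 60, 15 * 60)]
for start, end in adjacent_slots(calendar, duration=30):
    print(f"{fmt(start)}-{fmt(end)}")
```

The helper only produces candidates. As Katie says, the last step is still to ask the engineer which one works.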

3. Suck it up - schedule your work around focus time

When managing large numbers of devs, it can seem like a chore to work around many different schedules or to fit meetings into specific days only. We asked Katie what her trick was for juggling so many different calendars and meetings, and she had one thing to say: “Suck it up.”

Devs are the backbone of software production and it’s important to prioritize their productivity whenever possible. To help them stay on task and be able to really focus on their work, they need to have meetings planned around their day - not yours.

Providing consistency for your devs - meeting them when they are ready, available, and focused - helps them maintain a flow state and maximize productivity. But more than that, it’s the right thing to do. Devs want to build cool stuff, not have their days ruined by their own calendars.   

Katie says it best:

“That might mean that, as the manager, you have a little bit weirder hours. I hate to say this, but kind of suck it up… There's no way to get around that.” -on the Dev Interrupted Podcast at 13:23

Watch the full interview

If you would like to hear more about how managers can work around a developer's schedule, along with other great insights from Katie Wilde, check out the full podcast on your favorite podcasting application: Apple Podcasts, Spotify, Stitcher, or YouTube.

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is inspired by Dev Interrupted - the go-to podcast for engineering leaders.

Dev Interrupted features expert guests from around the world to explore strategy and day-to-day topics ranging from dev team metrics to accelerating delivery. With new guests every week from Google to small startups, the Dev Interrupted Podcast is a fresh look at the world of software engineering and engineering management.

Listen and subscribe on your streaming service of choice today.

Discover Our Most Popular Podcasts
Join the Dev Interrupted discord

In a typical manufacturing company, a supply chain is the chain of companies that you rely on to make your product. For example, a mobile phone manufacturer buys processor chips from a supplier. That supplier needs to buy a part from another manufacturer. And that manufacturer relies on yet another company for the raw metal.

But what is the software supply chain? And how do you keep it secure? We spoke with Kim Lewandowski, co-founder and head of product at Chainguard, to explain the details.

Your software supply chain is more complex than you think

The software supply chain can be complicated, mainly because it’s difficult to know how far it reaches. Take a simple example: If you use Salesforce to keep track of your customers, you store your customers’ data on Salesforce’s servers. Not a problem, surely? But Salesforce could have a breach. And what about the servers themselves? Those servers might run on Windows. If that has a security bug, hackers have another way in. How about the software that Salesforce uses to host its website? If that is hacked, you have yet another breach.

“When I think of the software supply chain, it’s all the code and all the mechanics and the processes that went into delivering that core piece of software at the end,” Kim explained. “It’s all the bits and pieces that go into making these things.” -On the Dev Interrupted Podcast at 11:28

Keeping the software supply chain secure involves checking who has keys

The important part of keeping your supply chain secure is tracking down everything you’re using and checking that each piece is secure and reliable. Every new third party can be a potential problem. If you don’t do your due diligence, you won’t know what risks you’re taking.

As Kim explained, a favorite analogy of hers is thinking about doing construction work on your own home.

“You have a contractor. Well, they need keys. They have subcontractors. You give the keys out to all their subcontractors. Who are they? Where are they from? What materials are they bringing into your house?” -On the Dev Interrupted Podcast at 12:09

The more third party tools you use, the more out of control it can become

It all comes down to accountability. Your list of dependencies can spread rapidly. One third-party tool that you use to create your software might rely on five separate third parties. And you don’t know what code they’ve got hidden under the hood. Your keys are suddenly all over the place.

The only way to keep it under control is to remind yourself to check and to do regular audits of the services you use. Kim believes it’s helpful to think of every new tool as a package coming to your home.

“How is your package getting to your house?” Kim said. “What truck is it riding on and who is driving those trucks?” -On the Dev Interrupted Podcast at 12:44
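One way to make the question of who has keys concrete is to walk your dependency graph and list everything you pull in, directly or indirectly. The sketch below is illustrative only: the `DEPENDENCIES` mapping and every package name in it are made up, and a real audit would read this data from your package manager's lockfile or a software bill of materials rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical dependency graph: each tool maps to the third parties it relies on.
DEPENDENCIES = {
    "my-app": ["web-framework", "payments-sdk"],
    "web-framework": ["http-parser", "template-engine"],
    "payments-sdk": ["http-parser", "crypto-lib"],
    "http-parser": [],
    "template-engine": ["string-utils"],
    "crypto-lib": [],
    "string-utils": [],
}


def transitive_dependencies(root, graph):
    """Breadth-first walk returning {dependency: depth from root}."""
    depths = {}
    queue = deque((dep, 1) for dep in graph.get(root, []))
    while queue:
        name, depth = queue.popleft()
        if name in depths:
            continue  # already reached through a path that is no longer
        depths[name] = depth
        queue.extend((dep, depth + 1) for dep in graph.get(name, []))
    return depths


found = transitive_dependencies("my-app", DEPENDENCIES)
direct = [name for name, depth in found.items() if depth == 1]
indirect = [name for name, depth in found.items() if depth > 1]
print(f"{len(direct)} direct, {len(indirect)} indirect dependencies")
for name, depth in sorted(found.items(), key=lambda item: (item[1], item[0])):
    print(f"depth {depth}: {name}")
```

Even in this toy graph, two direct choices bring in four more parties that nobody explicitly picked, which is why a regular audit matters.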

Get the full conversation

If you’d like to learn more about the software supply chain, and how to make sure that yours is secure, you can listen to the full conversation with Kim over on our podcast.

Over the last ten years, technology has become more sophisticated. Faster. Smaller. More powerful. But it isn’t just our technology that’s evolving at a rapid pace. Our culture, attitudes and politics are all changing, too.

So what could the next ten years look like? How might businesses change to keep up with technology? We spoke with Jason Warner, managing director at Redpoint Ventures, to get his thoughts on the matter.

“Ten years is an interestingly long, but also short time horizon,” Jason explained. “It’s likely we’ll see a complete company cycle, maybe two macroeconomic cycles.” -On the Dev Interrupted Podcast at 10:29

1. Organizations will invest more in compliance and security

There have been a lot of large changes in recent years. People are working from home. Political tensions are high. And almost every device collects data about us. In all these cases, security is important. Securing our businesses, our national secrets, and our private lives.

It all leads to an inevitable conclusion. Jason believes that chief compliance officers will become commonplace, even in small companies. Protecting data is going to become a primary concern, for governments, businesses and people. Because, as the world gets more digital, we’re going to see more and more cyber attacks.

“Trends that I see happening are an increased awareness and investment in things like compliance and security. I think that if companies don’t have a chief compliance officer now, they likely will in the future,” Jason said. “I think it’s interesting when you see the geopolitical environment of how we might have to invest in more sophisticated tooling for national security. But more than that, it’s like understanding that we’re no longer a single micro-geo unit called the United States.” - On the Dev Interrupted Podcast at 11:03

2. Companies will focus on loyalty and subscriptions over one-off sales

The standard business model is outdated. In the past, technology companies sold software: they gave customers the software, and that was the end of the transaction. But now, it’s more about building communities and regular interaction with your customers. It’s about subscriptions, regular payments or even donation models, seen on popular platforms like Twitch. Software isn’t a product any more. It’s a service.

But almost every company these days is a technology company. Just look at what’s happened to the taxi industry. The model has completely changed, simply because the technology has evolved. The old model won’t completely disappear, but we’ll see more and more industries move into a subscription model as new technology takes over.

“Selling is about adoption first and selling second. Someone’s got to reach for you first,” Jason explained. “Then, they’re going to find a value problem, then they’re going to want to give you money if they’re finding utility out of you.” -On the Dev Interrupted Podcast at 11:21

3. Hardware is, and always will be, just as important as software

With every new innovation, we place more demands on the hardware we’re using. The more advanced our software becomes, the more powerful our hardware must be. But right now, most companies rely on international trade to build key components. With tensions rising, it’s likely that we’ll see companies begin to bring these resources closer to home, securing their supply chain in the process.

“There’s interestingly a lot more emphasis on investing in hardware again,” Jason said. “And America in particular owning its hardware manufacturing, which I think is obviously good.” -On the Dev Interrupted Podcast at 11:41

Watch the full interview

If you’re interested in what else Jason had to say about the next ten years, and what challenges society faces, you can watch the full podcast on our site.

At Netflix, we don’t just think about productivity - we engineer it. There’s an entire team within Netflix dedicated to productivity. I lead the Develop Domain along with my Delivery and Observability Domain peers, and together, we make up Productivity Engineering.

I recently sat down with the Dev Interrupted podcast to discuss all things productivity, how I run my team, and how other managers should view employee success. Here’s how we think about it at Netflix:

Can productivity be engineered?

In short, yes! Productivity is not a generic term for team performance or a perfunctory buzzword used during team meetings. The productivity team is an actual organization. The work we do is foundational to Netflix’s development teams. Productivity Engineering lives within the broader, central Platform organization.

The role of the Productivity Engineering team is simple: we exist to make the lives of Netflix developers easier. Abstracting away the various “Netflix-isms” around development, delivery, and observability, productivity allows devs more time to focus on their domain of expertise. 

“We are sort of like the nerds’ nerds, if you will, enabling them to use our platforms and tools so that the work that they're doing is focused on studio and streaming, without thinking about everything that's under the hood.” - On the Dev Interrupted Podcast at 2:31

With the recent addition of Gaming to the list of Netflix’s pursuits, the resulting focus becomes even more important.

Practically speaking, it’s the role of Productivity Engineering to help with things like coding, testing, debugging, dependency management, deployment, alerting, monitoring, performance, incident response, to name a bunch. Netflix utilizes the concept of a “paved road,” the frameworks, platforms, apps, and tools we build and support to keep our devs rolling. The idea is to keep workflows streamlined and enable developers to operate as efficiently and effectively as possible. If the road ahead is cleared of obstacles, you’re going to get to where you need to go faster and with support along the way. 

It’s also about helping developers enjoy the ride. To abuse another metaphor, a sound engineering experience should be like dining at a fine restaurant. If done right, you rarely remember the waitstaff, have a hard time finding something you like, or worry about how they prepared the food; you simply enjoy the experience. If Productivity Engineering is doing their job, they act as the restaurant and waitstaff with developers as the customer, providing nothing short of a beautiful end-to-end experience. 

Measuring Outcomes vs. Output

Measuring all of that productivity can be hard, and there’s no one unicorn measurement to rule them all. Hence, developer productivity teams should focus on impact and outcomes. Above all, Netflix focuses on customer satisfaction. Our philosophy is that while how something is delivered is important, the impact of what’s delivered is ultimately of greater importance. 

"If you're running around a track super-fast, but you're on the wrong track, does it matter? So really, what are you delivering? How you're delivering is important. But if that thing that you're delivering is ultimately doing what you want it to do, that's the most important thing." - On the Dev Interrupted Podcast at 5:05

In this model, the outcome always wins over output or activity. For instance, standard productivity deployment metrics (DORA) as applied to our customers become an important proxy for measuring our success. Key Performance Indicators (KPIs) for productivity are viewed as a reflection of a team’s performance as it relates to customer satisfaction.
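For readers unfamiliar with them, the four DORA metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service. The sketch below shows one way they might be computed from deployment records. It is not a description of how Netflix measures them: the `Deployment` record, its field names, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta):
    """Convert a timedelta to hours as a float."""
    return delta.total_seconds() / 3600


def dora_summary(deployments, period_days):
    """Compute the four DORA metrics for deployments made in one period."""
    failures = [d for d in deployments if d.failed]
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    restore_times = [
        hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_restore_hours": median(restore_times) if restore_times else None,
    }


# Made-up sample week for a single customer team.
week = [
    Deployment(datetime(2022, 3, 1, 9), datetime(2022, 3, 1, 13)),
    Deployment(
        datetime(2022, 3, 2, 10),
        datetime(2022, 3, 3, 10),
        failed=True,
        restored_at=datetime(2022, 3, 3, 11, 30),
    ),
    Deployment(datetime(2022, 3, 4, 9), datetime(2022, 3, 4, 11)),
    Deployment(datetime(2022, 3, 5, 8), datetime(2022, 3, 5, 16)),
]

summary = dora_summary(week, period_days=7)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Numbers like these describe output and activity. They are useful as a proxy, but they still have to be read against the outcome the customer team was trying to achieve.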

I’m a big fan of the SPACE framework, developed by Nicole Forsgren, for precisely this reason. How are our customers doing in terms of Satisfaction, Performance, Activity, Communication, and Efficiency? The answer to those questions reflects how we’re doing as a Productivity organization.

"This is our strategy, these are our hypotheses around, how we're going to improve our customers' productivity. Are those things paying off? And if you can't measure them in some way, who knows? Right? So yeah, we're getting a little more hardcore about this." - On the Dev Interrupted Podcast at 24:17

Key metrics provide productivity teams with a holistic view of performance by establishing benchmarks. Understanding that everything needs to be viewed within the proper context, it’s difficult to improve as an organization if nothing is measured or tracked. 

Comparing Productivity 

Comparing developers’ productivity across teams is a thorny subject at best and downright dangerous for team morale at worst. As the old saying goes, “Comparison is the thief of joy,” or as I typically say, “comparisons lead to unhappiness,” or with my kids, “eyes on your own paper!”

The productivity teams at Netflix take a contextualized view of dev teams rather than relying solely on raw data. Every project is different, the customer base is different, the use case is different, personas are different, and where a team is within the software development life cycle is different.

It’s a basic understanding that comparing apples to oranges is not good math. A team that is just starting out and building something new is going to look very different from a team with a mature product. By recognizing this, it becomes almost impossible to rank teams against each other because very rarely, if ever, will teams be doing the same thing, in the same space, the same way, with the same people.

Even a measurement of an outcome pertaining to customer satisfaction (CSAT) is not straightforward. At Netflix and across the industry, we’ve found that satisfaction for internal teams skews lower than satisfaction for customer-facing teams.

The reason? Teams within Netflix are their own harshest critics. When attempting to gauge the performance of an internal team vs a customer-facing team, it’s understood that the internal team is almost always going to score lower on satisfaction, even if both teams are equally effective. 

Context is everything. Measuring productivity means being mindful of context. 

Pushing Productivity 

Any company that wants to be successful must understand how to measure its success. Productivity doesn’t count for much if an organization is not moving towards desired outcomes. 

By viewing productivity as more than just a concept or a raw set of data, the hard-working teams at Netflix have turned productivity into an actual apparatus. It is a living, breathing team of human beings whose devotion to empathetic efficiency improves customer satisfaction and dev team quality of life. I am incredibly proud to lead these teams, and I sincerely hope the work we do inspires other organizations to improve their developers’ experience.

And if you want to be as productive as Netflix, remember that metrics are only as good as their context! 


If you enjoyed this article and you would like to learn more about the work that I do at Netflix, I invite you to come join me at INTERACT on April 7th.

This will be the second time that I have sat down for a panel discussion hosted by Dev Interrupted. I love being a member of the Dev Interrupted community because they are such an amazing resource. If you are a team lead, engineering manager, VP or CTO looking to improve your team, come to INTERACT and check out the community - I promise you will learn something.

Pretend you are watching your favorite show on Netflix: Sit back, relax & watch as I share the stage with other amazing engineering leaders from places like Slack, Stack Overflow, American Express, Outsystems, Drata & many more.

Chaos Engineering might sound like a buzzword, but take it from someone who used to joke that his job title was Chief Chaos Engineer (more on that later): it is much more than buzz or a passing fad. It’s a practice.

The world can be a scary place, and more and more companies are turning to Chaos Engineering to proactively poke and prod their systems. In doing so, they are improving their reliability and guarding against unexpected failures in production and unplanned downtime.

During my career I dealt with my fair share of outages, including one that caught me mid-song during a bout of karaoke and far too many that woke me up at 02:00. As the co-founder and CTO at Gremlin, I do my best to make sure no other engineers have to suffer sleepless nights worrying about their product. 

But the question remains, what is Chaos Engineering and where did it come from?

A Short History

The spiritual predecessor to Chaos Engineering is often called by a much more widely recognized name: disaster recovery. The focus when this practice was introduced was much the same as it is today: proactively suss out production problems by injecting failure.

Netflix’s Chaos Monkey is probably the best-publicized Chaos Engineering tool, as it arguably kickstarted the adoption of Chaos Engineering outside of large companies, but this has led to the erroneous belief that Netflix invented the practice. In fact, the practice was already widely in use amongst the titans of technology.

Over a decade ago, during my time as a Lead Software Engineer at Amazon, we implemented several crude practices designed to inject failure into our systems. The most rudimentary of these was employed by a man called Jesse Robbins, who earned the nickname “Master of Disaster” by running through data centers pulling out cables.

Let’s just say the practice has evolved a lot since those early days and your data center cables are much safer these days.

What is Chaos Engineering?

“What Chaos Engineering really is, is the art, if you want to call it that, of introducing controlled chaos.” - 2:16 on the Dev Interrupted podcast

At its core, Chaos Engineering is a disciplined approach of identifying potential failures before they have an opportunity to become customer facing outages. 

It is a practice that lets you safely test your assumptions about how your systems will behave under duress by actually exercising resilience mechanisms in a controlled fashion. You literally "break things on purpose" to validate and build resiliency. The end goal of Chaos Engineering is not to inject arbitrary failure into a system, but rather to strategically inject turbulence to enhance the stability and resiliency of your systems.
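To make that concrete, here is a toy experiment in Python. It is a sketch only, not Gremlin's product or API: `InventoryService`, `Storefront`, and `injected_outage` are hypothetical names invented for this example. The hypothesis is that the storefront keeps answering from its cache while its dependency is down, and the context manager guarantees the fault is reverted however the experiment ends.

```python
from contextlib import contextmanager


class InventoryService:
    """Stand-in for a real dependency; `healthy` is the fault-injection switch."""

    def __init__(self):
        self.healthy = True

    def stock_level(self, item):
        if not self.healthy:
            raise ConnectionError("inventory service unavailable")
        return 42


class Storefront:
    """System under test: should fall back to the last known value on failure."""

    def __init__(self, inventory):
        self.inventory = inventory
        self.cache = {}

    def stock_level(self, item):
        try:
            self.cache[item] = self.inventory.stock_level(item)
        except ConnectionError:
            pass  # resilience mechanism: serve the cached value instead
        return self.cache.get(item)


@contextmanager
def injected_outage(service):
    """Break the service on purpose, and always restore it afterwards."""
    service.healthy = False
    try:
        yield
    finally:
        service.healthy = True


inventory = InventoryService()
store = Storefront(inventory)

baseline = store.stock_level("widget")           # steady state, warms the cache
with injected_outage(inventory):
    during_outage = store.stock_level("widget")  # hypothesis: still answers
after = store.stock_level("widget")

print("hypothesis held:", during_outage == baseline)
```

If the fallback were missing, the experiment would surface the failure here, in a controlled setting, rather than in front of customers.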

How Chaotic is Chaos Engineering?

I always tell people that Chaos Engineering is a bit of a misnomer because it’s actually as far from chaotic as you can get. When performed correctly, everything is under the control of the operator. That mentality is the reason our core product principles at Gremlin are: safety, simplicity and security. True chaos can be daunting and can cause harm. But controlled chaos fosters confidence in the resilience of systems and allows for operators to sleep a little easier knowing they’ve tested their assumptions. After all, the laws of entropy guarantee the world will consistently keep throwing randomness at you and your systems. You shouldn’t have to help with that.

How do I Start?

One of the most common questions I receive is: “I want to get started with Chaos Engineering, where do I begin?” Unfortunately, there is no one-size-fits-all answer. You could start by validating your observability tooling, ensuring auto-scaling works, testing failover conditions, or one of a myriad of other use cases. The one thing that does apply across all of these use cases is this: start slow, but do not be slow to start.

What I mean by this is to start testing across just a few nodes versus impacting your entire fleet. We refer to the impacted area as the “blast radius” and we highly recommend starting with a small blast radius (the number of systems impacted) and increasing it over time.

By starting small you allow yourself to gain confidence in both the experiments you are running and your systems. Of course your risk tolerance is also a factor of how large a blast radius your organization will use. 

For instance, a large banking institution with millions of customers has a much lower risk tolerance than a tech startup with a couple hundred customers. A bank like that would want to run experiments in a programmatic way and would need to be very explicit about communicating to the rest of the organization which tests are going to be run and when, to avoid any unplanned 2am or 3am disasters.

Eventually you want to get to the point where all of this is automated, a process we refer to as “continuous chaos.” Starting small with automation could be something as simple as taking out a single node; then taking out five nodes; then ten; and so on. Eventually you automate the process at a level you are comfortable with.  
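As a rough illustration of growing the blast radius step by step, the sketch below walks through the one, five, ten progression and halts as soon as a health check fails. The `inject_failure` and `is_healthy` hooks are hypothetical placeholders for whatever tooling you use. They are assumptions made for this example, not Gremlin's API.

```python
import random
from typing import Callable, List, Sequence


def run_staged_experiment(
    nodes: Sequence[str],
    stages: Sequence[int],
    inject_failure: Callable[[List[str]], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Grow the blast radius stage by stage; return the largest size that passed.

    `inject_failure` and `is_healthy` are hypothetical hooks supplied by the caller.
    """
    largest_passed = 0
    for size in stages:
        if size > len(nodes):
            break  # never target more nodes than exist
        targets = random.sample(list(nodes), size)
        inject_failure(targets)
        if not is_healthy():
            break  # halt and investigate before widening the blast radius
        largest_passed = size
    return largest_passed


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(20)]
    failed: List[str] = []

    def fake_inject(targets: List[str]) -> None:
        # Stand-in for a real fault injection: just record the targets.
        failed.clear()
        failed.extend(targets)

    # Pretend the system copes with up to five failed nodes at once.
    passed = run_staged_experiment(
        fleet, [1, 5, 10], fake_inject, lambda: len(failed) <= 5
    )
    print(f"largest blast radius that passed: {passed}")
```

Stopping at the first failed stage keeps the experiment within the comfort level described above: you only widen the blast radius after the smaller one has passed.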

“Ultimately you want to be able to handle any of this random chaos being thrown at you, because that's what the world is, it's entropy, it's degradation” - 7:35 on the Dev Interrupted podcast

No Tolerance for Downtime

When I founded Gremlin, it was just myself and my co-founder developing the first iteration of the product. The business looked very different then and I jokingly referred to myself as the “Chief Chaos Engineer” responsible for implementing code that was mostly used by enterprise companies. Many of these companies came to us because they had reliance thrust upon them by the US government or they had top-down reliability standards and they wanted a tool to help them shore up their systems. 

As the company began to evolve, so did the customer base. These days it’s not just Fortune 500 companies that care about reliability, it’s everybody. Planned downtime is a relic of days gone by. It is no longer acceptable to espouse planned maintenance windows as part of development lifecycles, and customers don’t have the patience for products they rely upon to spend any time unavailable. Companies recognize this dynamic - and it’s a hard one to miss.

Seemingly, our appetite for technology has gone up exponentially while our ability to stomach downtime has drastically decreased. Customers expect that your product is always working, always running. If your product is down because of outages, then there are ten other similar products waiting in the wings to take your customers’ money.

Making Lives Better

Visibility is high these days and companies don’t need the publicity that comes with making any unforced errors, let alone to be subject to errors not of their making. No one wants to be blown up on Twitter because their product isn’t working or because one of their downstream dependencies or their cloud provider had an unexpected outage. 

By preparing for the worst, we can be at our best as an industry and can be prepared when disaster eventually comes knocking. The goal is that when an unexpected outage occurs or there is a production failure, customers never even know it happened.

I often joke that we are the engineers’ engineers because many of us know that feeling of being jolted from a dream at 03:00 by our pagers, groggily wiping our eyes and whipping out the laptop to go dig through a sea of monitoring dashboards and logs. It’s not fun and it’s exactly why I founded Gremlin. Because there is a better way to approach operations than merely sitting back on our haunches and waiting for the next outage. Chaos Engineering not only helps to protect against the randomness of the world, but also teaches people how to build more reliable software. And if enough people build more reliable software, we build a more reliable internet.

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is inspired by Dev Interrupted - the go-to podcast for engineering leaders.

Dev Interrupted features expert guests from around the world to explore strategy and day-to-day topics ranging from dev team metrics to accelerating delivery. With new guests every week from Google to small startups, the Dev Interrupted Podcast is a fresh look at the world of software engineering and engineering management.

Listen and subscribe on your streaming service of choice today.

A good SRE will tell you your service is never down. A great SRE will tell you that’s not what you should be measuring. In fact, they’ll tell you their job is customer service.

Site Reliability Engineering (SRE) has grown immensely popular, with many of the world’s largest tech companies, like Netflix, LinkedIn and Airbnb, employing SRE teams to keep their systems reliable and scalable.

Along the way, SRE has become one of the most sought-after engineering roles in tech.

The role is traditionally understood as ensuring that services are reliable and unbroken, but reliability and uptime aren’t perfect metrics. Perhaps what organizations should be asking themselves is what their customers think of their service. 

Wandering down to your engineering department and asking your SRE team about customer satisfaction is a good place to start. 

Their answer just might surprise you. 

History of SRE

In practice, Site Reliability Engineering has been around for a while. In the past its functions were covered by roles that had names like production ops, disaster recovery, testing or monitoring. The rise of cloud computing created a need for more engineers in production. The complexity only grew as more organizations transitioned from monolithic infrastructures to distributed microservices.

Modern Site Reliability Engineering originated at Google in 2003 with the work of Benjamin Treynor, who is seen as the “father” of what we now simply call SRE. Treynor, who coined the term, was a software engineer placed in charge of running a production team. With the goal of making Google’s website as reliable and serviceable as possible, he asked that his team spend half their time on operations tasks so they could better understand software in production. This team would become the first-ever SRE team.

“Ben Treynor said, I'm paraphrasing, ‘[SRE] is essentially like throwing a software engineer at an operations problem’, right? Because you come from that developer mindset, that design and, you know, you think about all of these things. So think about it as a developer but apply it to an operational type of problem.” - Brian Murphy on the Dev Interrupted podcast at 4:26

Why not uptime?

So why shouldn't you be too concerned about your uptime metrics? In reality SRE can mean different things to different teams but at its core, it’s about making sure your service is reliable. After all, it’s right there in the name. 

Because of this many people assume that uptime is the most valuable metric for SRE teams. That is flawed logic. 

For instance, an app can be “up” but if it’s incredibly slow or its users don’t find it to be practically useful, then the app might as well be down. Simply keeping the lights on isn’t good enough and uptime alone doesn’t take into account things like degradation or if your site’s pages aren’t loading. 

It may sound counterintuitive, but SRE teams are in the customer service business. Customer happiness is the most important metric to pay attention to. If your service is running well and your customers are happy, then your SRE team is doing a good job. If your service is up and your customers aren’t happy, then your SRE team needs to reevaluate.

A more holistic approach is to view your service in terms of health. 

The Four Golden Signals

As defined by Google, the four golden signals of SRE are latency, traffic, errors and saturation. If these can be managed effectively, then you probably have a healthy system.

Establishing system health

“The best way to get started is just measuring stuff, you know, just getting the baseline of what's healthy, what's not healthy, what looks like health, and then you can start working from there.” - Brian Murphy on the Dev Interrupted podcast at 10:49

It can be difficult to know whether or not your organization should consider forming an SRE team, or what your next steps are if you’ve already made the decision. 

Again, think of your decision in terms of a holistic approach, not just your uptime. If you have high uptime, that’s fantastic, but what you should be establishing is a benchmark. 

Using the four golden signals to guide you, establish what you think a healthy system should look like and set your benchmark. Keep measuring over time and you will begin to see the areas that are good or require more work. 
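To make the idea of a benchmark concrete, here is a minimal sketch that summarizes one measurement window as Google's four golden signals (latency, traffic, errors and saturation) and flags any signal that has drifted above a stored baseline. The field names, the 95th-percentile choice, the use of CPU utilization for saturation and the 20% tolerance are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(
    requests: List[Request], window_s: float, cpu_utilization: float
) -> Dict[str, float]:
    """Summarize one measurement window as the four golden signals."""
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "saturation": cpu_utilization,  # assumed proxy for saturation
    }


def drift_from_baseline(
    current: Dict[str, float], baseline: Dict[str, float], tolerance: float = 0.20
) -> List[str]:
    """Return the signals that exceed their baseline by more than `tolerance`."""
    return [
        name
        for name, value in current.items()
        if value > baseline[name] * (1 + tolerance)
    ]
```

Recording the output of `golden_signals` during a period you consider healthy gives you the baseline. Running `drift_from_baseline` on later windows then points at the areas that need more work.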

These measures will help inform all of your future decisions. Perhaps your organization is ready to roll out new features or make choices around expanding your service. 

Critically, the health you establish provides insights into customer happiness. If things look good you probably have happy customers. 

Internal customers

When done right, SREs aren’t just making customers happy, they’re making the lives of developers easier too. Nothing is worse than having to stop because there’s a problem in production. Good SRE teams can shield dev teams by focusing on major hotspots.

If the fires are being managed before they are out of control, it allows developers to keep pushing out features. It even gives them the freedom to keep breaking things, if necessary!

When things do break, or require a slowdown, a dialogue can occur. A good SRE understands that the developer who wrote a piece of code understands it better than anyone. The model for good internal customer service is an SRE who brings in a developer, gives them ownership of the code they created, and offers to help them fix it.

Happy customers are the best customers

Whether you already have an SRE team or are thinking about forming one, remember to think beyond the engineering - think about the customer. 

Ask yourself if your customers are happy and if you would describe your service as healthy. Remember to think about your own teams as well. Your developers will thank you for it.

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is based on an episode of Dev Interrupted - the go-to podcast for engineering leaders.


 

Continuous Delivery isn’t about how fast you can deliver, it’s about the outcome your delivery achieves. Bryan Finster, author of the 5-minute DevOps series and founder of the DevOps Dojo, joined our Dev Interrupted Discord community to answer your questions about outcome-based development, continuous delivery, and why failing small is better than failing fast. 

Bryan is currently a Distinguished Engineer at Defense Unicorns but has also worked for Walmart as a systems analyst and eventually became a staff software engineer for Walmart Labs. He previously appeared on the Dev Interrupted Podcast to discuss these subjects in more depth, along with the most common pitfalls dev teams hit when trying to optimize their delivery process.

This Community AMA took place on January 8, 2021 on the Dev Interrupted Discord.

Necco-LB: 📢📢 Community AMA📢📢   @everyone 

Topic: Outcome-based Development with @BryanF (Bryan Finster)

Bryan, thanks for joining us today!

Bryan Finster: Thanks for having me!

col: Bryan... great quote. "A developer is a business expert who solves problems with code." Thank you. Tremendous concept.

Bryan Finster: Thanks. That's who we are. We aren't Java spewing legos. If we don't understand the business, the code won't.

Rocco Seyboth: YES!! @col Love it. @oriker says "a business decision is made with every line of code"

Bryan: Exactly. How does this change improve the bottom line. Even more, how does it improve the lives of our customers?

Necco-LB: We really enjoyed having you on the podcast to talk about Outcome-based development and what continuous delivery should be trying to achieve. I was hoping you could explain to us what Outcome-based development means?

Bryan: It's just focusing on the outcomes. It's pointless to focus on how we do things if the outcomes are poor. It's also about Hypothesis Driven Development. The act of defining the expected value before we attempt to deliver it and then measuring for that value. Instrumenting the application to see how close we get so we can adjust. I frequently see people just being feature factories, pounding out changes that no one needs. That just costs money and increases support. We should be deliberate about what we do and say "no" when the value isn't obvious.
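One way to picture the "define the expected value, then measure for it" loop is a small record written before the work starts and checked after delivery. This is only a sketch: the `Hypothesis` fields, the metric name and the three verdict labels are hypothetical, not something Bryan described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """An expected outcome written down before any code ships."""

    change: str
    metric: str
    baseline: float
    expected_min: float  # smallest observed value that counts as success


def evaluate(hypothesis: Hypothesis, observed: float) -> str:
    """Compare the measured value with what was predicted up front."""
    if observed >= hypothesis.expected_min:
        return "validated"
    if observed > hypothesis.baseline:
        return "partial"  # moved the metric, but less than expected
    return "rejected"


reorder = Hypothesis(
    change="one-click reorder button",
    metric="repeat_purchase_rate",
    baseline=0.12,
    expected_min=0.15,
)
print(evaluate(reorder, observed=0.13))
```

A "partial" or "rejected" verdict is the signal to adjust or stop, which is the point of instrumenting the application in the first place.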

Cocco: When it comes to delivering value to the customer sooner, what things do you commonly see teams worrying about that they perhaps shouldn't (or not worry about, when they should?)

Bryan: "I can't release this! It's not feature complete!" No, get the incomplete change out there and make sure it doesn't break anything.

Necco-LB: You mentioned during the podcast that Pride is the best metric ever. Can you explain that a little bit?

Bryan: If I own the business problem, own the solution, own how to make it better, own the outcomes and see people getting value from my work, then I have pride in what I do. I want it to be good. I want it to be secure and stable and I want to continuously improve it.

Necco-LB: When you talk about outcome-based development you often talk about the things that need to happen before hands touch the keyboard. What are some of those things?

Bryan: We need to understand the value we are trying to deliver and we need to define how we expect to deliver that value at the detail level. It's not enough to write a vague user story. We need testable outcomes that we agree should deliver that value. Behavior Driven Development is the most effective tool I've found for that. We also need to make sure we aren't trying to deliver ALL of the value at once. What if we are wrong? We usually are, statistically. So, what is the smallest, highest value thing we can deliver to find out? Sometimes the right answer is to stop at that point. Invest in the outcomes, not the plan or the work.
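For readers unfamiliar with Behavior Driven Development, a testable outcome is usually phrased as Given / When / Then. The sketch below expresses one such outcome as a plain Python test. The `Cart` class and its behavior are hypothetical and exist only to give the test something to exercise.

```python
class Cart:
    """Hypothetical domain object used only for illustration."""

    def __init__(self) -> None:
        self.items = {}  # sku -> (price_cents, quantity)

    def add(self, sku: str, price_cents: int, quantity: int = 1) -> None:
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        _, current = self.items.get(sku, (price_cents, 0))
        self.items[sku] = (price_cents, current + quantity)

    def total_cents(self) -> int:
        return sum(price * count for price, count in self.items.values())


def test_adding_the_same_item_twice_increases_quantity() -> None:
    # Given an empty cart
    cart = Cart()
    # When the shopper adds the same item twice
    cart.add("sku-123", price_cents=250)
    cart.add("sku-123", price_cents=250)
    # Then the cart holds two of that item and the total reflects both
    assert cart.items["sku-123"][1] == 2
    assert cart.total_cents() == 500


test_adding_the_same_item_twice_increases_quantity()
```

The value is in agreeing on the Given / When / Then wording before hands touch the keyboard. The test is just the executable form of that agreement.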

_____________________

Read the unedited AMA and join the discussion in the Dev Interrupted Discord. With over 2000 members, the Dev Interrupted Discord Community is the best place for engineering leaders to engage in daily conversation.

_____________________

Cocco: What patterns/trends do you see in teams who can deliver the outcomes they want? (Are there common factors in teams you've seen that move from struggling -> successful?)

Bryan: Yes. Actual continuous delivery and product ownership. They can deliver small changes daily and they have ownership of what those changes are. They have the safety to challenge things without fear and they are not pushed so hard that there is no time to think of better ideas. Software development is a mental activity, not typing.

Necco-LB: You work with a lot of different teams at the DevOps Dojo. What are some of the most common pitfalls preventing a team from optimizing their delivery process?

Bryan: They are given the wrong problems to solve. They are asked to solve stupid problems like "How many changes did you make today?" or "How many stories did you complete this sprint?" They don't know how to work as teams because they are incentivized to work in silos. So, requirements are poorly defined, testing suffers, speed suffers. They need to be solving the business problem. What is measured will change. Be careful what and how you measure.

Necco-LB: What are some first steps a team can take if they want to become more outcome focused?

Bryan: Focus on the business problem and get close to the user. Empathize with them and what value they need. This really applies to anything. If you don't respect your customer, you won't need to worry about them for very long.

Necco-LB: What is the role/responsibility of the developer in this outcome-based development model?

Bryan: On a good development team you have engineers and product ownership. Engineers ship working solutions. They know they are working because they tested them, delivered them, and observed that their tests were accurate.

Rocco Seyboth: In 5 Minute DevOps you talk about observing what high performing teams do then modeling other teams to the same process and behavior... how do you reconcile that with the belief that every team is different and should have the flexibility to do things their own way?

Bryan: Actually, I advocate against cookie cutter templating of teams in that post. We should standardize on improving outcomes.

Necco-LB: Friends, that's just about the top of the hour. Bryan has a real job that needs to get done, but feel free to keep the questions coming asynchronously throughout the day - he'll be popping in and out to answer them. Bryan - thank you so much for joining our community today and answering our questions!

Bryan: Just some contact links to leave and I want to thank everyone for the conversation. I love talking about these topics.
https://www.linkedin.com/in/bryan-finster/

https://bdfinst.medium.com/

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This AMA is based on an episode of Dev Interrupted - the go-to podcast for engineering leaders.


Following a recent interview on the Dev Interrupted Podcast, OutSystems CEO and founder Paulo Rosado joined us to chat about his path to founding the company, advice for successful leaders, and the growing threat of technical debt. The conversation below has been edited for length and clarity. 

_____________________

Tell us about OutSystems' founding story. What inspired you to start the company?

In February 2021, OutSystems was valued at $9.5 billion - but it certainly didn’t start out that way. The idea behind OutSystems was decades in the making, and its mission stems from what I observed after moving to Silicon Valley back in the mid-nineties.

My journey in technology began when I graduated with a degree in computer engineering from Universidade Nova de Lisboa in Lisbon, Portugal and moved to the US to get my Masters in Computer Science from Stanford. Afterward, while working in Silicon Valley, I began to understand just how much of a problem technical debt was. 

While working on a very large engineering team, we were faced with tackling a gigantic project in Java and I realized the issues of releasing and maintaining code sustainably. The lack of productivity in the software development process was appalling. Fixing this problem is ultimately what motivated me to found OutSystems. 

Before founding OutSystems, there was a small company I founded and later sold, which focused on internet and intranet projects. It wasn’t a bad company, but we kept failing. Projects were never delivered on time or on budget. 

We would think to ourselves, “We’re smart. How is this possible?” Our inclination was to blame the requirements of the project, labeling the scope as incorrect and adjusting from there. However, we began to realize that the companies hiring us for these projects wanted us to make changes as we were developing, in response to rapidly changing environments.

The issue we began to face was the continual accumulation of technical debt. We would reach first production and realize we had built something users didn’t want, requiring us to go back and rework the stuff we had just built. 

“We came up with this realization that the problem was not that the requirements up front were wrong. The problem was that the cost of changing wrong requirements, which are a fact of life, is very high.” - on the Dev Interrupted podcast at 6:03

 

This phenomenon was occurring in 90% of projects at the time. Things were always over budget and always late. 

Today, it’s easy to take this for granted because concepts like Agile, DevOps, CI/CD are mainstream. But at the time, you had to build software the same way you build a bridge.  

Why is technical debt a challenge for companies now? How has this problem changed?

Technical debt has become a large problem for businesses, and one that only compounds with time. Tech debt doesn’t have a singular cause - it’s the accumulation of several factors. 

Over the course of my career, I’ve seen first-hand the complexity brought about by the evolution of software development. For instance, we’ve seen an explosion of languages, paradigms and frameworks that can all be used to achieve a solution. Often these languages are dispersed with no connections between them, so tracking these dependencies requires a great deal of sophistication. 

In addition to this, turnover within the development team is a critical problem that leads to technical debt. The moment a company loses a developer, the knowledge accrued by that developer also departs the company. The hole left behind is complex: it includes code, frameworks and the intent behind how their systems are structured.

It’s been my experience that a lost team member can take as much as 20% to 30% of the fundamental knowledge of a system with them. Reverse engineering their work is both time-intensive and inefficient. 

Companies have tried to corral this problem by investing in coding standards. While these constraints can help mitigate the loss of a valued developer, our research indicates turnover remains a significant problem. 

OutSystems recently released a study on the effects of technical debt. What were its findings? 

Recently, OutSystems surveyed 500 large companies around the world to examine the cost of technical debt facing businesses and uncover the challenges companies face as they confront its causes. The results from the companies surveyed were many of the same things I’ve observed throughout my career. 

It’s important to note that while the causes of technical debt have largely remained the same, the pace at which technical debt occurs has grown substantially.

“And so it's a hack, right? What we call a hack at OutSystems, they did a hack to just release the software quickly. And those hacks compound into technical debt.” - on the Dev Interrupted podcast at 27:11

The survey we conducted isolated three major causes of technical debt. They are as follows: 

  1. The number of developer frameworks. An increase in frameworks leads to an increase in technical debt. 
  2. Developer erosion. Employees leaving an organization and taking legacy knowledge with them. 
  3. Compromises in quality of architecture and code. Often caused by a shortsighted view that what needs to be done now is more important than long-term stability of the codebase.

In the past, companies believed they could buy their way out of this problem, but that strategy has proven ineffective. The reality is, the most successful companies must build the software they require to meet their business needs. 

Simply purchasing what you need doesn’t solve your problems because even purchased systems must be cobbled together, requiring unique APIs, unique UIs, unique portals, and unique mobile applications.

Does OutSystems play a role in helping companies cut tech debt? 

The core of what we do at OutSystems is focused on tackling those three fundamental problems. We understand that technical debt amasses slowly over time, through a myriad of decisions that appear much smaller at their onset than their totality would suggest. Once these “tiny” decisions become a major problem, they inhibit investment in current operations and future innovations. 

The increasing pressures of today’s fast-paced business environment often push companies toward decisions that spiral into technical debt. The good news is that by creating a development process that marries short-term deadlines with long-term strategic goals, it’s possible to “pay down” that debt. 

I believe that any company is capable of whittling away technical debt with the correct tools and processes, and I founded OutSystems because companies shouldn’t have to choose between building fast and building right. 

To learn more about technical debt, how to combat it, and what to expect in the future, you can download the 2021 Technical Debt Report on our website.  

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is based on an episode of Dev Interrupted - the go-to podcast for engineering leaders.


Dan is the founder of Tellspin, an on-call scheduler in Slack for DevOps and developers (https://tellspin.app) that helps workspaces reduce their contact footprint, resolve incidents faster, and regain deep focus.

Code smell is a way to describe code that hasn’t aged well and has the potential for a lot of issues.

It usually is the source of a lot of hot fixes or workarounds keeping it functional. My most common reflex is to rewrite it. However, if I’m not careful, I’ll waste an entire day and not improve anything.

After a decade of programming, here are my 7 steps to reduce code smell gradually.

Step 0: Admit there is a problem

I start to recognize my code is smelly when I start saying things like “that time only took an hour.”

I’m usually doing something simple, like adding another field to a form or another schedule for a customer. I quickly add in code because it feels like the easiest thing to do and ship the feature. There are so many other things on my plate, I don’t have time for this, I’ll say to myself.

By the 5th or 6th hour I’ve hacked the same spot, I realize, had I rewritten it sooner, I would have actually saved time. 

Step 1: Identify spots to clean

Smelly code is so disorganized.

Is it really smelly or do I just not understand it? It’s very tempting to always default to a rewrite. If I write all the code, I’ll understand it. But who is to say the next person who looks at it will?

Similar to profiling code to identify the slowest spot, I work to identify the place that smells the most. Are there sections of the code that new devs are always struggling with? Are there frequent small changes that require touching lots of different files or methods?

Creating a list of smelly code helps identify which sections of code need the most attention.
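One rough way to start that list is to count how often each file changes, on the assumption that the files touched most often are good candidates for a closer look. The sketch below parses the output of `git log --name-only --pretty=format:`. The function names are hypothetical, and change frequency is only a proxy for smell, not a measurement of it.

```python
import subprocess
from collections import Counter
from typing import List, Tuple


def rank_hotspots(log_text: str, top: int = 5) -> List[Tuple[str, int]]:
    """Count how often each file path appears in `git log --name-only` output."""
    paths = [line.strip() for line in log_text.splitlines() if line.strip()]
    return Counter(paths).most_common(top)


def read_git_log(repo_dir: str = ".") -> str:
    """Return the changed file paths, one per line, for every commit."""
    result = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hand-written sample standing in for real git output.
sample = "app/forms.py\napp/models.py\n\napp/forms.py\n"
print(rank_hotspots(sample))  # [('app/forms.py', 2), ('app/models.py', 1)]
```

In a real repository, `rank_hotspots(read_git_log())` gives a first draft of the list, which still needs the human questions above: is this file really smelly, or just busy?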

Step 2: Pick the worst spot

Smelly code is like dirty dishes.

With a stack of dishes, I’ll plug my nose until I dispose of the rotting food that’s causing the stink. It was easy to blame the whole pile, but for the most part, all of the other dishes are fairly clean. They don’t need immediate attention. The rotting smell came from something I forgot to clean off when I was in a hurry.

When there is a piece of code that’s really rotten, it’s often hidden somewhere in the pile. Maybe an abstraction went too far, spreading a hundred lines of code across dozens of files.

I keep in mind that I need to fix the worst smell; most of the other code is good enough and doesn’t need my immediate attention. 

Step 3: Resist the urge to do everything

Smelly code is never-ending.

Perhaps the hardest part of improving a code base is scoping it to one thing. It’s so liberating to finally get a chance to clean up, that I can easily take it too far. I’ll think, “While I’m at it, I might as well clean up this… oh! and that other thing needs fixing too.” 

Resist! Do not do everything. 

If I try to tackle everything, I’m not going to finish. Even more likely, it’s not going to pass code review. It’s better to do one piece at a time - ya know, like eating an elephant. 

Step 4: Make sure it’s better

Smelly code has edge cases.

Inevitably, in the process of rewriting, I discover why the code was written that way in the first place. I might even stumble across a can of worms. At that point, I realize my not-so-dimwitted co-worker wasn’t as dumb as I thought (or even more likely, I discover I was the one who wrote the code originally 🤦‍♂️).

 After learning all the edge cases, I’ll be tempted to walk away.
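A lightweight way to confirm a rewrite is actually better, or at least not worse, is to run the old and new versions side by side on the edge cases you have uncovered. This sketch assumes both versions are pure functions, and the quantity-parsing example is hypothetical.

```python
from typing import Any, Callable, Iterable, List, Tuple


def find_mismatches(
    legacy: Callable[[Any], Any],
    rewritten: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Any, Any]]:
    """Return (input, legacy_result, rewritten_result) wherever the two disagree."""
    mismatches = []
    for value in inputs:
        old, new = legacy(value), rewritten(value)
        if old != new:
            mismatches.append((value, old, new))
    return mismatches


# Hypothetical example: the legacy code quietly clamped negative quantities to zero.
def legacy_parse_quantity(text: str) -> int:
    return max(int(text.strip() or "0"), 0)


def rewritten_parse_quantity(text: str) -> int:
    return int(text.strip() or "0")


print(find_mismatches(legacy_parse_quantity, rewritten_parse_quantity,
                      ["3", " 7 ", "", "-2"]))  # [('-2', 0, -2)]
```

Each mismatch is either a bug in the rewrite or an edge case the original handled on purpose. Either way, it is worth knowing before the change goes to code review.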

Step 5: Don’t immediately give up

Smelly code is messy to work with.

I’m frustrated imagining how far away the current code is from a better solution. I’ve got the code in my head, I know the edge cases, and I’ve got the context. It’s important not to give up as the solution may be right around the corner.

I keep thinking about it while I go for a walk. Maybe even take a break. Solutions often come to me while I’m on walks or in the shower.

Step 6: Use the co-worker bobblehead

Smelly code needs attention.

I steal my co-worker’s bobblehead and explain aloud what I’m doing. In the process, I figure out what I've missed or overlooked. 

If a bobblehead isn’t available, I resort to using my actual co-workers. (I’m checking my assumptions by walking them through what I’m thinking step by step.)

Step 7: Publish or throw in the towel

Smelly code can improve.

At the end of my steps I have a complete solution or I’m banging my head on the keyboard. If it’s the first, I push the change and take a breath of fresh air. If it’s the second, I commit it to a branch and plan to revisit another day. Sometimes we can’t have nice things.

Rinse and repeat

The depth I go into each step changes based on complexity or how critical the code is. Sometimes I can run through each of the steps in a few minutes, other times it’s spread out over a few weeks. It really depends on what I’m working on.

Running through these steps helps me gradually improve my code. There’s nothing better than finally getting a fix for some smelly code merged and into production. Sometimes we can have nice things.

Dan Willoughby is the founder of Tellspin, an on-call scheduler in Slack for DevOps and developers (https://tellspin.app) that helps workspaces reduce their contact footprint, resolve incidents faster, and regain deep focus.

_____________________

Starved for top-level software engineering content? Need some good tips on how to manage your team? This article is inspired by Dev Interrupted - the go-to podcast for engineering leaders.
